Class Tokenizer
The Tokenizer breaks down source text into tokens based on regular expression patterns called "terminators". Terminators define how to identify and handle token boundaries:
- PRESERVE - Return matched tokens for processing (useful for syntax elements you want to capture)
- DROP - Silently discard matched tokens (useful for whitespace, comments, or other separators)
Key features:
- Pattern-based token identification using regex
- Peek ahead without consuming tokens
- Unread tokens to backtrack
- Expect specific tokens (throws on mismatch)
- Group-based token categorization
Example usage:
Tokenizer tokenizer = new Tokenizer("hello, world! 123");
tokenizer.addTerminator(DROP, "\\s+"); // Drop whitespace
tokenizer.addTerminator(PRESERVE, "\\w+"); // Preserve words
tokenizer.addTerminator(PRESERVE, ","); // Preserve comma
tokenizer.addTerminator(PRESERVE, "!"); // Preserve exclamation
tokenizer.addTerminator(PRESERVE, "\\d+"); // Preserve numbers
while (tokenizer.hasMoreContent()) {
    TokenizerMatch match = tokenizer.getNextToken();
    System.out.println(match.token);
}
// Prints one token per line: hello, ",", world, "!", 123
The tokenizer maintains a history stack, allowing you to unread tokens and backtrack during parsing:
TokenizerMatch first = tokenizer.getNextToken();
TokenizerMatch second = tokenizer.getNextToken();
tokenizer.unreadToken(); // Go back one token
tokenizer.unreadToken(); // Go back another token
TokenizerMatch again = tokenizer.getNextToken(); // Same as first
You can also peek without consuming:
TokenizerMatch peeked = tokenizer.peekNextToken(); // Look ahead
TokenizerMatch actual = tokenizer.getNextToken(); // Same as peeked
Constructor Summary
Constructor | Description
Tokenizer() | Creates an empty tokenizer without a source string.
Tokenizer(String source) | Creates a new tokenizer for the specified source string.
Method Summary
Modifier and Type | Method | Description
Terminator | addTerminator(Terminator terminator) | Adds a pre-configured terminator to this tokenizer.
Terminator | addTerminator(Terminator.TerminationStrategy terminationStrategy, String regexp) | Adds a terminator with a termination strategy and regex pattern.
Terminator | addTerminator(Terminator.TerminationStrategy terminationStrategy, String regexp, String group) | Adds a terminator with a termination strategy, regex pattern, and group name.
boolean | consumeIfNextToken(String token) | Consumes the next token if it matches the expected value.
void | enlistRemainingTokens() | Prints all remaining tokens for debugging purposes.
void | expectAndConsumeNextStringToken(String value) | Consumes the next token and verifies it matches the expected value.
TokenizerMatch | expectAndConsumeNextTerminatorToken(Terminator terminator) | Consumes the next token and verifies it was matched by the expected terminator.
TokenizerMatch | findTerminatorMatch() | Finds a terminator that matches at the current position.
TokenizerMatch | getNextToken() | Returns the next token from the source string.
boolean | hasMoreContent() | Checks if there is more content to read.
void | peekExpectNoneOf(String... possibilities) | Verifies the next token is NOT one of the specified possibilities.
boolean | peekIsOneOf(String... possibilities) | Checks if the next token is one of the specified possibilities.
TokenizerMatch | peekNextToken() | Returns the next token without consuming it.
Tokenizer | setSource(String source) | Sets or replaces the source string to tokenize.
void | skipUntilDataEnd() | Skips to the end of the source string without consuming tokens.
void | unreadToken() | Unreads the most recently consumed token.
-
Constructor Details
-
Tokenizer
public Tokenizer(String source)
Creates a new tokenizer for the specified source string. The source string will be processed when getNextToken() is called. Add terminators before calling getNextToken() to define how tokens should be identified.
- Parameters:
source - the text to tokenize. May be null (use setSource(String) later).
-
Tokenizer
public Tokenizer()
Creates an empty tokenizer without a source string. Use setSource(String) to provide text for tokenization before calling getNextToken().
-
-
Method Details
-
setSource
public Tokenizer setSource(String source)
Sets or replaces the source string to tokenize. This resets the tokenizer state: the reading position is set to 0, and the token history stack is cleared. Use this to tokenize a new string with the same terminator configuration.
Example:
Tokenizer tokenizer = new Tokenizer();
tokenizer.addTerminator(DROP, "\\s+");
tokenizer.setSource("first string");
// tokenize first string...
tokenizer.setSource("second string");
// tokenize second string with same rules...
- Parameters:
source - the new text to tokenize. May be null.
- Returns:
this Tokenizer instance for fluent method chaining.
-
addTerminator
public Terminator addTerminator(Terminator.TerminationStrategy terminationStrategy, String regexp)
Adds a terminator with a termination strategy and regex pattern. The terminator will match tokens based on the regex pattern. The termination strategy determines whether matched tokens are preserved (returned) or dropped (silently discarded).
The pattern is anchored to match only at the current position (prepended with "^").
- Parameters:
terminationStrategy - how to handle matched tokens (PRESERVE or DROP).
regexp - the regex pattern to match tokens.
- Returns:
the created Terminator object, which can be further configured (e.g., setting the active flag or group).
-
addTerminator
public Terminator addTerminator(Terminator.TerminationStrategy terminationStrategy, String regexp, String group)
Adds a terminator with a termination strategy, regex pattern, and group name. The group name allows categorizing tokens by type, which can be checked using TokenizerMatch.isGroup(String).
Example:
tokenizer.addTerminator(PRESERVE, "\\d+", "number");
tokenizer.addTerminator(PRESERVE, "\\w+", "word");
TokenizerMatch match = tokenizer.getNextToken();
if (match.isGroup("number")) {
    // Handle number token...
}
- Parameters:
terminationStrategy - how to handle matched tokens (PRESERVE or DROP).
regexp - the regex pattern to match tokens.
group - the group name for categorizing this token type. May be null.
- Returns:
the created Terminator object.
-
addTerminator
public Terminator addTerminator(Terminator terminator)
Adds a pre-configured terminator to this tokenizer. Use this when you need to create a Terminator with custom configuration before adding it.
- Parameters:
terminator - the terminator to add. Must not be null.
- Returns:
the same terminator that was added.
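A sketch of the pre-configured form described above. The Terminator constructor signature and the setter name (setActive) are assumptions for illustration; the actual API may differ, although the documentation does mention an active flag and a group as configurable:

```java
// Hypothetical sketch: constructor and setter names are assumed, not confirmed.
Terminator lineComment = new Terminator(DROP, "//[^\n]*");
lineComment.setActive(false);          // start disabled; enable when comments matter
tokenizer.addTerminator(lineComment);  // returns the same terminator instance
```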
-
expectAndConsumeNextStringToken
public void expectAndConsumeNextStringToken(String value) throws InvalidSyntaxException
Consumes the next token and verifies it matches the expected value. This is a convenience method for parsing where you expect a specific token at a specific position. If the token doesn't match, an exception is thrown.
Example:
tokenizer.expectAndConsumeNextStringToken("if");
// Consumes the "if" token; throws if the next token is not "if"
- Parameters:
value - the expected token value. Must not be null.
- Throws:
InvalidSyntaxException - if the next token does not match the expected value.
-
expectAndConsumeNextTerminatorToken
public TokenizerMatch expectAndConsumeNextTerminatorToken(Terminator terminator) throws InvalidSyntaxException
Consumes the next token and verifies it was matched by the expected terminator. This is useful when you need to ensure a specific terminator matched the token, not just that the token has a certain value.
Example:
Terminator stringTerminator = tokenizer.addTerminator(PRESERVE, "\".*\"");
tokenizer.expectAndConsumeNextTerminatorToken(stringTerminator);
- Parameters:
terminator - the expected terminator that should have matched.
- Returns:
the TokenizerMatch containing the matched token.
- Throws:
InvalidSyntaxException - if the next token was matched by a different terminator.
-
getNextToken
public TokenizerMatch getNextToken()
Returns the next token from the source string. This method advances the reading position. The token is identified based on the configured terminators:
- If a PRESERVE terminator matches, the matched text is returned
- If a DROP terminator matches, it is discarded and the next token is sought
- If no terminator matches, characters accumulate until a terminator matches
Example:
TokenizerMatch match = tokenizer.getNextToken();
if (match != null) {
    System.out.println(match.token);
}
- Returns:
the next TokenizerMatch, or null if the end of the source string is reached.
-
findTerminatorMatch
public TokenizerMatch findTerminatorMatch()
Finds a terminator that matches at the current position. This checks all active terminators (in order) to see if any matches at the current index. The first matching terminator is returned.
Terminators with active = false are skipped.
- Returns:
a TokenizerMatch if a terminator matches, or null if no terminator matches at the current position.
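A small sketch of probing with this method. Whether findTerminatorMatch() leaves the reading position untouched is not stated in this documentation, so this assumes it is a non-consuming lookup:

```java
// Assumption: findTerminatorMatch() only inspects the current position.
TokenizerMatch probe = tokenizer.findTerminatorMatch();
if (probe == null) {
    // No active terminator fires here; getNextToken() would accumulate
    // characters until one does.
} else {
    System.out.println("Terminator matched: " + probe.token);
}
```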
-
hasMoreContent
public boolean hasMoreContent()
Checks if there is more content to read. Returns true if the current position is before the end of the source string. Note that even if this returns true, getNextToken() might return null if the remaining content is dropped by terminators.
- Returns:
true if there is more content, false if at the end of the source or the source is null.
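The note above has a practical consequence: a read loop should guard against a null token even while hasMoreContent() reports true. A sketch of a defensive loop:

```java
// Trailing content may be matched only by DROP terminators (e.g. whitespace),
// so hasMoreContent() can be true while getNextToken() yields null.
while (tokenizer.hasMoreContent()) {
    TokenizerMatch match = tokenizer.getNextToken();
    if (match == null) {
        break; // only droppable content remained
    }
    System.out.println(match.token);
}
```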
-
consumeIfNextToken
public boolean consumeIfNextToken(String token) throws InvalidSyntaxException
Consumes the next token if it matches the expected value. If the next token matches, it is consumed and true is returned. If it doesn't match, the token is unread and false is returned.
Example:
if (tokenizer.consumeIfNextToken("else")) {
    // Handle else clause
} else {
    // Token was not "else"; position unchanged
}
- Parameters:
token - the expected token value. Must not be null.
- Returns:
true if the next token matched and was consumed, false otherwise (position unchanged).
- Throws:
InvalidSyntaxException - if parsing fails.
-
peekNextToken
public TokenizerMatch peekNextToken() throws InvalidSyntaxException
Returns the next token without consuming it. This reads the next token and then immediately unreads it to restore the position. Use this to examine what's coming without advancing.
Example:
TokenizerMatch peeked = tokenizer.peekNextToken();
System.out.println("Next will be: " + peeked.token);
TokenizerMatch actual = tokenizer.getNextToken(); // Same as peeked
- Returns:
the next TokenizerMatch without advancing the position.
- Throws:
InvalidSyntaxException - if parsing fails.
-
peekIsOneOf
public boolean peekIsOneOf(String... possibilities) throws InvalidSyntaxException
Checks if the next token is one of the specified possibilities. This peeks at the next token and checks whether its value equals any of the given strings. The position is unchanged after this call.
Example:
if (tokenizer.peekIsOneOf("if", "else", "while")) {
    // Next token is a control keyword
}
- Parameters:
possibilities - the token values to check against. Must not be null or empty.
- Returns:
true if the next token matches one of the possibilities, false otherwise.
- Throws:
InvalidSyntaxException - if parsing fails.
-
peekExpectNoneOf
public void peekExpectNoneOf(String... possibilities) throws InvalidSyntaxException
Verifies that the next token is NOT one of the specified possibilities. If the next token matches any possibility, an exception is thrown. Use this for negative assertions in parsing.
Example:
tokenizer.peekExpectNoneOf("}", "end"); // Throws if the next token is "}" or "end"
- Parameters:
possibilities - the token values that should NOT appear next.
- Throws:
InvalidSyntaxException - if the next token matches any possibility.
-
unreadToken
public void unreadToken()
Unreads the most recently consumed token. This restores the reading position to before the last token was read. The token can be read again with getNextToken().
You can unread multiple times to backtrack further:
TokenizerMatch first = tokenizer.getNextToken();
TokenizerMatch second = tokenizer.getNextToken();
TokenizerMatch third = tokenizer.getNextToken();
tokenizer.unreadToken(); // Back to after second
tokenizer.unreadToken(); // Back to after first
TokenizerMatch again = tokenizer.getNextToken(); // Same as second
-
enlistRemainingTokens
public void enlistRemainingTokens()
Prints all remaining tokens for debugging purposes. This reads and prints all remaining tokens without permanently consuming them. After printing, the position is restored to the original location.
Output is printed to stdout, one token per line.
-
skipUntilDataEnd
public void skipUntilDataEnd()
Skips to the end of the source string without consuming tokens. This advances directly to the end, skipping all remaining content. After calling this, hasMoreContent() will return false.
The current position is saved on the stack, so you can unread to restore it if needed.
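The save-and-restore behavior described above can be sketched as follows; this relies only on the documented statement that the skipped-from position is saved on the stack and can be restored by unreading:

```java
tokenizer.skipUntilDataEnd();
// hasMoreContent() is now false: everything after the saved position is skipped.
tokenizer.unreadToken();
// The saved position is restored; parsing can resume from where
// skipUntilDataEnd() was called.
```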
-