Class Tokenizer

java.lang.Object
eu.svjatoslav.commons.string.tokenizer.Tokenizer

public class Tokenizer extends Object
A regex-based tokenizer for parsing structured text into tokens.

The Tokenizer breaks down source text into tokens based on regular expression patterns called "terminators". Terminators define how to identify and handle token boundaries:

  • PRESERVE - Return matched tokens for processing (useful for syntax elements you want to capture)
  • DROP - Silently discard matched tokens (useful for whitespace, comments, or other separators)

Key features:

  • Pattern-based token identification using regex
  • Peek ahead without consuming tokens
  • Unread tokens to backtrack
  • Expect specific tokens (throws on mismatch)
  • Group-based token categorization

Example usage:


 Tokenizer tokenizer = new Tokenizer("hello, world! 123");
 tokenizer.addTerminator(DROP, "\\s+");        // Drop whitespace
 tokenizer.addTerminator(PRESERVE, "\\w+");    // Preserve words
 tokenizer.addTerminator(PRESERVE, ",");       // Preserve comma
 tokenizer.addTerminator(PRESERVE, "!");       // Preserve exclamation
 tokenizer.addTerminator(PRESERVE, "\\d+");    // Preserve numbers

 while (tokenizer.hasMoreContent()) {
     TokenizerMatch match = tokenizer.getNextToken();
     System.out.println(match.token);
 }
 // Prints one token per line: hello  ,  world  !  123
 

The tokenizer maintains a history stack, allowing you to unread tokens and backtrack during parsing:


 TokenizerMatch first = tokenizer.getNextToken();
 TokenizerMatch second = tokenizer.getNextToken();
 tokenizer.unreadToken();  // Go back one token
 tokenizer.unreadToken();  // Go back another token
 TokenizerMatch again = tokenizer.getNextToken();  // Same as first
 

You can also peek without consuming:


 TokenizerMatch peeked = tokenizer.peekNextToken();  // Look ahead
 TokenizerMatch actual = tokenizer.getNextToken();   // Same as peeked
 
  • Constructor Details

    • Tokenizer

      public Tokenizer(String source)
      Creates a new tokenizer for the specified source string.

      The source string will be processed when getNextToken() is called. Add terminators before calling getNextToken() to define how tokens should be identified.

      Parameters:
      source - the text to tokenize. May be null (use setSource later).
    • Tokenizer

      public Tokenizer()
      Creates an empty tokenizer without a source string.

      Use setSource(String) to provide text for tokenization before calling getNextToken().

  • Method Details

    • setSource

      public Tokenizer setSource(String source)
      Sets or replaces the source string to tokenize.

      This resets the tokenizer state: the reading position is set to 0, and the token history stack is cleared. Use this to tokenize a new string with the same terminator configuration.

      Example:

      
       Tokenizer tokenizer = new Tokenizer();
       tokenizer.addTerminator(DROP, "\\s+");
      
       tokenizer.setSource("first string");
       // tokenize first string...
      
       tokenizer.setSource("second string");
       // tokenize second string with same rules...
       
      Parameters:
      source - the new text to tokenize. May be null.
      Returns:
      this Tokenizer instance for fluent method chaining.
    • addTerminator

      public Terminator addTerminator(Terminator.TerminationStrategy terminationStrategy, String regexp)
      Adds a terminator with a termination strategy and regex pattern.

      The terminator will match tokens based on the regex pattern. The termination strategy determines whether matched tokens are preserved (returned) or dropped (silently discarded).

      The pattern is anchored to match only at the current position (prepended with "^").

      Parameters:
      terminationStrategy - how to handle matched tokens (PRESERVE or DROP).
      regexp - the regex pattern to match tokens.
      Returns:
      the created Terminator object, which can be further configured (e.g., setting the active flag or group).
    • addTerminator

      public Terminator addTerminator(Terminator.TerminationStrategy terminationStrategy, String regexp, String group)
      Adds a terminator with a termination strategy, regex pattern, and group name.

      The group name allows categorizing tokens by type, which can be checked using TokenizerMatch.isGroup(String).

      Example:

      
       tokenizer.addTerminator(PRESERVE, "\\d+", "number");
       tokenizer.addTerminator(PRESERVE, "\\w+", "word");
      
       TokenizerMatch match = tokenizer.getNextToken();
       if (match.isGroup("number")) {
           // Handle number token...
       }
       
      Parameters:
      terminationStrategy - how to handle matched tokens (PRESERVE or DROP).
      regexp - the regex pattern to match tokens.
      group - the group name for categorizing this token type. May be null.
      Returns:
      the created Terminator object.
    • addTerminator

      public Terminator addTerminator(Terminator terminator)
      Adds a pre-configured terminator to this tokenizer.

      Use this when you need to create a Terminator with custom configuration before adding it.

      Parameters:
      terminator - the terminator to add. Must not be null.
      Returns:
      the same terminator that was added.
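
      For illustration, a sketch of pre-configuring a terminator before adding it. The Terminator constructor and setter shown here are assumptions based on the configuration options mentioned above (active flag, group); consult the Terminator class for the actual API:

      
       // Hypothetical sketch: build a Terminator, configure it, then register it.
       Terminator lineComment = new Terminator(DROP, "//[^\\n]*");  // assumed constructor
       lineComment.setActive(false);            // assumed setter; start disabled
       tokenizer.addTerminator(lineComment);    // returns the same instance
       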
    • expectAndConsumeNextStringToken

      public void expectAndConsumeNextStringToken(String value) throws InvalidSyntaxException
      Consumes the next token and verifies it matches the expected value.

      This is a convenience method for parsing where you expect a specific token at a specific position. If the token doesn't match, an exception is thrown.

      Example:

      
       tokenizer.expectAndConsumeNextStringToken("if");
       // Consumes "if" token, throws if next token is not "if"
       
      Parameters:
      value - the expected token value. Must not be null.
      Throws:
      InvalidSyntaxException - if the next token does not match the expected value.
    • expectAndConsumeNextTerminatorToken

      public TokenizerMatch expectAndConsumeNextTerminatorToken(Terminator terminator) throws InvalidSyntaxException
      Consumes the next token and verifies it was matched by the expected terminator.

      This is useful when you need to ensure a specific terminator matched the token, not just that the token has a certain value.

      Example:

      
       Terminator stringTerminator = tokenizer.addTerminator(PRESERVE, "\".*\"");
       tokenizer.expectAndConsumeNextTerminatorToken(stringTerminator);
       
      Parameters:
      terminator - the expected terminator that should have matched.
      Returns:
      the TokenizerMatch containing the matched token.
      Throws:
      InvalidSyntaxException - if the next token was matched by a different terminator.
    • getNextToken

      public TokenizerMatch getNextToken()
      Returns the next token from the source string.

      This method advances the reading position. The token is identified based on the configured terminators:

      • If a PRESERVE terminator matches, that matched text is returned
      • If a DROP terminator matches, it is discarded and the next token is sought
      • If no terminator matches, characters accumulate until a terminator matches

      Example:

      
       TokenizerMatch match = tokenizer.getNextToken();
       if (match != null) {
           System.out.println(match.token);
       }
       
      Returns:
      the next TokenizerMatch, or null if the end of the source string is reached.
    • findTerminatorMatch

      public TokenizerMatch findTerminatorMatch()
      Finds a terminator that matches at the current position.

      This checks all active terminators (in order) to see if any matches at the current index. The first matching terminator is returned.

      Terminators with active = false are skipped.

      Returns:
      a TokenizerMatch if a terminator matches, or null if no terminator matches at the current position.
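
      A minimal sketch of probing the current position directly, using the classes documented on this page:

      
       // Check whether any active terminator matches at the current position.
       TokenizerMatch match = tokenizer.findTerminatorMatch();
       if (match != null) {
           System.out.println("Terminator matched: " + match.token);
       }
       // If null, no terminator matches here; getNextToken() would
       // accumulate characters until one does.
       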
    • hasMoreContent

      public boolean hasMoreContent()
      Checks if there is more content to read.

      Returns true if the current position is before the end of the source string. Note that even if this returns true, getNextToken() might return null if remaining content is dropped by terminators.

      Returns:
      true if there is more content, false if at the end of source or source is null.
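
      For example, trailing whitespace counts as content but yields no token when a DROP terminator consumes it (a sketch using the API above):

      
       Tokenizer tokenizer = new Tokenizer("word   ");
       tokenizer.addTerminator(DROP, "\\s+");
       tokenizer.addTerminator(PRESERVE, "\\w+");

       tokenizer.getNextToken();          // "word"
       if (tokenizer.hasMoreContent()) {
           // Still true: the trailing spaces remain in the source...
           TokenizerMatch match = tokenizer.getNextToken();
           // ...but they are dropped, so match may be null here.
       }
       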
    • consumeIfNextToken

      public boolean consumeIfNextToken(String token) throws InvalidSyntaxException
      Consumes the next token if it matches the expected value.

      If the next token matches, it is consumed and true is returned. If it doesn't match, the token is unread and false is returned.

      Example:

      
       if (tokenizer.consumeIfNextToken("else")) {
           // Handle else clause
       } else {
           // Token was not "else", position unchanged
       }
       
      Parameters:
      token - the expected token value. Must not be null.
      Returns:
      true if the next token matched and was consumed, false otherwise (position unchanged).
      Throws:
      InvalidSyntaxException - if parsing fails.
    • peekNextToken

      public TokenizerMatch peekNextToken() throws InvalidSyntaxException
      Returns the next token without consuming it.

      This looks ahead at the next token and returns it, then immediately unreads it to restore the position. Use this to examine what's coming without advancing.

      Example:

      
       TokenizerMatch peeked = tokenizer.peekNextToken();
       System.out.println("Next will be: " + peeked.token);
       TokenizerMatch actual = tokenizer.getNextToken();  // Same as peeked
       
      Returns:
      the next TokenizerMatch without advancing the position.
      Throws:
      InvalidSyntaxException - if parsing fails.
    • peekIsOneOf

      public boolean peekIsOneOf(String... possibilities) throws InvalidSyntaxException
      Checks if the next token is one of the specified possibilities.

      This peeks at the next token and checks if its value equals any of the given strings. The position is unchanged after this call.

      Example:

      
       if (tokenizer.peekIsOneOf("if", "else", "while")) {
           // Next token is a control keyword
       }
       
      Parameters:
      possibilities - the token values to check against. Must not be null or empty.
      Returns:
      true if the next token matches one of the possibilities, false otherwise.
      Throws:
      InvalidSyntaxException - if parsing fails.
    • peekExpectNoneOf

      public void peekExpectNoneOf(String... possibilities) throws InvalidSyntaxException
      Verifies the next token is NOT one of the specified possibilities.

      If the next token matches any possibility, an exception is thrown. Use this for negative assertions in parsing.

      Example:

      
        tokenizer.peekExpectNoneOf("}", "end");
        // Throws if next token is "}" or "end"
        
      Parameters:
      possibilities - the token values that should NOT appear next.
      Throws:
      InvalidSyntaxException - if the next token matches any possibility.
    • unreadToken

      public void unreadToken()
      Unreads the most recently consumed token.

      This restores the reading position to before the last token was read. The token can be read again with getNextToken().

      You can unread multiple times to backtrack further:

      
       TokenizerMatch first = tokenizer.getNextToken();
       TokenizerMatch second = tokenizer.getNextToken();
       TokenizerMatch third = tokenizer.getNextToken();
      
       tokenizer.unreadToken();  // Back to after second
       tokenizer.unreadToken();  // Back to after first
      
       TokenizerMatch again = tokenizer.getNextToken();  // Same as second
       
    • enlistRemainingTokens

      public void enlistRemainingTokens()
      Prints all remaining tokens for debugging purposes.

      This reads and prints all remaining tokens without permanently consuming them. After printing, the position is restored to the original location.

      Output is printed to stdout with each token on a new line.
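
      A typical debugging sketch:

      
       // Dump the remaining tokens to stdout, then continue parsing normally:
       tokenizer.enlistRemainingTokens();               // position is restored afterwards
       TokenizerMatch next = tokenizer.getNextToken();  // parsing resumes unaffected
       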

    • skipUntilDataEnd

      public void skipUntilDataEnd()
      Skips to the end of the source string without consuming tokens.

      This advances directly to the end, skipping all remaining content. After calling this, hasMoreContent() will return false.

      The current position is saved on the stack, so you can unread to restore it if needed.
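
      A sketch of skipping to the end and then backtracking, based on the behavior described above:

      
       tokenizer.skipUntilDataEnd();
       // hasMoreContent() now returns false

       // The previous position was pushed on the history stack,
       // so it can be restored:
       tokenizer.unreadToken();
       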