Class Tokenizer

java.lang.Object
eu.svjatoslav.commons.string.tokenizer.Tokenizer

public class Tokenizer extends Object
A regex-based tokenizer for parsing structured text into tokens.

The Tokenizer breaks down source text into tokens based on regular expression patterns called "terminators". Terminators define how to identify and handle token boundaries:

  • PRESERVE - Return matched tokens for processing (useful for syntax elements you want to capture)
  • DROP - Silently discard matched tokens (useful for whitespace, comments, or other separators)

Key features:

  • Pattern-based token identification using regex
  • Peek ahead without consuming tokens
  • Unread tokens to backtrack
  • Expect specific tokens (throws on mismatch)
  • Group-based token categorization

Example usage:


 Tokenizer tokenizer = new Tokenizer("hello, world! 123");
 tokenizer.addTerminator(DROP, "\\s+");        // Drop whitespace
 tokenizer.addTerminator(PRESERVE, "\\w+");    // Preserve words
 tokenizer.addTerminator(PRESERVE, ",");       // Preserve comma
 tokenizer.addTerminator(PRESERVE, "!");       // Preserve exclamation
 tokenizer.addTerminator(PRESERVE, "\\d+");    // Preserve numbers

 while (tokenizer.hasMoreContent()) {
     TokenizerMatch match = tokenizer.getNextToken();
     System.out.println(match.token);
 }
 // Prints one token per line: hello  ,  world  !  123
 

The tokenizer maintains a history stack, allowing you to unread tokens and backtrack during parsing:


 TokenizerMatch first = tokenizer.getNextToken();
 TokenizerMatch second = tokenizer.getNextToken();
 tokenizer.unreadToken();  // Go back one token
 tokenizer.unreadToken();  // Go back another token
 TokenizerMatch again = tokenizer.getNextToken();  // Same as first
 

You can also peek without consuming:


 TokenizerMatch peeked = tokenizer.peekNextToken();  // Look ahead
 TokenizerMatch actual = tokenizer.getNextToken();   // Same as peeked
 
  • Constructor Details

    • Tokenizer

      public Tokenizer(String source)
      Creates a new tokenizer for the specified source string.

      The source string will be processed when getNextToken() is called. Add terminators before calling getNextToken() to define how tokens should be identified.

      Parameters:
      source - the text to tokenize. May be null (use setSource later).
    • Tokenizer

      public Tokenizer()
      Creates an empty tokenizer without a source string.

      Use setSource(String) to provide text for tokenization before calling getNextToken().

  • Method Details

    • setSource

      public Tokenizer setSource(String source)
      Sets or replaces the source string to tokenize.

      This resets the tokenizer state: the reading position is set to 0, and the token history stack is cleared. Use this to tokenize a new string with the same terminator configuration.

      Example:

      
       Tokenizer tokenizer = new Tokenizer();
       tokenizer.addTerminator(DROP, "\\s+");
      
       tokenizer.setSource("first string");
       // tokenize first string...
      
       tokenizer.setSource("second string");
       // tokenize second string with same rules...
       
      Parameters:
      source - the new text to tokenize. May be null.
      Returns:
      this Tokenizer instance for fluent method chaining.
    • addTerminator

      public Terminator addTerminator(Terminator.TerminationStrategy terminationStrategy, String regexp)
      Adds a terminator with a termination strategy and regex pattern.

      The terminator will match tokens based on the regex pattern. The termination strategy determines whether matched tokens are preserved (returned) or dropped (silently discarded).

      The pattern is anchored to match only at the current position (prepended with "^").

      Parameters:
      terminationStrategy - how to handle matched tokens (PRESERVE or DROP).
      regexp - the regex pattern to match tokens.
      Returns:
      the created Terminator object, which can be further configured (e.g., setting the active flag or group).
    • addTerminator

      public Terminator addTerminator(Terminator.TerminationStrategy terminationStrategy, String regexp, String group)
      Adds a terminator with a termination strategy, regex pattern, and group name.

      The group name allows categorizing tokens by type, which can be checked using TokenizerMatch.isGroup(String).

      Example:

      
       tokenizer.addTerminator(PRESERVE, "\\d+", "number");
       tokenizer.addTerminator(PRESERVE, "\\w+", "word");
      
       TokenizerMatch match = tokenizer.getNextToken();
       if (match.isGroup("number")) {
           // Handle number token...
       }
       
      Parameters:
      terminationStrategy - how to handle matched tokens (PRESERVE or DROP).
      regexp - the regex pattern to match tokens.
      group - the group name for categorizing this token type. May be null.
      Returns:
      the created Terminator object.
    • addTerminator

      public Terminator addTerminator(Terminator terminator)
      Adds a pre-configured terminator to this tokenizer.

      Use this when you need to create a Terminator with custom configuration before adding it.

      Parameters:
      terminator - the terminator to add. Must not be null.
      Returns:
      the same terminator that was added.
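
      For illustration, a sketch of pre-configuring a terminator before adding it. The Terminator constructor and setter shown here are assumptions based on the configuration options mentioned above (active flag, group); consult the Terminator class for the actual API:

      
       // Hypothetical sketch: build a Terminator, configure it, then register it.
       Terminator lineComment = new Terminator(DROP, "//[^\\n]*");  // assumed constructor
       lineComment.setActive(false);            // assumed setter; start disabled
       tokenizer.addTerminator(lineComment);    // returns the same instance
       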
    • expectAndConsumeNextStringToken

      public void expectAndConsumeNextStringToken(String value) throws InvalidSyntaxException
      Consumes the next token and verifies it matches the expected value.

      This is a convenience method for parsing where you expect a specific token at a specific position. If the token doesn't match, an exception is thrown.

      Example:

      
       tokenizer.expectAndConsumeNextStringToken("if");
       // Consumes "if" token, throws if next token is not "if"
       
      Parameters:
      value - the expected token value. Must not be null.
      Throws:
      InvalidSyntaxException - if the next token does not match the expected value.
    • expectAndConsumeNextTerminatorToken

      public TokenizerMatch expectAndConsumeNextTerminatorToken(Terminator terminator) throws InvalidSyntaxException
      Consumes the next token and verifies it was matched by the expected terminator.

      This is useful when you need to ensure a specific terminator matched the token, not just that the token has a certain value.

      Example:

      
       Terminator stringTerminator = tokenizer.addTerminator(PRESERVE, "\".*\"");
       tokenizer.expectAndConsumeNextTerminatorToken(stringTerminator);
       
      Parameters:
      terminator - the expected terminator that should have matched.
      Returns:
      the TokenizerMatch containing the matched token.
      Throws:
      InvalidSyntaxException - if the next token was matched by a different terminator.
    • getNextToken

      public TokenizerMatch getNextToken()
      Returns the next token from the source string.

      This method advances the reading position. The token is identified based on the configured terminators:

      • If a PRESERVE terminator matches, that matched text is returned
      • If a DROP terminator matches, it is discarded and the next token is sought
      • If no terminator matches, characters accumulate until a terminator matches

      Example:

      
       TokenizerMatch match = tokenizer.getNextToken();
       if (match != null) {
           System.out.println(match.token);
       }
       
      Returns:
      the next TokenizerMatch, or null if the end of the source string is reached.
    • findTerminatorMatch

      public TokenizerMatch findTerminatorMatch()
      Finds a terminator that matches at the current position.

      This checks all active terminators (in order) to see if any matches at the current index. The first matching terminator is returned.

      Terminators with active = false are skipped.

      Returns:
      a TokenizerMatch if a terminator matches, or null if no terminator matches at the current position.
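
      A minimal sketch of probing the current position directly, using the classes documented on this page:

      
       // Check whether any active terminator matches at the current position.
       TokenizerMatch match = tokenizer.findTerminatorMatch();
       if (match != null) {
           System.out.println("Terminator matched: " + match.token);
       }
       // If null, no terminator matches here; getNextToken() would
       // accumulate characters until one does.
       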
    • hasMoreContent

      public boolean hasMoreContent()
      Checks if there is more content to read.

      Returns true if the current position is before the end of the source string. Note that even if this returns true, getNextToken() might return null if remaining content is dropped by terminators.

      Returns:
      true if there is more content, false if at the end of source or source is null.
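
      For example, trailing whitespace counts as content but yields no token when a DROP terminator consumes it (a sketch using the API above):

      
       Tokenizer tokenizer = new Tokenizer("word   ");
       tokenizer.addTerminator(DROP, "\\s+");
       tokenizer.addTerminator(PRESERVE, "\\w+");

       tokenizer.getNextToken();          // "word"
       if (tokenizer.hasMoreContent()) {
           // Still true: the trailing spaces remain in the source...
           TokenizerMatch match = tokenizer.getNextToken();
           // ...but they are dropped, so match may be null here.
       }
       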
    • consumeIfNextToken

      public boolean consumeIfNextToken(String token) throws InvalidSyntaxException
      Consumes the next token if it matches the expected value.

      If the next token matches, it is consumed and true is returned. If it doesn't match, the token is unread and false is returned.

      Example:

      
       if (tokenizer.consumeIfNextToken("else")) {
           // Handle else clause
       } else {
           // Token was not "else", position unchanged
       }
       
      Parameters:
      token - the expected token value. Must not be null.
      Returns:
      true if the next token matched and was consumed, false otherwise (position unchanged).
      Throws:
      InvalidSyntaxException - if parsing fails.
    • peekNextToken

      public TokenizerMatch peekNextToken() throws InvalidSyntaxException
      Returns the next token without consuming it.

      This looks ahead at the next token and returns it, then immediately unreads it to restore the position. Use this to examine what's coming without advancing.

      Example:

      
       TokenizerMatch peeked = tokenizer.peekNextToken();
       System.out.println("Next will be: " + peeked.token);
       TokenizerMatch actual = tokenizer.getNextToken();  // Same as peeked
       
      Returns:
      the next TokenizerMatch without advancing the position.
      Throws:
      InvalidSyntaxException - if parsing fails.
    • peekIsOneOf

      public boolean peekIsOneOf(String... possibilities) throws InvalidSyntaxException
      Checks if the next token is one of the specified possibilities.

      This peeks at the next token and checks if its value equals any of the given strings. The position is unchanged after this call.

      Example:

      
       if (tokenizer.peekIsOneOf("if", "else", "while")) {
           // Next token is a control keyword
       }
       
      Parameters:
      possibilities - the token values to check against. Must not be null or empty.
      Returns:
      true if the next token matches one of the possibilities, false otherwise.
      Throws:
      InvalidSyntaxException - if parsing fails.
    • peekExpectNoneOf

      public void peekExpectNoneOf(String... possibilities) throws InvalidSyntaxException
      Verifies the next token is NOT one of the specified possibilities.

      If the next token matches any possibility, an exception is thrown. Use this for negative assertions in parsing.

      Example:

      
        tokenizer.peekExpectNoneOf("}", "end");
        // Throws if next token is "}" or "end"
        
      Parameters:
      possibilities - the token values that should NOT appear next.
      Throws:
      InvalidSyntaxException - if the next token matches any possibility.
    • unreadToken

      public void unreadToken()
      Unreads the most recently consumed token.

      This restores the reading position to before the last token was read. The token can be read again with getNextToken().

      You can unread multiple times to backtrack further:

      
       TokenizerMatch first = tokenizer.getNextToken();
       TokenizerMatch second = tokenizer.getNextToken();
       TokenizerMatch third = tokenizer.getNextToken();
      
       tokenizer.unreadToken();  // Back to after second
       tokenizer.unreadToken();  // Back to after first
      
       TokenizerMatch again = tokenizer.getNextToken();  // Same as second
       
    • enlistRemainingTokens

      public void enlistRemainingTokens()
      Prints all remaining tokens for debugging purposes.

      This reads and prints all remaining tokens without permanently consuming them. After printing, the position is restored to the original location.

      Output is printed to stdout with each token on a new line.
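
      A typical debugging sketch:

      
       // Dump the remaining tokens to stdout, then continue parsing normally:
       tokenizer.enlistRemainingTokens();               // position is restored afterwards
       TokenizerMatch next = tokenizer.getNextToken();  // parsing resumes unaffected
       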

    • skipUntilDataEnd

      public void skipUntilDataEnd()
      Skips to the end of the source string without consuming tokens.

      This advances directly to the end, skipping all remaining content. After calling this, hasMoreContent() will return false.

      The current position is saved on the stack, so you can unread to restore it if needed.
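
      A sketch of skipping to the end and then backtracking, based on the behavior described above:

      
       tokenizer.skipUntilDataEnd();
       // hasMoreContent() now returns false

       // The previous position was pushed on the history stack,
       // so it can be restored:
       tokenizer.unreadToken();
       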