Package eu.svjatoslav.commons.string.tokenizer


Provides a regex-based tokenizer for parsing structured text.

This package contains a flexible tokenizer system for breaking down text into tokens based on regular expression patterns:

  • Tokenizer - Main tokenizer class that processes source text and extracts tokens
  • Terminator - Defines token boundaries using regex patterns with configurable handling strategies
  • TokenizerMatch - Result object containing matched token and metadata
  • InvalidSyntaxException - Exception thrown when parsing fails

The tokenizer supports two termination strategies:

  • PRESERVE - Return the text matched by the terminator as a token for processing
  • DROP - Silently discard the text matched by the terminator (useful for whitespace and comments)

Example usage:


 // Assumes DROP and PRESERVE are statically imported from the termination strategy enum
 Tokenizer tokenizer = new Tokenizer("hello, world!");
 tokenizer.addTerminator(DROP, "\\s+");        // Drop whitespace
 tokenizer.addTerminator(PRESERVE, ",");       // Preserve commas
 while (tokenizer.hasMoreContent()) {
     TokenizerMatch match = tokenizer.getNextToken();
     System.out.println(match.token);
 }
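
The effect of the two strategies on the input above can be illustrated with a self-contained sketch built only on java.util.regex. It is not this package's implementation: the names TerminatorSketch, Strategy, Rule and tokenize are hypothetical stand-ins, and the real Tokenizer may differ in details such as rule precedence or error handling. The sketch assumes that a terminator always ends the ordinary text collected before it, and that the first matching rule wins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TerminatorSketch {

    // Hypothetical stand-in for the package's two termination strategies.
    enum Strategy { PRESERVE, DROP }

    // Hypothetical stand-in for a terminator: a regex plus a strategy.
    static final class Rule {
        final Strategy strategy;
        final Pattern pattern;

        Rule(Strategy strategy, String regex) {
            this.strategy = strategy;
            this.pattern = Pattern.compile(regex);
        }
    }

    static List<String> tokenize(String source, List<Rule> rules) {
        List<String> tokens = new ArrayList<>();
        StringBuilder pending = new StringBuilder(); // ordinary text seen so far
        int pos = 0;
        while (pos < source.length()) {
            Rule hit = null;
            int end = pos;
            // Assumption: the first rule that matches at the current position wins.
            for (Rule rule : rules) {
                Matcher m = rule.pattern.matcher(source);
                m.region(pos, source.length());
                if (m.lookingAt() && m.end() > pos) { // ignore zero-length matches
                    hit = rule;
                    end = m.end();
                    break;
                }
            }
            if (hit == null) {
                // No terminator here: the character belongs to ordinary text.
                pending.append(source.charAt(pos));
                pos++;
                continue;
            }
            // A terminator closes the ordinary text collected before it.
            if (pending.length() > 0) {
                tokens.add(pending.toString());
                pending.setLength(0);
            }
            // PRESERVE emits the matched text as a token, DROP discards it.
            if (hit.strategy == Strategy.PRESERVE) {
                tokens.add(source.substring(pos, end));
            }
            pos = end;
        }
        if (pending.length() > 0) {
            tokens.add(pending.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule(Strategy.DROP, "\\s+"),     // drop whitespace
                new Rule(Strategy.PRESERVE, ","));   // keep commas as tokens
        for (String token : tokenize("hello, world!", rules)) {
            System.out.println(token);
        }
    }
}
```

Under these assumptions the sketch yields three tokens for "hello, world!": hello, the preserved comma, and world!. The whitespace separates tokens but never appears in the result.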
 
Since:
1.0
Author:
Svjatoslav Agejenko
Classes:

  • InvalidSyntaxException - Exception thrown when token parsing encounters unexpected content.
  • Terminator - Defines a token boundary using a regular expression pattern.
  • Termination strategy enum (PRESERVE, DROP) - Defines how matched tokens are handled by the tokenizer.
  • Tokenizer - A regex-based tokenizer for parsing structured text into tokens.
  • TokenizerMatch - Represents a matched token from the tokenizer.