Package eu.svjatoslav.commons.string.tokenizer
Provides a regex-based tokenizer for parsing structured text.
This package contains a flexible tokenizer system for breaking down text into tokens based on regular expression patterns:
- Tokenizer - Main tokenizer class that processes source text and extracts tokens
- Terminator - Defines token boundaries using regex patterns with configurable handling strategies
- TokenizerMatch - Result object containing the matched token and its metadata
- InvalidSyntaxException - Exception thrown when parsing fails
The tokenizer supports two termination strategies:
- PRESERVE - Return matched tokens for processing
- DROP - Silently drop matched tokens (useful for whitespace and comments)
Example usage:
// DROP and PRESERVE are assumed to be in scope here (for example via a static import).
Tokenizer tokenizer = new Tokenizer("hello, world!");
tokenizer.addTerminator(DROP, "\\s+");    // Drop whitespace
tokenizer.addTerminator(PRESERVE, ",");   // Preserve commas
while (tokenizer.hasMoreContent()) {
    TokenizerMatch match = tokenizer.getNextToken();
    System.out.println(match.token);
}
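To make the difference between the two strategies concrete, the following is a minimal standalone sketch written only against java.util.regex. It is not this package's implementation: MiniTokenizerSketch, Strategy, Rule and addRule are hypothetical names introduced for illustration, and the matching order (first rule that matches at the current position wins) is an assumption, not documented behaviour of Tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the eu.svjatoslav.commons implementation.
class MiniTokenizerSketch {

    // Stand-in for the two termination strategies described above.
    public enum Strategy { PRESERVE, DROP }

    // Stand-in for a terminator: a compiled regex plus its strategy.
    private static final class Rule {
        final Strategy strategy;
        final Pattern pattern;

        Rule(Strategy strategy, Pattern pattern) {
            this.strategy = strategy;
            this.pattern = pattern;
        }
    }

    private final String source;
    private final List<Rule> rules = new ArrayList<>();

    public MiniTokenizerSketch(String source) {
        this.source = source;
    }

    public MiniTokenizerSketch addRule(Strategy strategy, String regex) {
        rules.add(new Rule(strategy, Pattern.compile(regex)));
        return this;
    }

    // Splits the source into tokens. Text between rule matches becomes a token;
    // PRESERVE matches are emitted as tokens, DROP matches are discarded.
    public List<String> tokenize() {
        List<String> tokens = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int pos = 0;
        while (pos < source.length()) {
            Rule hit = null;
            int end = -1;
            // Assumption: the first rule matching at the current position wins.
            for (Rule rule : rules) {
                Matcher m = rule.pattern.matcher(source);
                m.region(pos, source.length());
                if (m.lookingAt() && m.end() > pos) {
                    hit = rule;
                    end = m.end();
                    break;
                }
            }
            if (hit == null) {
                // No boundary here: keep accumulating the current token.
                pending.append(source.charAt(pos));
                pos++;
                continue;
            }
            // A boundary ends whatever token was being accumulated.
            if (pending.length() > 0) {
                tokens.add(pending.toString());
                pending.setLength(0);
            }
            if (hit.strategy == Strategy.PRESERVE) {
                tokens.add(source.substring(pos, end));
            }
            pos = end;
        }
        if (pending.length() > 0) {
            tokens.add(pending.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = new MiniTokenizerSketch("hello, world!")
                .addRule(Strategy.DROP, "\\s+")
                .addRule(Strategy.PRESERVE, ",")
                .tokenize();
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
```

In this sketch the input "hello, world!" yields three tokens: hello, the preserved comma, and world!. The whitespace only acts as a boundary and is never emitted. The real Tokenizer may differ in details such as rule priority or error handling.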
- Since: 1.0
- Author: Svjatoslav Agejenko
Class - Description
- InvalidSyntaxException - Exception thrown when token parsing encounters unexpected content.
- Terminator - Defines a token boundary using a regular expression pattern.
- (termination strategy type) - Defines how matched tokens are handled by the tokenizer.
- Tokenizer - A regex-based tokenizer for parsing structured text into tokens.
- TokenizerMatch - Represents a matched token from the tokenizer.