eu.svjatoslav.commons.string.tokenizer (Svjatoslav Commons 1.10-SNAPSHOT API)

package eu.svjatoslav.commons.string.tokenizer

Provides a regex-based tokenizer for parsing structured text.

This package contains a flexible tokenizer system for breaking down text into tokens based on regular expression patterns:

Tokenizer - Main tokenizer class that processes source text and extracts tokens
Terminator - Defines token boundaries using regex patterns with configurable handling strategies
TokenizerMatch - Result object containing the matched token and its metadata
InvalidSyntaxException - Exception thrown when parsing fails

The tokenizer supports two termination strategies:

PRESERVE - Return matched tokens for processing
DROP - Silently drop matched tokens (useful for whitespace/comments)

Example usage:


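 import eu.svjatoslav.commons.string.tokenizer.Tokenizer;
 import eu.svjatoslav.commons.string.tokenizer.TokenizerMatch;
 // Assumption: DROP and PRESERVE are constants of Terminator.TerminationStrategy
 import static eu.svjatoslav.commons.string.tokenizer.Terminator.TerminationStrategy.DROP;
 import static eu.svjatoslav.commons.string.tokenizer.Terminator.TerminationStrategy.PRESERVE;

 Tokenizer tokenizer = new Tokenizer("hello, world!");
 tokenizer.addTerminator(DROP, "\\s+");        // Drop whitespace
 tokenizer.addTerminator(PRESERVE, ",");       // Preserve commas
 while (tokenizer.hasMoreContent()) {
     TokenizerMatch match = tokenizer.getNextToken();
     System.out.println(match.token);
 }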
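
To make the difference between the two strategies concrete, the following standalone sketch emulates them using only java.util.regex. It is not the library's implementation: the names StrategySketch, Strategy, Rule and tokenize are hypothetical, and the real Tokenizer may differ in rule priority and edge cases. The sketch assumes that a terminator ends the text collected before it, and that a PRESERVE terminator is then emitted as a token of its own while a DROP terminator is discarded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrategySketch {

    // Hypothetical stand-in for the two strategies described above.
    enum Strategy { PRESERVE, DROP }

    // Hypothetical pairing of a strategy with a compiled terminator pattern.
    static final class Rule {
        final Strategy strategy;
        final Pattern pattern;

        Rule(Strategy strategy, String regex) {
            this.strategy = strategy;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Splits the source at terminators; the first rule matching at the
    // current position wins (an assumption of this sketch).
    static List<String> tokenize(String source, List<Rule> rules) {
        List<String> tokens = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int pos = 0;
        while (pos < source.length()) {
            Rule hit = null;
            int end = pos;
            for (Rule rule : rules) {
                Matcher m = rule.pattern.matcher(source).region(pos, source.length());
                // lookingAt() anchors the match at the start of the region.
                if (m.lookingAt() && m.end() > pos) {
                    hit = rule;
                    end = m.end();
                    break;
                }
            }
            if (hit == null) {
                // No terminator here: the character belongs to the current token.
                pending.append(source.charAt(pos));
                pos++;
                continue;
            }
            // A terminator ends whatever text was collected before it.
            if (pending.length() > 0) {
                tokens.add(pending.toString());
                pending.setLength(0);
            }
            // PRESERVE emits the terminator itself; DROP discards it.
            if (hit.strategy == Strategy.PRESERVE) {
                tokens.add(source.substring(pos, end));
            }
            pos = end;
        }
        if (pending.length() > 0) {
            tokens.add(pending.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule(Strategy.DROP, "\\s+"),      // whitespace separates tokens, then vanishes
                new Rule(Strategy.PRESERVE, ","));    // commas separate tokens and are kept
        for (String token : tokenize("hello, world!", rules)) {
            System.out.println(token);
        }
    }
}
```

Under these assumptions the sketch yields three tokens for "hello, world!": hello, the comma, and world!. The whitespace still acts as a boundary but never reaches the caller, which is why DROP suits whitespace and comments.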

Since: 1.0

Author: Svjatoslav Agejenko

Related Packages

Package - Description
eu.svjatoslav.commons.string - Provides utility classes for string manipulation and pattern matching.

Class - Description
InvalidSyntaxException - Exception thrown when token parsing encounters unexpected content.
Terminator - Defines a token boundary using a regular expression pattern.
Terminator.TerminationStrategy - Defines how matched tokens are handled by the tokenizer.
Tokenizer - A regex-based tokenizer for parsing structured text into tokens.
TokenizerMatch - Represents a matched token from the tokenizer.