TinyTokenizer 0.3.0

There is a newer version of this package available.
See the version list below for details.

TinyTokenizer

A high-performance, zero-allocation tokenizer library for .NET 8+ that parses text into abstract tokens using ReadOnlySpan<char> and SIMD-optimized SearchValues<char> for maximum efficiency.

Features

Zero-allocation parsing — Uses ReadOnlySpan<char> internally for fast, allocation-free text traversal
SIMD-optimized — Uses .NET 8's SearchValues<char> for vectorized character matching
Two-level architecture — Lexer (character classification) + TokenParser (semantic parsing)
Async streaming — Tokenize Stream and PipeReader sources with IAsyncEnumerable<Token>
Recursive declaration blocks — Automatically parses nested {}, [], and () blocks with child tokens
String literals — Supports single and double-quoted strings with escape sequences
Numeric literals — Parses integers and floating-point numbers
Comment support — Configurable single-line and multi-line comment styles
Operators — Configurable multi-character operators with greedy matching
Tagged identifiers — Configurable prefix characters for patterns like #define, @attribute, $variable
Configurable symbols — Define which characters are recognized as symbol tokens
Immutable tokens — All token types are immutable record classes
Error recovery — Gracefully handles malformed input with ErrorToken and continues parsing

Installation

dotnet add package TinyTokenizer

Or add a reference to the TinyTokenizer project in your solution.

Quick Start

using TinyTokenizer;

// Simple tokenization
var tokens = "func(a, b)".TokenizeToTokens();

// tokens contains:
// - IdentToken("func")
// - BlockToken("(a, b)") with children:
//   - IdentToken("a")
//   - SymbolToken(",")
//   - WhitespaceToken(" ")
//   - IdentToken("b")

// With options
var options = TokenizerOptions.Default
    .WithCommentStyles(CommentStyle.CStyleSingleLine, CommentStyle.CStyleMultiLine);

var commentTokens = "x = 42; // comment".TokenizeToTokens(options);

Architecture

TinyTokenizer uses a two-level tokenization architecture:

Level 1: Lexer

The Lexer is a stateless character classifier that never backtracks and never fails. It produces SimpleToken instances representing atomic character sequences:

Ident — identifier characters
Whitespace — spaces, tabs (excluding newlines)
Newline — \n or \r\n (separate for single-line comment detection)
Digits — consecutive digit characters
Symbol — configured symbol characters
Dot, Slash, Asterisk, Backslash — special characters for parsing
OpenBrace/CloseBrace, OpenBracket/CloseBracket, OpenParen/CloseParen
SingleQuote, DoubleQuote
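
To make the single-pass, non-backtracking idea concrete, here is a minimal sketch of a run classifier built on SearchValues<char>. It is not TinyTokenizer's actual Lexer: the names LexerSketch and RunKind, the reduced set of kinds, and the symbol set ",;:=+-" are assumptions chosen for illustration only.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

// Hypothetical names; a simplified stand-in for the real SimpleToken kinds.
public enum RunKind { Ident, Whitespace, Newline, Digits, Symbol }

public static class LexerSketch
{
    private static readonly SearchValues<char> Whitespace = SearchValues.Create(" \t");
    private static readonly SearchValues<char> Digits = SearchValues.Create("0123456789");
    private static readonly SearchValues<char> Symbols = SearchValues.Create(",;:=+-");
    // Characters that end an identifier run (whitespace, newlines, symbols).
    private static readonly SearchValues<char> IdentEnd = SearchValues.Create(" \t\r\n,;:=+-");

    public static List<(RunKind Kind, string Text, int Position)> Lex(string input)
    {
        var runs = new List<(RunKind Kind, string Text, int Position)>();
        ReadOnlySpan<char> span = input;
        int pos = 0;
        while (pos < span.Length)
        {
            ReadOnlySpan<char> rest = span[pos..];
            char c = rest[0];
            RunKind kind;
            int length;
            if (Whitespace.Contains(c))
            {
                kind = RunKind.Whitespace;
                length = RunLength(rest, rest.IndexOfAnyExcept(Whitespace));
            }
            else if (c == '\n' || c == '\r')
            {
                kind = RunKind.Newline;
                // "\r\n" is one newline; a lone '\r' or '\n' is one character.
                length = (c == '\r' && rest.Length > 1 && rest[1] == '\n') ? 2 : 1;
            }
            else if (Digits.Contains(c))
            {
                kind = RunKind.Digits;
                length = RunLength(rest, rest.IndexOfAnyExcept(Digits));
            }
            else if (Symbols.Contains(c))
            {
                kind = RunKind.Symbol;
                length = 1;
            }
            else
            {
                // Everything else is identifier text, so classification never fails.
                kind = RunKind.Ident;
                length = RunLength(rest, rest.IndexOfAny(IdentEnd));
            }
            runs.Add((kind, rest[..length].ToString(), pos));
            pos += length; // Always advances by at least one character.
        }
        return runs;
    }

    // The IndexOf* helpers return -1 when the run extends to the end of the input.
    private static int RunLength(ReadOnlySpan<char> rest, int index) =>
        index < 0 ? rest.Length : index;
}
```

Calling LexerSketch.Lex("x1 = 42") yields five runs: an Ident, a Whitespace, a Symbol, a Whitespace, and a Digits run. Unlike this sketch, the real Lexer works on ReadOnlyMemory<char> and may avoid the per-run string allocation.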

Level 2: TokenParser

The TokenParser combines simple tokens into semantic tokens:

Block parsing — recursive nesting of {}, [], ()
String literals — quoted strings with escape sequences
Numeric literals — integers and floating-point numbers
Comments — single-line and multi-line
Error recovery — produces ErrorToken for malformed input

// Using the two-level API directly
var options = TokenizerOptions.Default;
var source = "func(a, b)";

var lexer = new Lexer(options);
var parser = new TokenParser(options);

var simpleTokens = lexer.Lex(source);
var tokens = parser.ParseToArray(simpleTokens);

Token Types

Type	Description	Example
`IdentToken`	Identifier/text content	`hello`, `func`, `_name`
`WhitespaceToken`	Spaces, tabs, newlines	` `, `\t`, `\n`
`SymbolToken`	Configurable symbol characters	`/`, `:`, `,`, `;`
`OperatorToken`	Multi-character operators	`==`, `!=`, `&&`, `||`, `->`
`TaggedIdentToken`	Tag prefix + identifier	`#define`, `@Override`, `$var`
`NumericToken`	Integer or floating-point numbers	`123`, `3.14`, `.5`
`StringToken`	Quoted string literals	`"hello"`, `'c'`
`CommentToken`	Single or multi-line comments	`// comment`, `/* block */`
`BlockToken`	Declaration blocks with delimiters	`{...}`, `[...]`, `(...)`
`ErrorToken`	Parsing errors (unmatched delimiters)	`}` without opening `{`

Token Properties

// All tokens have Position tracking
var token = tokens[0];
long position = token.Position;  // Character offset in source

// Each cast below assumes the token actually has that runtime type.

// NumericToken
var num = (NumericToken)token;
_ = num.NumericType;  // NumericType.Integer or NumericType.FloatingPoint

// StringToken
var str = (StringToken)token;
_ = str.Quote;  // '"' or '\''
_ = str.Value;  // Content without quotes (ReadOnlySpan<char>)

// CommentToken
var comment = (CommentToken)token;
_ = comment.IsMultiLine;  // true for /* */, false for //

// BlockToken
var block = (BlockToken)token;
_ = block.FullContent;       // "{inner content}" (includes delimiters)
_ = block.InnerContent;      // "inner content" (excludes delimiters)
_ = block.Children;          // ImmutableArray<Token> of parsed inner tokens
_ = block.OpeningDelimiter;  // '{'
_ = block.ClosingDelimiter;  // '}'

// OperatorToken
var op = (OperatorToken)token;
_ = op.Operator;  // "==" or "!=" etc. (string)

// TaggedIdentToken
var tagged = (TaggedIdentToken)token;
_ = tagged.Tag;      // '#' or '@' or '$' etc.
_ = tagged.NameSpan; // "define" or "Override" (ReadOnlySpan<char>)

Async Tokenization

Tokenize streams and pipes asynchronously:

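using System.IO.Pipelines;
using System.Text;
using TinyTokenizer;

// The snippets below are alternatives. Each one reads its stream to the end,
// so every call gets a freshly opened stream.

// From Stream
await using var stream = File.OpenRead("source.txt");
var tokens = await stream.TokenizeAsync();

// Streaming with IAsyncEnumerable
await using var streamingSource = File.OpenRead("source.txt");
await foreach (var token in streamingSource.TokenizeStreamingAsync())
{
    Console.WriteLine(token);
}

// From PipeReader
await using var pipeSource = File.OpenRead("source.txt");
var pipeReader = PipeReader.Create(pipeSource);
var pipeTokens = await pipeReader.TokenizeAsync();

// With custom encoding
using var cts = new CancellationTokenSource();
await using var encodedSource = File.OpenRead("source.txt");
var encodedTokens = await encodedSource.TokenizeAsync(
    options: TokenizerOptions.Default,
    encoding: Encoding.UTF8,
    leaveOpen: false,
    cancellationToken: cts.Token);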

Configuration

Symbols

// Default symbols: / : , ; = + - * < > ! & | . @ # ? % ^ ~ \
var options = TokenizerOptions.Default;

// Add custom symbols
options = options.WithAdditionalSymbols('$', '_');

// Remove symbols (they become part of identifier tokens)
options = options.WithoutSymbols('/');

// Replace entire symbol set
options = options.WithSymbols(':', ',', ';');

Comment Styles

// Built-in comment styles
CommentStyle.CStyleSingleLine   // //
CommentStyle.CStyleMultiLine    // /* */
CommentStyle.HashSingleLine     // #
CommentStyle.SqlSingleLine      // --
CommentStyle.HtmlComment        // <!-- -->

// Configure tokenizer with comments
var options = TokenizerOptions.Default
    .WithCommentStyles(
        CommentStyle.CStyleSingleLine,
        CommentStyle.CStyleMultiLine);

// Add additional comment styles
options = options.WithAdditionalCommentStyles(CommentStyle.HashSingleLine);

// Custom comment style
var customComment = new CommentStyle("REM", null);  // Single-line ending at newline
var blockComment = new CommentStyle("(*", "*)");    // Multi-line Pascal-style

Operators

// Built-in operator sets
CommonOperators.Universal   // +, -, *, /, %, ==, !=, <, >, <=, >=, &&, ||, !, =, +=, -=, *=, /=
CommonOperators.CFamily     // Universal + ++, --, &, |, ^, ~, <<, >>, ->, ::, etc.
CommonOperators.JavaScript  // CFamily + ===, !==, =>, ?., ??, ??=, **
CommonOperators.Python      // Universal + //, **, ->, :=, @, &, |, ^, ~
CommonOperators.Sql         // Universal + <>, ::

// Configure operators (uses greedy matching - longest operator first)
var options = TokenizerOptions.Default
    .WithOperators(CommonOperators.CFamily);

// Add custom operators
options = options.WithAdditionalOperators("<=>", "??", "?.");

// Remove specific operators
options = options.WithoutOperators("++", "--");

// No operators (all symbols emit individually)
options = options.WithNoOperators();
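
Greedy matching means that at each position the longest configured operator wins, so "===" is not split into "==" and "=". The following is a minimal, self-contained sketch of that idea. It is not the library's implementation; the class GreedyOperatorSketch and its Split method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GreedyOperatorSketch
{
    // Splits a run of symbol characters into operators, preferring the longest match.
    public static List<string> Split(string symbols, IEnumerable<string> operators)
    {
        // Sort once so longer operators are tried before their own prefixes.
        string[] byLength = operators
            .Where(op => op.Length > 0)
            .OrderByDescending(op => op.Length)
            .ToArray();

        var result = new List<string>();
        int pos = 0;
        while (pos < symbols.Length)
        {
            // First operator (longest first) that matches at the current position.
            string? match = byLength.FirstOrDefault(
                op => string.CompareOrdinal(symbols, pos, op, 0, op.Length) == 0);

            // No configured operator starts here: emit the single character on its own.
            string piece = match ?? symbols[pos].ToString();
            result.Add(piece);
            pos += piece.Length;
        }
        return result;
    }
}
```

For example, Split("===", new[] { "=", "==", "===" }) returns a single "===", while Split("!==", new[] { "!=", "==" }) returns "!=" followed by "=", because "!=" is the longest operator that matches at position 0.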

Tagged Identifiers

Tagged identifiers recognize patterns like #define, @attribute, or $variable:

// Enable C-style preprocessor tags
var hashOptions = TokenizerOptions.Default.WithTagPrefixes('#');
var hashTokens = "#include #define".TokenizeToTokens(hashOptions);
// hashTokens: TaggedIdentToken("#include"), WhitespaceToken, TaggedIdentToken("#define")

// Enable Java/C# style annotations
var atOptions = TokenizerOptions.Default.WithTagPrefixes('@');
var atTokens = "@Override @NotNull".TokenizeToTokens(atOptions);

// Enable shell/PHP style variables
var dollarOptions = TokenizerOptions.Default.WithTagPrefixes('$');
var dollarTokens = "$name $count".TokenizeToTokens(dollarOptions);

// Multiple prefixes for multi-language support
var options = TokenizerOptions.Default.WithTagPrefixes('#', '@', '$');

// Add/remove prefixes
options = options.WithAdditionalTagPrefixes('~');
options = options.WithoutTagPrefixes('#');
options = options.WithNoTagPrefixes();  // Disable all

Note: Tag prefix characters are automatically treated as symbols by the Lexer, so any character can be used as a tag prefix.

Nested Blocks

Declaration blocks are parsed recursively:

var tokens = "{outer [inner (deepest)]}".TokenizeToTokens();

var braceBlock = (BlockToken)tokens[0];                  // {outer [inner (deepest)]}
var bracketBlock = (BlockToken)braceBlock.Children[2];   // [inner (deepest)]
var parenBlock = (BlockToken)bracketBlock.Children[2];   // (deepest)
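
Because each BlockToken exposes its parsed children, a nested structure can be walked with ordinary recursion. The sketch below computes the deepest nesting level. It relies only on the BlockToken.Children property documented above; the helper class BlockDepth is a hypothetical name and not part of the package.

```csharp
using System;
using System.Collections.Immutable;
using TinyTokenizer;

public static class BlockDepth
{
    // Returns the deepest level of block nesting; 0 means the input has no blocks.
    public static int MaxDepth(ImmutableArray<Token> tokens)
    {
        int deepest = 0;
        foreach (Token token in tokens)
        {
            if (token is BlockToken block)
            {
                // One level for this block plus whatever is nested inside it.
                deepest = Math.Max(deepest, 1 + MaxDepth(block.Children));
            }
        }
        return deepest;
    }
}
```

A call such as BlockDepth.MaxDepth("{outer [inner (deepest)]}".TokenizeToTokens()) walks the brace, bracket, and parenthesis blocks in turn. Recursion depth equals nesting depth, so an iterative walk with an explicit stack may be preferable for very deeply nested input.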

Error Handling

The tokenizer produces ErrorToken for malformed input and continues parsing:

var tokens = "}hello{".TokenizeToTokens();

// tokens contains:
// - ErrorToken("}", "Unexpected closing delimiter '}'", position: 0)
// - IdentToken("hello")
// - ErrorToken("{", "Unclosed block starting with '{'", position: 6)

// Check for errors
if (tokens.HasErrors())
{
    foreach (var error in tokens.GetErrors())
    {
        Console.WriteLine($"Error at {error.Position}: {error.ErrorMessage}");
    }
}
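
One common way to report unmatched delimiters without aborting is to track open delimiters on a stack and record an error whenever a closer does not match. The sketch below illustrates that technique under simplifying assumptions: it ignores strings and comments, it is not TinyTokenizer's parser, and the class DelimiterCheckSketch is a hypothetical name. The messages mirror the ones shown above.

```csharp
using System;
using System.Collections.Generic;

public static class DelimiterCheckSketch
{
    // Reports unmatched delimiters as (position, message) pairs instead of throwing.
    public static List<(int Position, string Message)> FindErrors(string source)
    {
        var errors = new List<(int Position, string Message)>();
        var open = new Stack<(char Delimiter, int Position)>();

        for (int i = 0; i < source.Length; i++)
        {
            char c = source[i];
            if (c is '{' or '[' or '(')
            {
                open.Push((c, i));
            }
            else if (c is '}' or ']' or ')')
            {
                if (open.Count > 0 && Closer(open.Peek().Delimiter) == c)
                    open.Pop();   // Matched pair: the block is complete.
                else
                    errors.Add((i, $"Unexpected closing delimiter '{c}'"));
            }
        }

        // Anything still on the stack was never closed.
        foreach (var (delimiter, position) in open)
            errors.Add((position, $"Unclosed block starting with '{delimiter}'"));

        errors.Sort((a, b) => a.Position.CompareTo(b.Position));
        return errors;
    }

    private static char Closer(char opener) => opener switch
    {
        '{' => '}',
        '[' => ']',
        _ => ')',
    };
}
```

For the input "}hello{" this returns two entries: an unexpected closing delimiter at position 0 and an unclosed block at position 6. Scanning continues after each error, which is the essence of error recovery.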

Utility Extensions

Extensions on ImmutableArray<Token> for common operations:

// Check if any errors exist (including nested)
bool hasErrors = tokens.HasErrors();

// Get all errors (including nested)
IEnumerable<ErrorToken> errors = tokens.GetErrors();

// Get all tokens of a specific type (including nested)
IEnumerable<IdentToken> idents = tokens.OfTokenType<IdentToken>();
IEnumerable<BlockToken> blocks = tokens.OfTokenType<BlockToken>();
IEnumerable<NumericToken> numbers = tokens.OfTokenType<NumericToken>();

Benchmarks

Performance comparison of the optimized SearchValues<char> implementation vs the baseline ImmutableHashSet<char>:

Input Size	Baseline	Optimized	Speedup
Small (~50 chars)	377 ns	245 ns	1.54x
Medium (~1KB)	6,866 ns	3,020 ns	2.27x
Large (~100KB)	1,907 μs	781 μs	2.44x
JSON (~10KB)	130 μs	87 μs	1.51x
Whitespace-heavy	9,808 ns	3,661 ns	2.68x

Run benchmarks yourself:

dotnet run -c Release --project TinyTokenizer.Benchmarks -- --filter "*"

API Reference

Extension Methods

// String extensions
ImmutableArray<Token> TokenizeToTokens(this string source, TokenizerOptions? options = null)

// ReadOnlyMemory<char> extensions
ImmutableArray<Token> Tokenize(this ReadOnlyMemory<char> source, TokenizerOptions? options = null)

// Stream extensions
Task<ImmutableArray<Token>> TokenizeAsync(this Stream, TokenizerOptions?, Encoding?, bool leaveOpen, CancellationToken)
IAsyncEnumerable<Token> TokenizeStreamingAsync(this Stream, TokenizerOptions?, Encoding?, bool leaveOpen, CancellationToken)

// PipeReader extensions
Task<ImmutableArray<Token>> TokenizeAsync(this PipeReader, TokenizerOptions?, Encoding?, CancellationToken)
IAsyncEnumerable<Token> TokenizeStreamingAsync(this PipeReader, TokenizerOptions?, Encoding?, CancellationToken)

Tokenizer (ref struct)

public ref struct Tokenizer
{
    public Tokenizer(ReadOnlyMemory<char> source, TokenizerOptions? options = null);
    public ImmutableArray<Token> Tokenize();
}

Lexer

public sealed class Lexer
{
    public Lexer();
    public Lexer(ImmutableHashSet<char> symbols);
    public Lexer(TokenizerOptions options);

    public IEnumerable<SimpleToken> Lex(ReadOnlyMemory<char> input);
    public IEnumerable<SimpleToken> Lex(string input);
    public ImmutableArray<SimpleToken> LexToArray(ReadOnlyMemory<char> input);
}

TokenParser

public sealed class TokenParser
{
    public TokenParser();
    public TokenParser(TokenizerOptions options);

    public IEnumerable<Token> Parse(IEnumerable<SimpleToken> simpleTokens);
    public ImmutableArray<Token> ParseToArray(IEnumerable<SimpleToken> simpleTokens);
}

Token (abstract record)

public abstract record Token(ReadOnlyMemory<char> Content, TokenType Type, long Position)
{
    public ReadOnlySpan<char> ContentSpan { get; }
}

TokenType (enum)

public enum TokenType
{
    BraceBlock,       // { }
    BracketBlock,     // [ ]
    ParenthesisBlock, // ( )
    Symbol,           // configurable characters
    Ident,            // identifiers
    Whitespace,       // spaces, tabs, newlines
    Numeric,          // numbers
    String,           // quoted strings
    Comment,          // comments
    Error,            // parsing errors
    Operator,         // multi-character operators
    TaggedIdent       // tag prefix + identifier
}

Requirements

.NET 8.0 or later

License

MIT


Dependencies

net8.0: System.IO.Pipelines (>= 8.0.0)


Version	Downloads	Last Updated
0.10.0	150	1/7/2026
0.9.0	123	1/5/2026
0.8.0	131	1/4/2026
0.7.0	127	1/3/2026
0.6.8	122	1/2/2026
0.6.7	123	1/2/2026
0.6.6	121	1/2/2026
0.6.5	124	1/1/2026
0.6.4	124	1/1/2026
0.6.3	120	1/1/2026
0.6.2	126	1/1/2026
0.6.1	119	12/31/2025
0.6.0	125	12/31/2025
0.5.1	122	12/31/2025
0.5.0	126	12/30/2025
0.4.1	116	12/29/2025
0.4.0	113	12/29/2025
0.3.0	123	12/27/2025
0.2.0	189	12/26/2025
0.1.0	200	12/25/2025