TinyTokenizer 0.3.0

There is a newer version of this package available.
See the version list below for details.

TinyTokenizer

A high-performance, zero-allocation tokenizer library for .NET 8+ that parses text into abstract tokens using ReadOnlySpan<char> and SIMD-optimized SearchValues<char> for maximum efficiency.

Features

  • Zero-allocation parsing — Uses ReadOnlySpan<char> internally for fast, allocation-free text traversal
  • SIMD-optimized — Uses .NET 8's SearchValues<char> for vectorized character matching
  • Two-level architecture — Lexer (character classification) + TokenParser (semantic parsing)
  • Async streaming — Tokenize Stream and PipeReader sources with IAsyncEnumerable<Token>
  • Recursive declaration blocks — Automatically parses nested {}, [], and () blocks with child tokens
  • String literals — Supports single and double-quoted strings with escape sequences
  • Numeric literals — Parses integers and floating-point numbers
  • Comment support — Configurable single-line and multi-line comment styles
  • Operators — Configurable multi-character operators with greedy matching
  • Tagged identifiers — Configurable prefix characters for patterns like #define, @attribute, $variable
  • Configurable symbols — Define which characters are recognized as symbol tokens
  • Immutable tokens — All token types are immutable record classes
  • Error recovery — Gracefully handles malformed input with ErrorToken and continues parsing

Installation

dotnet add package TinyTokenizer

Or add a reference to the TinyTokenizer project in your solution.

Quick Start

using TinyTokenizer;

// Simple tokenization
var tokens = "func(a, b)".TokenizeToTokens();

// tokens contains:
// - IdentToken("func")
// - BlockToken("(a, b)") with children:
//   - IdentToken("a")
//   - SymbolToken(",")
//   - WhitespaceToken(" ")
//   - IdentToken("b")

// With options
var options = TokenizerOptions.Default
    .WithCommentStyles(CommentStyle.CStyleSingleLine, CommentStyle.CStyleMultiLine);

// A distinct name avoids redeclaring 'tokens' from the first example
var tokensWithComments = "x = 42; // comment".TokenizeToTokens(options);

Architecture

TinyTokenizer uses a two-level tokenization architecture:

Level 1: Lexer

The Lexer is a stateless character classifier that never backtracks and never fails. It produces SimpleToken instances representing atomic character sequences:

  • Ident — identifier characters
  • Whitespace — spaces, tabs (excluding newlines)
  • Newline — \n or \r\n (kept separate for single-line comment detection)
  • Digits — consecutive digit characters
  • Symbol — configured symbol characters
  • Dot, Slash, Asterisk, Backslash — special characters for parsing
  • OpenBrace/CloseBrace, OpenBracket/CloseBracket, OpenParen/CloseParen
  • SingleQuote, DoubleQuote

Level 2: TokenParser

The TokenParser combines simple tokens into semantic tokens:

  • Block parsing — recursive nesting of {}, [], ()
  • String literals — quoted strings with escape sequences
  • Numeric literals — integers and floating-point numbers
  • Comments — single-line and multi-line
  • Error recovery — produces ErrorToken for malformed input

// Using the two-level API directly
var options = TokenizerOptions.Default;
var source = "func(a, b)";

var lexer = new Lexer(options);
var parser = new TokenParser(options);

var simpleTokens = lexer.Lex(source);
var tokens = parser.ParseToArray(simpleTokens);

Token Types

Type              Description                            Example
IdentToken        Identifier/text content                hello, func, _name
WhitespaceToken   Spaces, tabs, newlines                 ' ', \t, \n
SymbolToken       Configurable symbol characters         /, :, ,, ;
OperatorToken     Multi-character operators              ==, !=, &&, ||, ->
TaggedIdentToken  Tag prefix + identifier                #define, @Override, $var
NumericToken      Integer or floating-point numbers      123, 3.14, .5
StringToken       Quoted string literals                 "hello", 'c'
CommentToken      Single or multi-line comments          // comment, /* block */
BlockToken        Declaration blocks with delimiters     {...}, [...], (...)
ErrorToken        Parsing errors (unmatched delimiters)  } without opening {

Token Properties

// All tokens have Position tracking
var token = tokens[0];
long position = token.Position;  // Character offset in source

// NumericToken
var num = (NumericToken)token;
num.NumericType;  // NumericType.Integer or NumericType.FloatingPoint

// StringToken
var str = (StringToken)token;
str.Quote;  // '"' or '\''
str.Value;  // Content without quotes (ReadOnlySpan<char>)

// CommentToken
var comment = (CommentToken)token;
comment.IsMultiLine;  // true for /* */, false for //

// BlockToken
var block = (BlockToken)token;
block.FullContent;       // "{inner content}" (includes delimiters)
block.InnerContent;      // "inner content" (excludes delimiters)
block.Children;          // ImmutableArray<Token> of parsed inner tokens
block.OpeningDelimiter;  // '{'
block.ClosingDelimiter;  // '}'

// OperatorToken
var op = (OperatorToken)token;
op.Operator;  // "==" or "!=" etc. (string)

// TaggedIdentToken
var tagged = (TaggedIdentToken)token;
tagged.Tag;      // '#' or '@' or '$' etc.
tagged.NameSpan; // "define" or "Override" (ReadOnlySpan<char>)

Async Tokenization

Tokenize streams and pipes asynchronously:

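using System;
using System.IO;
using System.IO.Pipelines;
using System.Text;
using System.Threading;
using TinyTokenizer;

// Each snippet below is an independent alternative and opens its own stream.

// From Stream
await using var stream = File.OpenRead("source.txt");
var tokens = await stream.TokenizeAsync();

// Streaming with IAsyncEnumerable
await using var streamingSource = File.OpenRead("source.txt");
await foreach (var token in streamingSource.TokenizeStreamingAsync())
{
    Console.WriteLine(token);
}

// From PipeReader
await using var pipeSource = File.OpenRead("source.txt");
var pipeReader = PipeReader.Create(pipeSource);
var pipeTokens = await pipeReader.TokenizeAsync();

// With custom encoding
var ct = CancellationToken.None;  // substitute your own token
await using var encodedSource = File.OpenRead("source.txt");
var encodedTokens = await encodedSource.TokenizeAsync(
    options: TokenizerOptions.Default,
    encoding: Encoding.UTF8,
    leaveOpen: false,
    cancellationToken: ct);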

Configuration

Symbols

// Default symbols: / : , ; = + - * < > ! & | . @ # ? % ^ ~ \
var options = TokenizerOptions.Default;

// Add custom symbols
options = options.WithAdditionalSymbols('$', '_');

// Remove symbols (they become part of identifier tokens)
options = options.WithoutSymbols('/');

// Replace entire symbol set
options = options.WithSymbols(':', ',', ';');

Comment Styles

// Built-in comment styles
CommentStyle.CStyleSingleLine   // //
CommentStyle.CStyleMultiLine    // /* */
CommentStyle.HashSingleLine     // #
CommentStyle.SqlSingleLine      // --
CommentStyle.HtmlComment        // <!-- -->

// Configure tokenizer with comments
var options = TokenizerOptions.Default
    .WithCommentStyles(
        CommentStyle.CStyleSingleLine,
        CommentStyle.CStyleMultiLine);

// Add additional comment styles
options = options.WithAdditionalCommentStyles(CommentStyle.HashSingleLine);

// Custom comment style
var customComment = new CommentStyle("REM", null);  // Single-line ending at newline
var blockComment = new CommentStyle("(*", "*)");    // Multi-line Pascal-style

Operators

// Built-in operator sets
CommonOperators.Universal   // +, -, *, /, %, ==, !=, <, >, <=, >=, &&, ||, !, =, +=, -=, *=, /=
CommonOperators.CFamily     // Universal + ++, --, &, |, ^, ~, <<, >>, ->, ::, etc.
CommonOperators.JavaScript  // CFamily + ===, !==, =>, ?., ??, ??=, **
CommonOperators.Python      // Universal + //, **, ->, :=, @, &, |, ^, ~
CommonOperators.Sql         // Universal + <>, ::

// Configure operators (uses greedy matching - longest operator first)
var options = TokenizerOptions.Default
    .WithOperators(CommonOperators.CFamily);

// Add custom operators
options = options.WithAdditionalOperators("<=>", "??", "?.");

// Remove specific operators
options = options.WithoutOperators("++", "--");

// No operators (all symbols emit individually)
options = options.WithNoOperators();
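
Greedy matching, as mentioned in the comment above, means that when several operators start at the same position the longest one wins. A minimal standalone sketch of that idea follows. The MatchOperator helper and the small operator set are hypothetical and are not part of the TinyTokenizer API; the library's actual matching strategy may be implemented differently.

```csharp
using System;
using System.Linq;

// Hypothetical operator set, sorted longest first so the longest candidate wins.
string[] operators = new[] { "=", "==", "===", "!", "!=", "=>" }
    .OrderByDescending(op => op.Length)
    .ToArray();

// Returns the longest operator that starts at 'position', or null if none does.
string? MatchOperator(ReadOnlySpan<char> source, int position)
{
    ReadOnlySpan<char> rest = source[position..];
    foreach (string op in operators)
    {
        if (rest.StartsWith(op, StringComparison.Ordinal))
            return op;
    }
    return null;
}

Console.WriteLine(MatchOperator("a === b", 2)); // ===
Console.WriteLine(MatchOperator("a => b", 2));  // =>
Console.WriteLine(MatchOperator("a = b", 2));   // =
```

Without the longest-first ordering, "===" would be split into "=" followed by "==" or three separate "=" tokens, depending on which candidate is tried first.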

Tagged Identifiers

Tagged identifiers recognize patterns like #define, @attribute, or $variable:

// Enable C-style preprocessor tags
var preprocessorOptions = TokenizerOptions.Default.WithTagPrefixes('#');
var preprocessorTokens = "#include #define".TokenizeToTokens(preprocessorOptions);
// preprocessorTokens: TaggedIdentToken("#include"), WhitespaceToken, TaggedIdentToken("#define")

// Enable Java/C# style annotations
var annotationOptions = TokenizerOptions.Default.WithTagPrefixes('@');
var annotationTokens = "@Override @NotNull".TokenizeToTokens(annotationOptions);

// Enable shell/PHP style variables
var variableOptions = TokenizerOptions.Default.WithTagPrefixes('$');
var variableTokens = "$name $count".TokenizeToTokens(variableOptions);

// Multiple prefixes for multi-language support
var options = TokenizerOptions.Default.WithTagPrefixes('#', '@', '$');

// Add/remove prefixes
options = options.WithAdditionalTagPrefixes('~');
options = options.WithoutTagPrefixes('#');
options = options.WithNoTagPrefixes();  // Disable all

Note: Tag prefix characters are automatically treated as symbols by the Lexer, so any character can be used as a tag prefix.

Nested Blocks

Declaration blocks are parsed recursively:

var tokens = "{outer [inner (deepest)]}".TokenizeToTokens();

var braceBlock = (BlockToken)tokens[0];                  // {outer [inner (deepest)]}
var bracketBlock = (BlockToken)braceBlock.Children[2];   // [inner (deepest)]
var parenBlock = (BlockToken)bracketBlock.Children[2];   // (deepest)

Error Handling

The tokenizer produces ErrorToken for malformed input and continues parsing:

var tokens = "}hello{".TokenizeToTokens();

// tokens contains:
// - ErrorToken("}", "Unexpected closing delimiter '}'", position: 0)
// - IdentToken("hello")
// - ErrorToken("{", "Unclosed block starting with '{'", position: 6)

// Check for errors
if (tokens.HasErrors())
{
    foreach (var error in tokens.GetErrors())
    {
        Console.WriteLine($"Error at {error.Position}: {error.ErrorMessage}");
    }
}
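
The recovery behaviour shown above (report the problem, then keep going) can be sketched with a plain delimiter stack. The FindDelimiterErrors function below is a hypothetical illustration, not the library's parser: it only tracks delimiters, ignores strings and comments, and reuses the message wording from the example output above.

```csharp
using System;
using System.Collections.Generic;

// Scans for unbalanced delimiters and continues after each problem,
// collecting (position, message) pairs instead of throwing.
List<(int Position, string Message)> FindDelimiterErrors(string source)
{
    var errors = new List<(int Position, string Message)>();
    var open = new Stack<(char Delimiter, int Position)>();
    for (int i = 0; i < source.Length; i++)
    {
        char c = source[i];
        if (c is '{' or '[' or '(')
        {
            open.Push((c, i));
        }
        else if (c is '}' or ']' or ')')
        {
            char expected = c switch { '}' => '{', ']' => '[', _ => '(' };
            if (open.Count > 0 && open.Peek().Delimiter == expected)
                open.Pop();
            else
                errors.Add((i, $"Unexpected closing delimiter '{c}'"));
        }
    }
    // Anything still on the stack was opened but never closed.
    foreach (var (delimiter, position) in open)
        errors.Add((position, $"Unclosed block starting with '{delimiter}'"));
    errors.Sort((a, b) => a.Position.CompareTo(b.Position));
    return errors;
}

foreach (var (pos, msg) in FindDelimiterErrors("}hello{"))
    Console.WriteLine($"Error at {pos}: {msg}");
// Error at 0: Unexpected closing delimiter '}'
// Error at 6: Unclosed block starting with '{'
```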

Utility Extensions

Extensions on ImmutableArray<Token> for common operations:

// Check if any errors exist (including nested)
bool hasErrors = tokens.HasErrors();

// Get all errors (including nested)
IEnumerable<ErrorToken> errors = tokens.GetErrors();

// Get all tokens of a specific type (including nested)
IEnumerable<IdentToken> idents = tokens.OfTokenType<IdentToken>();
IEnumerable<BlockToken> blocks = tokens.OfTokenType<BlockToken>();
IEnumerable<NumericToken> numbers = tokens.OfTokenType<NumericToken>();

Benchmarks

Performance comparison of the optimized SearchValues<char> implementation vs the baseline ImmutableHashSet<char>:

Input Size         Baseline   Optimized  Speedup
Small (~50 chars)  377 ns     245 ns     1.54x
Medium (~1KB)      6,866 ns   3,020 ns   2.27x
Large (~100KB)     1,907 μs   781 μs     2.44x
JSON (~10KB)       130 μs     87 μs      1.51x
Whitespace-heavy   9,808 ns   3,661 ns   2.68x

Run benchmarks yourself:

dotnet run -c Release --project TinyTokenizer.Benchmarks -- --filter "*"
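
To make the two compared approaches concrete, the sketch below searches for the first symbol character in both styles. It is an illustration only: the function names are hypothetical, the code is not taken from the TinyTokenizer source or its benchmark project, and it makes no timing claims. The symbol characters are the defaults listed under Configuration.

```csharp
using System;
using System.Buffers;
using System.Collections.Immutable;

const string SymbolChars = "/:,;=+-*<>!&|.@#?%^~\\";
ImmutableHashSet<char> symbolSet = SymbolChars.ToImmutableHashSet();
SearchValues<char> symbolValues = SearchValues.Create(SymbolChars);

// Baseline style: one hash-set lookup per character.
int IndexOfSymbolBaseline(ReadOnlySpan<char> text)
{
    for (int i = 0; i < text.Length; i++)
    {
        if (symbolSet.Contains(text[i])) return i;
    }
    return -1;
}

// Optimized style: a single IndexOfAny call that the runtime can vectorize.
int IndexOfSymbolOptimized(ReadOnlySpan<char> text) => text.IndexOfAny(symbolValues);

Console.WriteLine(IndexOfSymbolBaseline("hello world; x"));  // 11
Console.WriteLine(IndexOfSymbolOptimized("hello world; x")); // 11
```

Both functions return the same index; the difference is that the SearchValues<char> instance is built once and lets IndexOfAny examine many characters per step on long runs of non-symbol text.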

API Reference

Extension Methods

// String extensions
ImmutableArray<Token> TokenizeToTokens(this string source, TokenizerOptions? options = null)

// ReadOnlyMemory<char> extensions
ImmutableArray<Token> Tokenize(this ReadOnlyMemory<char> source, TokenizerOptions? options = null)

// Stream extensions
Task<ImmutableArray<Token>> TokenizeAsync(this Stream, TokenizerOptions?, Encoding?, bool leaveOpen, CancellationToken)
IAsyncEnumerable<Token> TokenizeStreamingAsync(this Stream, TokenizerOptions?, Encoding?, bool leaveOpen, CancellationToken)

// PipeReader extensions
Task<ImmutableArray<Token>> TokenizeAsync(this PipeReader, TokenizerOptions?, Encoding?, CancellationToken)
IAsyncEnumerable<Token> TokenizeStreamingAsync(this PipeReader, TokenizerOptions?, Encoding?, CancellationToken)

Tokenizer (ref struct)

public ref struct Tokenizer
{
    public Tokenizer(ReadOnlyMemory<char> source, TokenizerOptions? options = null);
    public ImmutableArray<Token> Tokenize();
}

Lexer

public sealed class Lexer
{
    public Lexer();
    public Lexer(ImmutableHashSet<char> symbols);
    public Lexer(TokenizerOptions options);

    public IEnumerable<SimpleToken> Lex(ReadOnlyMemory<char> input);
    public IEnumerable<SimpleToken> Lex(string input);
    public ImmutableArray<SimpleToken> LexToArray(ReadOnlyMemory<char> input);
}

TokenParser

public sealed class TokenParser
{
    public TokenParser();
    public TokenParser(TokenizerOptions options);

    public IEnumerable<Token> Parse(IEnumerable<SimpleToken> simpleTokens);
    public ImmutableArray<Token> ParseToArray(IEnumerable<SimpleToken> simpleTokens);
}

Token (abstract record)

public abstract record Token(ReadOnlyMemory<char> Content, TokenType Type, long Position)
{
    public ReadOnlySpan<char> ContentSpan { get; }
}

TokenType (enum)

public enum TokenType
{
    BraceBlock,       // { }
    BracketBlock,     // [ ]
    ParenthesisBlock, // ( )
    Symbol,           // configurable characters
    Ident,            // identifiers
    Whitespace,       // spaces, tabs, newlines
    Numeric,          // numbers
    String,           // quoted strings
    Comment,          // comments
    Error,            // parsing errors
    Operator,         // multi-character operators
    TaggedIdent       // tag prefix + identifier
}

Requirements

  • .NET 8.0 or later

License

MIT

Version Downloads Last Updated
0.10.0 150 1/7/2026 0.10.0 is deprecated.
0.9.0 123 1/5/2026
0.8.0 131 1/4/2026
0.7.0 127 1/3/2026
0.6.8 122 1/2/2026
0.6.7 123 1/2/2026
0.6.6 121 1/2/2026
0.6.5 124 1/1/2026
0.6.4 124 1/1/2026
0.6.3 120 1/1/2026
0.6.2 126 1/1/2026
0.6.1 119 12/31/2025
0.6.0 125 12/31/2025
0.5.1 122 12/31/2025
0.5.0 126 12/30/2025
0.4.1 116 12/29/2025
0.4.0 113 12/29/2025
0.3.0 123 12/27/2025
0.2.0 189 12/26/2025
0.1.0 200 12/25/2025