Luthor-tool 1.8.0

.NET 8.0

dotnet tool install --global Luthor-tool --version 1.8.0

This package contains a .NET tool you can call from the shell/command line.

dotnet new tool-manifest
                    

                            if you are setting up this repo

dotnet tool install --local Luthor-tool --version 1.8.0

This package contains a .NET tool you can call from the shell/command line.

#tool dotnet:?package=Luthor-tool&version=1.8.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

nuke :add-package Luthor-tool --version 1.8.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

Luthor

A fast, compact lexer generator that produces simple integer arrays for efficient text matching in any programming language.

Key Features:

Direct-to-DFA conversion (no intermediate NFA)
Partial lazy quantifier support (??, *?, +?) - all expressions are accepted but due to limitations of DFA traversal some complicated lazy matches may end up partly or entirely greedy.
Unicode support (UTF-8, UTF-16, UTF-32)
Compact array output suitable for embedded systems
Language-agnostic - use the arrays in C, C++, Rust, etc.

What Luthor Does

Luthor takes regular expressions or lexer grammars and converts them into simple integer arrays that can be embedded in any program. Instead of shipping a complex regex engine, you get a small array that can be walked with a simple matching loop.

Example: The pattern [A-Za-z_][A-Za-z0-9_]* becomes:

int dfa[] = {-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, 0, 0, 1, 11, 4, 48, 57, 65, 90, 95, 95, 97, 122};

Technical Background

Luthor implements several advanced algorithms:

Direct-to-DFA construction using the Aho-Sethi-Ullman algorithm (Dragon Book)
Lazy matching using Dr. Robert van Engelen's algorithm from RE/FLEX
Hopcroft minimization for optimal state count
Unicode range splitting for efficient multi-byte encoding support

Acknowledgements

Special thanks to Dr. Robert van Engelen for his guidance on implementing lazy matching in pure DFA form.

Quick Start

Installation

dotnet tool install luthor-tool -g

Basic Usage

# Single expression
luthor "[0-9]+"

Lexer from file

luthor mylexer.txt

Generate state diagram

luthor mylexer.txt --graph output.png

Lexer Input Format

Single Expression

For a single regular expression, just provide the pattern:

luthor "if|while|for"

Lexer Grammar

For a full lexer, create a text file with one rule per line:

# C-like Language Lexer
# Comments start with # (must have space or be alone)

# Keywords
if|while|for|int|void

# Identifiers  
[A-Za-z_][A-Za-z0-9_]*

# Integers (decimal and hex)
0x[0-9a-fA-F]+|[0-9]+

# C-style comments
/\*(.|\n)*?\*/

# Line comments  
//.*$

# Whitespace (usually ignored)
[ \t\n\r]+

Rules:

Each expression on its own line
Comments: # text or # alone. #foo (no space) will be interpreted as a regular expression
Accept IDs assigned by line order (0, 1, 2, ...)
Supports POSIX character classes: [[:digit:]], [[:alpha:]], etc.
Lazy quantifiers: *?, +?, ??, {1,3}?

Array Format

Luthor generates two array types:

Range arrays: Use min/max pairs [65,90] for [A-Z] (more compact for large character sets)
Non-range arrays: List individual codepoints (better for sparse patterns)

They are distinguished by having an even length (range arrays), or an odd length (non range arrays), and may be padded with a single -1 at the end to enforce the length.

Array Structure

[accept_id, anchor_mask, transition_count, ...]

For each state:

accept_id: Token ID (-1 if non-accepting)
anchor_mask: 1=line start ^, 2=line end $
transition_count: Number of destination states
For each destination:
- dest_state_index: Target state index
- transition_entry__count: Number of entries
- transition_entry_: Either [min,max] pairs (range array) or individual codepoints (non range array)

Example Breakdown

// Pattern: [A-Z__]+
int dfa[] = {
    -1, 0, 1,        // State 0: not accepting, no anchors, 1 transition
    11, 3,           // -> jump to index 11 on 3 ranges:
    65, 90,          //    A-Z (65-90)
    95, 95,          //    _ (95)  
    97, 122,         //    a-z (97-122)
    
    0, 0, 1,         // State 11: accepting (token 0), no anchors, 1 transition
    11, 4,           // -> jump to index 11 on 4 ranges:
    48, 57,          //    0-9 (48-57)
    65, 90,          //    A-Z (65-90)  
    95, 95,          //    _ (95)
    97, 122          //    a-z (97-122)
};

Using the Generated Arrays

Simple C Lexer (Ranged arrays)

#include <stdio.h>
#include <stdbool.h>

static int lex_ranged(const char** data, const int* dfa) {
    const char* sz = *data;
    int current_state_index = 0;
    bool at_line_start = true;
    bool at_line_end = (*sz == '\0') || (*sz != '\0' && sz[1] == '\0' && *sz == '\n');

    while (*sz != '\0' || at_line_end) {
        int machine_index = current_state_index;
        int accept_id = dfa[machine_index++];
        int anchor_mask = dfa[machine_index++];
        int transition_count = dfa[machine_index++];

        // SPECIAL CASE: Check for $ anchor before newline
        if (accept_id != -1 && (anchor_mask & 2) && *sz == '\n') {
            bool anchor_valid = true;
            if ((anchor_mask & 1) && !at_line_start) {
                anchor_valid = false;
            }
            // End anchor is valid since we're checking before newline
            if (anchor_valid) {
                *data = sz;
                return accept_id;
            }
        }

        for (int t = 0; t < transition_count; t++) {
            int dest_state_index = dfa[machine_index++];
            int range_count = dfa[machine_index++];
            for (int r = 0; r < range_count; r++) {
                int min = dfa[machine_index++];
                int max = dfa[machine_index++];
                if (min >= 0 && *sz != '\0') {
                    int c = (unsigned char)*sz;
                    if (c < min) {
                        machine_index += ((range_count - (r + 1)) * 2);
                        break;
                    }
                    if (c <= max) {
                        current_state_index = dest_state_index;
                        sz++;
                        at_line_end = (*sz == '\0') || (*sz == '\n');
                        at_line_start = (c == '\n');
                        goto found;
                    }
                }
            }
        }
        break;
    found:
        accept_id = dfa[current_state_index];
        anchor_mask = dfa[current_state_index + 1];
        if (accept_id != -1) {
            bool anchor_valid = true;
            if ((anchor_mask & 1) && !at_line_start) {
                anchor_valid = false;
            }
            if ((anchor_mask & 2) && !at_line_end) {
                anchor_valid = false;
            }

            if (anchor_valid) {
                *data = sz;
                return accept_id;
            }
        }
        if (*sz == '\0') break;
    }
    *data = sz;
    return -1;  // No valid match found
}

int main() {
    static const int identifier_dfa[] = {
        -1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122,
        0, 0, 1, 11, 4, 48, 57, 65, 90, 95, 95, 97, 122
    };
    
    const char* input = "hello_world123";
    const char* pos = input;
    
    int token = lex_ranged(&pos, identifier_dfa);
    printf("Matched token %d, consumed: '%.*s'\n", 
           token, (int)(pos - input), input);
    
    return 0;
}

Multi-Language Support

The integer arrays work in any language:

Rust:

const DFA: &[i32] = &[-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, /*...*/];

Python:

dfa = [-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, ...]

JavaScript:

const dfa = [-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, /*...*/];

CLI Reference

luthor

A DFA lexer generator tool

Usage:

luthor <input> [ --enc <encoding> ] [ --array <array> ] [ --graph <graph> ] [ --draft <draft> ]

<input>        The input expression or file to use
<encoding>     The encoding to use (ASCII, UTF-8, UTF-16, or UTF-32, or a single byte encoding). Defaults to UTF-8
<array>        The type of array to generate, ranged or unranged. Defaults to auto which chooses the shorted length
<graph>        Generate a DFA state graph to the specified file (requires GraphViz)
<draft>        Generate a DFA state graph draft to the specified file (requires GraphViz)

luthor [ --? ]

--?            Displays this help screen

# Generate UTF-8 lexer for C identifiers
luthor "[A-Za-z_][A-Za-z0-9_]*"

# Create lexer from grammar file  
luthor c_lexer.txt --enc UTF-16

# Generate lexer with state diagram
luthor json_lexer.txt --graph json_states.png

Performance Characteristics

Memory: Compact arrays, typically KB-sized even for complex lexers
Speed: Simple array traversal, no backtracking
Predictable: DFA guarantees O(n) matching time
Embeddable: No external dependencies, just integer arrays

License

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

Product

.NET

Compatible target framework(s)

Included target framework(s) (in package)

Learn more about Target Frameworks and .NET Standard.

This package has no dependencies.

Version	Downloads	Last Updated
1.8.0	116	7/30/2025
1.7.0	114	7/30/2025
1.6.0	112	7/30/2025
1.5.0	113	7/30/2025
1.4.0	112	7/30/2025
1.3.0	94	7/28/2025
1.2.0	94	7/28/2025
1.1.0	93	7/27/2025
1.0.0	92	7/27/2025