Luthor-tool 1.8.0

dotnet tool install --global Luthor-tool --version 1.8.0
                    
This package contains a .NET tool you can call from the shell/command line.
dotnet new tool-manifest
                    
if you are setting up this repo
dotnet tool install --local Luthor-tool --version 1.8.0
                    
This package contains a .NET tool you can call from the shell/command line.
#tool dotnet:?package=Luthor-tool&version=1.8.0
                    
nuke :add-package Luthor-tool --version 1.8.0
                    

Luthor

A fast, compact lexer generator that produces simple integer arrays for efficient text matching in any programming language.

Key Features:

  • Direct-to-DFA conversion (no intermediate NFA)
  • Partial lazy quantifier support (??, *?, +?) - all expressions are accepted but due to limitations of DFA traversal some complicated lazy matches may end up partly or entirely greedy.
  • Unicode support (UTF-8, UTF-16, UTF-32)
  • Compact array output suitable for embedded systems
  • Language-agnostic - use the arrays in C, C++, Rust, etc.

What Luthor Does

Luthor takes regular expressions or lexer grammars and converts them into simple integer arrays that can be embedded in any program. Instead of shipping a complex regex engine, you get a small array that can be walked with a simple matching loop.

Example: The pattern [A-Za-z_][A-Za-z0-9_]* becomes:

int dfa[] = {-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, 0, 0, 1, 11, 4, 48, 57, 65, 90, 95, 95, 97, 122};

Technical Background

Luthor implements several advanced algorithms:

  • Direct-to-DFA construction using the Aho-Sethi-Ullman algorithm (Dragon Book)
  • Lazy matching using Dr. Robert van Engelen's algorithm from RE/FLEX
  • Hopcroft minimization for optimal state count
  • Unicode range splitting for efficient multi-byte encoding support

Acknowledgements

Special thanks to Dr. Robert van Engelen for his guidance on implementing lazy matching in pure DFA form.

Quick Start

Installation

dotnet tool install luthor-tool -g

Basic Usage

# Single expression
luthor "[0-9]+"

Lexer from file

luthor mylexer.txt

Generate state diagram

luthor mylexer.txt --graph output.png

Lexer Input Format

Single Expression

For a single regular expression, just provide the pattern:

luthor "if|while|for"

Lexer Grammar

For a full lexer, create a text file with one rule per line:

# C-like Language Lexer
# Comments start with # (must have space or be alone)

# Keywords
if|while|for|int|void

# Identifiers  
[A-Za-z_][A-Za-z0-9_]*

# Integers (decimal and hex)
0x[0-9a-fA-F]+|[0-9]+

# C-style comments
/\*(.|\n)*?\*/

# Line comments  
//.*$

# Whitespace (usually ignored)
[ \t\n\r]+

Rules:

  • Each expression on its own line
  • Comments: # text or # alone. #foo (no space) will be interpreted as a regular expression
  • Accept IDs assigned by line order (0, 1, 2, ...)
  • Supports POSIX character classes: [[:digit:]], [[:alpha:]], etc.
  • Lazy quantifiers: *?, +?, ??, {1,3}?

Array Format

Luthor generates two array types:

  • Range arrays: Use min/max pairs [65,90] for [A-Z] (more compact for large character sets)
  • Non-range arrays: List individual codepoints (better for sparse patterns)

They are distinguished by having an even length (range arrays), or an odd length (non range arrays), and may be padded with a single -1 at the end to enforce the length.

Array Structure

[accept_id, anchor_mask, transition_count, ...]

For each state:

  1. accept_id: Token ID (-1 if non-accepting)
  2. anchor_mask: 1=line start ^, 2=line end $
  3. transition_count: Number of destination states
  4. For each destination:
    • dest_state_index: Target state index
    • transition_entry__count: Number of entries
    • transition_entry_: Either [min,max] pairs (range array) or individual codepoints (non range array)

Example Breakdown

// Pattern: [A-Z__]+
int dfa[] = {
    -1, 0, 1,        // State 0: not accepting, no anchors, 1 transition
    11, 3,           // -> jump to index 11 on 3 ranges:
    65, 90,          //    A-Z (65-90)
    95, 95,          //    _ (95)  
    97, 122,         //    a-z (97-122)
    
    0, 0, 1,         // State 11: accepting (token 0), no anchors, 1 transition
    11, 4,           // -> jump to index 11 on 4 ranges:
    48, 57,          //    0-9 (48-57)
    65, 90,          //    A-Z (65-90)  
    95, 95,          //    _ (95)
    97, 122          //    a-z (97-122)
};

Using the Generated Arrays

Simple C Lexer (Ranged arrays)

#include <stdio.h>
#include <stdbool.h>

static int lex_ranged(const char** data, const int* dfa) {
    const char* sz = *data;
    int current_state_index = 0;
    bool at_line_start = true;
    bool at_line_end = (*sz == '\0') || (*sz != '\0' && sz[1] == '\0' && *sz == '\n');

    while (*sz != '\0' || at_line_end) {
        int machine_index = current_state_index;
        int accept_id = dfa[machine_index++];
        int anchor_mask = dfa[machine_index++];
        int transition_count = dfa[machine_index++];

        // SPECIAL CASE: Check for $ anchor before newline
        if (accept_id != -1 && (anchor_mask & 2) && *sz == '\n') {
            bool anchor_valid = true;
            if ((anchor_mask & 1) && !at_line_start) {
                anchor_valid = false;
            }
            // End anchor is valid since we're checking before newline
            if (anchor_valid) {
                *data = sz;
                return accept_id;
            }
        }

        for (int t = 0; t < transition_count; t++) {
            int dest_state_index = dfa[machine_index++];
            int range_count = dfa[machine_index++];
            for (int r = 0; r < range_count; r++) {
                int min = dfa[machine_index++];
                int max = dfa[machine_index++];
                if (min >= 0 && *sz != '\0') {
                    int c = (unsigned char)*sz;
                    if (c < min) {
                        machine_index += ((range_count - (r + 1)) * 2);
                        break;
                    }
                    if (c <= max) {
                        current_state_index = dest_state_index;
                        sz++;
                        at_line_end = (*sz == '\0') || (*sz == '\n');
                        at_line_start = (c == '\n');
                        goto found;
                    }
                }
            }
        }
        break;
    found:
        accept_id = dfa[current_state_index];
        anchor_mask = dfa[current_state_index + 1];
        if (accept_id != -1) {
            bool anchor_valid = true;
            if ((anchor_mask & 1) && !at_line_start) {
                anchor_valid = false;
            }
            if ((anchor_mask & 2) && !at_line_end) {
                anchor_valid = false;
            }

            if (anchor_valid) {
                *data = sz;
                return accept_id;
            }
        }
        if (*sz == '\0') break;
    }
    *data = sz;
    return -1;  // No valid match found
}

int main() {
    static const int identifier_dfa[] = {
        -1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122,
        0, 0, 1, 11, 4, 48, 57, 65, 90, 95, 95, 97, 122
    };
    
    const char* input = "hello_world123";
    const char* pos = input;
    
    int token = lex_ranged(&pos, identifier_dfa);
    printf("Matched token %d, consumed: '%.*s'\n", 
           token, (int)(pos - input), input);
    
    return 0;
}

Multi-Language Support

The integer arrays work in any language:

Rust:

const DFA: &[i32] = &[-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, /*...*/];

Python:

dfa = [-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, ...]

JavaScript:

const dfa = [-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, /*...*/];

CLI Reference

luthor

A DFA lexer generator tool

Usage:

luthor <input> [ --enc <encoding> ] [ --array <array> ] [ --graph <graph> ] [ --draft <draft> ]

<input>        The input expression or file to use
<encoding>     The encoding to use (ASCII, UTF-8, UTF-16, or UTF-32, or a single byte encoding). Defaults to UTF-8
<array>        The type of array to generate, ranged or unranged. Defaults to auto which chooses the shorted length
<graph>        Generate a DFA state graph to the specified file (requires GraphViz)
<draft>        Generate a DFA state graph draft to the specified file (requires GraphViz)

luthor [ --? ]

--?            Displays this help screen
# Generate UTF-8 lexer for C identifiers
luthor "[A-Za-z_][A-Za-z0-9_]*"

# Create lexer from grammar file  
luthor c_lexer.txt --enc UTF-16

# Generate lexer with state diagram
luthor json_lexer.txt --graph json_states.png

Performance Characteristics

  • Memory: Compact arrays, typically KB-sized even for complex lexers
  • Speed: Simple array traversal, no backtracking
  • Predictable: DFA guarantees O(n) matching time
  • Embeddable: No external dependencies, just integer arrays

License

MIT License - Copyright (c) 2025 honey the codewitch

Lazy matching implementation based on Dr. Robert van Engelen's algorithm from RE/FLEX (all rights reserved to Dr. van Engelen)

Product Compatible and additional computed target framework versions.
.NET net8.0 is compatible.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed.  net9.0 was computed.  net9.0-android was computed.  net9.0-browser was computed.  net9.0-ios was computed.  net9.0-maccatalyst was computed.  net9.0-macos was computed.  net9.0-tvos was computed.  net9.0-windows was computed.  net10.0 was computed.  net10.0-android was computed.  net10.0-browser was computed.  net10.0-ios was computed.  net10.0-maccatalyst was computed.  net10.0-macos was computed.  net10.0-tvos was computed.  net10.0-windows was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.

This package has no dependencies.

Version Downloads Last Updated
1.8.0 116 7/30/2025
1.7.0 114 7/30/2025
1.6.0 112 7/30/2025
1.5.0 113 7/30/2025
1.4.0 112 7/30/2025
1.3.0 94 7/28/2025
1.2.0 94 7/28/2025
1.1.0 93 7/27/2025
1.0.0 92 7/27/2025