Luthor-tool
1.8.0
dotnet tool install --global Luthor-tool --version 1.8.0
dotnet new tool-manifest
dotnet tool install --local Luthor-tool --version 1.8.0
#tool dotnet:?package=Luthor-tool&version=1.8.0
nuke :add-package Luthor-tool --version 1.8.0
Luthor
A fast, compact lexer generator that produces simple integer arrays for efficient text matching in any programming language.
Key Features:
- Direct-to-DFA conversion (no intermediate NFA)
- Partial lazy quantifier support (
??
,*?
,+?
) - all expressions are accepted but due to limitations of DFA traversal some complicated lazy matches may end up partly or entirely greedy. - Unicode support (UTF-8, UTF-16, UTF-32)
- Compact array output suitable for embedded systems
- Language-agnostic - use the arrays in C, C++, Rust, etc.
What Luthor Does
Luthor takes regular expressions or lexer grammars and converts them into simple integer arrays that can be embedded in any program. Instead of shipping a complex regex engine, you get a small array that can be walked with a simple matching loop.
Example: The pattern [A-Za-z_][A-Za-z0-9_]*
becomes:
int dfa[] = {-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, 0, 0, 1, 11, 4, 48, 57, 65, 90, 95, 95, 97, 122};
Technical Background
Luthor implements several advanced algorithms:
- Direct-to-DFA construction using the Aho-Sethi-Ullman algorithm (Dragon Book)
- Lazy matching using Dr. Robert van Engelen's algorithm from RE/FLEX
- Hopcroft minimization for optimal state count
- Unicode range splitting for efficient multi-byte encoding support
Acknowledgements
Special thanks to Dr. Robert van Engelen for his guidance on implementing lazy matching in pure DFA form.
Quick Start
Installation
dotnet tool install luthor-tool -g
Basic Usage
# Single expression
luthor "[0-9]+"
Lexer from file
luthor mylexer.txt
Generate state diagram
luthor mylexer.txt --graph output.png
Lexer Input Format
Single Expression
For a single regular expression, just provide the pattern:
luthor "if|while|for"
Lexer Grammar
For a full lexer, create a text file with one rule per line:
# C-like Language Lexer
# Comments start with # (must have space or be alone)
# Keywords
if|while|for|int|void
# Identifiers
[A-Za-z_][A-Za-z0-9_]*
# Integers (decimal and hex)
0x[0-9a-fA-F]+|[0-9]+
# C-style comments
/\*(.|\n)*?\*/
# Line comments
//.*$
# Whitespace (usually ignored)
[ \t\n\r]+
Rules:
- Each expression on its own line
- Comments:
# text
or#
alone.#foo
(no space) will be interpreted as a regular expression - Accept IDs assigned by line order (0, 1, 2, ...)
- Supports POSIX character classes: [[:digit:]], [[:alpha:]], etc.
- Lazy quantifiers: *?, +?, ??, {1,3}?
Array Format
Luthor generates two array types:
- Range arrays: Use min/max pairs [65,90] for [A-Z] (more compact for large character sets)
- Non-range arrays: List individual codepoints (better for sparse patterns)
They are distinguished by having an even length (range arrays), or an odd length (non range arrays), and may be padded with a single -1 at the end to enforce the length.
Array Structure
[accept_id, anchor_mask, transition_count, ...]
For each state:
- accept_id: Token ID (-1 if non-accepting)
- anchor_mask: 1=line start ^, 2=line end $
- transition_count: Number of destination states
- For each destination:
- dest_state_index: Target state index
- transition_entry__count: Number of entries
- transition_entry_: Either [min,max] pairs (range array) or individual codepoints (non range array)
Example Breakdown
// Pattern: [A-Z__]+
int dfa[] = {
-1, 0, 1, // State 0: not accepting, no anchors, 1 transition
11, 3, // -> jump to index 11 on 3 ranges:
65, 90, // A-Z (65-90)
95, 95, // _ (95)
97, 122, // a-z (97-122)
0, 0, 1, // State 11: accepting (token 0), no anchors, 1 transition
11, 4, // -> jump to index 11 on 4 ranges:
48, 57, // 0-9 (48-57)
65, 90, // A-Z (65-90)
95, 95, // _ (95)
97, 122 // a-z (97-122)
};
Using the Generated Arrays
Simple C Lexer (Ranged arrays)
#include <stdio.h>
#include <stdbool.h>
static int lex_ranged(const char** data, const int* dfa) {
const char* sz = *data;
int current_state_index = 0;
bool at_line_start = true;
bool at_line_end = (*sz == '\0') || (*sz != '\0' && sz[1] == '\0' && *sz == '\n');
while (*sz != '\0' || at_line_end) {
int machine_index = current_state_index;
int accept_id = dfa[machine_index++];
int anchor_mask = dfa[machine_index++];
int transition_count = dfa[machine_index++];
// SPECIAL CASE: Check for $ anchor before newline
if (accept_id != -1 && (anchor_mask & 2) && *sz == '\n') {
bool anchor_valid = true;
if ((anchor_mask & 1) && !at_line_start) {
anchor_valid = false;
}
// End anchor is valid since we're checking before newline
if (anchor_valid) {
*data = sz;
return accept_id;
}
}
for (int t = 0; t < transition_count; t++) {
int dest_state_index = dfa[machine_index++];
int range_count = dfa[machine_index++];
for (int r = 0; r < range_count; r++) {
int min = dfa[machine_index++];
int max = dfa[machine_index++];
if (min >= 0 && *sz != '\0') {
int c = (unsigned char)*sz;
if (c < min) {
machine_index += ((range_count - (r + 1)) * 2);
break;
}
if (c <= max) {
current_state_index = dest_state_index;
sz++;
at_line_end = (*sz == '\0') || (*sz == '\n');
at_line_start = (c == '\n');
goto found;
}
}
}
}
break;
found:
accept_id = dfa[current_state_index];
anchor_mask = dfa[current_state_index + 1];
if (accept_id != -1) {
bool anchor_valid = true;
if ((anchor_mask & 1) && !at_line_start) {
anchor_valid = false;
}
if ((anchor_mask & 2) && !at_line_end) {
anchor_valid = false;
}
if (anchor_valid) {
*data = sz;
return accept_id;
}
}
if (*sz == '\0') break;
}
*data = sz;
return -1; // No valid match found
}
int main() {
static const int identifier_dfa[] = {
-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122,
0, 0, 1, 11, 4, 48, 57, 65, 90, 95, 95, 97, 122
};
const char* input = "hello_world123";
const char* pos = input;
int token = lex_ranged(&pos, identifier_dfa);
printf("Matched token %d, consumed: '%.*s'\n",
token, (int)(pos - input), input);
return 0;
}
Multi-Language Support
The integer arrays work in any language:
Rust:
const DFA: &[i32] = &[-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, /*...*/];
Python:
dfa = [-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, ...]
JavaScript:
const dfa = [-1, 0, 1, 11, 3, 65, 90, 95, 95, 97, 122, /*...*/];
CLI Reference
luthor
A DFA lexer generator tool
Usage:
luthor <input> [ --enc <encoding> ] [ --array <array> ] [ --graph <graph> ] [ --draft <draft> ]
<input> The input expression or file to use
<encoding> The encoding to use (ASCII, UTF-8, UTF-16, or UTF-32, or a single byte encoding). Defaults to UTF-8
<array> The type of array to generate, ranged or unranged. Defaults to auto which chooses the shorted length
<graph> Generate a DFA state graph to the specified file (requires GraphViz)
<draft> Generate a DFA state graph draft to the specified file (requires GraphViz)
luthor [ --? ]
--? Displays this help screen
# Generate UTF-8 lexer for C identifiers
luthor "[A-Za-z_][A-Za-z0-9_]*"
# Create lexer from grammar file
luthor c_lexer.txt --enc UTF-16
# Generate lexer with state diagram
luthor json_lexer.txt --graph json_states.png
Performance Characteristics
- Memory: Compact arrays, typically KB-sized even for complex lexers
- Speed: Simple array traversal, no backtracking
- Predictable: DFA guarantees O(n) matching time
- Embeddable: No external dependencies, just integer arrays
License
MIT License - Copyright (c) 2025 honey the codewitch
Lazy matching implementation based on Dr. Robert van Engelen's algorithm from RE/FLEX (all rights reserved to Dr. van Engelen)
Product | Versions Compatible and additional computed target framework versions. |
---|---|
.NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
This package has no dependencies.