WACS.WASI.NN.OnnxRuntimeGenAI 0.1.5

.NET 8.0

dotnet add package WACS.WASI.NN.OnnxRuntimeGenAI --version 0.1.5

WACS.WASI.NN.OnnxRuntimeGenAI

OnnxRuntime-GenAI backend for WACS.WASI.NN. Wraps Microsoft's generative-LLM runtime, with its first-class tokenizer, KV cache, and sampling, and surfaces it through wasi-nn as a load-by-name backend.

Where WACS.WASI.NN.OnnxRuntime serves single-shot tensor-in / tensor-out inference (image classification, embeddings, encoder-only models), this package serves generative LLMs such as Gemma 3, Llama 3, Qwen 2.5, and Phi 4: anything the upstream onnxruntime-genai model builder (onnxruntime_genai.models.builder) can produce.

Install

dotnet add package WACS.WASI.NN.OnnxRuntimeGenAI

How model resolution works

GenAI models ship as directories, not single ONNX files:

gemma-3-270m-it/
├── genai_config.json     <- required descriptor
├── tokenizer.json
├── tokenizer_config.json
├── special_tokens_map.json
├── model.onnx
└── model.onnx.data        <- external weights

Get a GenAI-ready model one of two ways:

# Pre-built from Hugging Face (recommended)
huggingface-cli download onnx-community/gemma-3-270m-it-ONNX \
    --local-dir ./models/gemma-3-270m-it

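# Or build one from a Hugging Face checkpoint with onnxruntime-genai's builder
# (-p is the precision, -e the execution provider the model is built for)
python -m onnxruntime_genai.models.builder \
    -m google/gemma-3-270m-it \
    -o ./models/gemma-3-270m-it \
    -p int4 \
    -e cpu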

Then point the backend at the parent directory that holds the model directories:

export WACS_WASINN_GENAI_DIR=./models
ls $WACS_WASINN_GENAI_DIR
# gemma-3-270m-it  qwen2.5-1.5b-instruct  phi-4-mini

wacs run --wasip2 --bind Wacs.WASI.NN.OnnxRuntimeGenAI.dll my.wasm

Each first-level subdirectory that contains a genai_config.json is registered under its directory name. A guest call to graph.load-by-name("gemma-3-270m-it") resolves to the ./models/gemma-3-270m-it/ directory.
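
The registration rule above can be sketched with nothing but the base class library. This is an illustration, not the package's implementation: `DiscoverModels` is a hypothetical helper, and the `./models` fallback and the ordinal (case-sensitive) name comparison are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the package's API: maps each first-level
// subdirectory of `root` that contains genai_config.json to its directory name.
static Dictionary<string, string> DiscoverModels(string root)
{
    // Assumption: model names are compared case-sensitively.
    var models = new Dictionary<string, string>(StringComparer.Ordinal);
    if (!Directory.Exists(root)) return models;

    foreach (var dir in Directory.EnumerateDirectories(root))
    {
        // Directories without the descriptor are skipped, not registered.
        if (File.Exists(Path.Combine(dir, "genai_config.json")))
            models[Path.GetFileName(dir)] = dir;
    }
    return models;
}

// Assumption: fall back to ./models when the variable is unset.
var root = Environment.GetEnvironmentVariable("WACS_WASINN_GENAI_DIR") ?? "./models";
foreach (var (name, path) in DiscoverModels(root))
    Console.WriteLine($"{name} -> {path}");
```

Running this against the directory you plan to export is a quick way to see which names a guest could pass to load-by-name, assuming the backend applies the rule exactly as described.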

Two compute shapes

The backend dispatches by the first input tensor's name:

`compute(["prompt" → utf-8 bytes])` → `["response" → utf-8 bytes]`

Single-shot generation. The host:

1. Decodes the UTF-8 bytes into a prompt string
2. Tokenizes it via GenAI's Tokenizer.Encode
3. Builds GeneratorParams from the env-var defaults (max_length / do_sample / temperature / top_p / top_k)
4. Runs the decode loop, keeping GenAI's KV cache warm across GenerateNextToken calls (this is where GenAI gains over raw ONNX Runtime)
5. Detokenizes the result back to a string
6. Returns the generated portion only (or the full prompt+response with WACS_WASINN_GENAI_INCLUDE_PROMPT=1)

Best for new chat / completion guests. Streaming output isn't supported: wasi-nn's compute is a single call, so the whole response arrives at once.
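
The control flow of the decode loop and the return step can be sketched without the GenAI library. Everything here is a stand-in: `Generate` is a hypothetical function, `nextToken` plays the role of one GenerateNextToken step, and the end-of-sequence id and the choice to drop it from the output are assumptions rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the host's decode loop, not the package's code.
static int[] Generate(int[] promptTokens, Func<IReadOnlyList<int>, int> nextToken,
                      int eosId, int maxLength, bool includePrompt)
{
    var tokens = new List<int>(promptTokens);
    // max_length caps prompt + response together, so a long prompt leaves
    // less room for generated tokens.
    while (tokens.Count < maxLength)
    {
        int next = nextToken(tokens);
        if (next == eosId) break;   // assumption: the EOS token is not returned
        tokens.Add(next);
    }
    // INCLUDE_PROMPT=0 returns only the tokens generated after the prompt.
    return includePrompt ? tokens.ToArray() : tokens.Skip(promptTokens.Length).ToArray();
}

// Toy "model": emits the last token + 1 and treats 6 as end-of-sequence.
var response = Generate(new[] { 1, 2, 3 }, t => t[t.Count - 1] + 1,
                        eosId: 6, maxLength: 512, includePrompt: false);
Console.WriteLine(string.Join(",", response)); // prints "4,5"
```

The sketch also shows why WACS_WASINN_GENAI_MAX_LENGTH matters for long prompts: the cap is on the combined length, not on the response alone.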

`compute(["input_ids" → int64 tensor])` → `["logits" → float32 tensor]`

Single forward pass. The host:

1. Reinterprets the int64 tensor bytes as token IDs (narrowed to int32, because GenAI uses 32-bit tokens internally)
2. Constructs a fresh Generator, appends the tokens, runs one GenerateNextToken, and extracts the logits output
3. Returns the FP32 logits tensor of shape [batch, seq_len, vocab]

Stateless — each call gets a fresh generator (KV cache wiped). The guest drives its own decode loop. Useful when an existing wasi-nn ONNX guest is already structured around per-token forward passes and you want a drop-in replacement that uses GenAI's kernels.
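
Two pieces of this path lend themselves to a small sketch: the int64-to-int32 narrowing on the host side, and the greedy pick a guest might apply to the returned logits. Both helpers are hypothetical and use only the base class library. The little-endian byte order (as in wasm linear memory), the row-major layout, and the overflow check are assumptions; the source only says the IDs are narrowed.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helpers; neither is part of the package's API.

// Reads little-endian int64 token ids and narrows each to int32, throwing
// OverflowException if an id does not fit (the check is this sketch's choice).
static int[] NarrowTokenIds(ReadOnlySpan<byte> raw)
{
    if (raw.Length % sizeof(long) != 0)
        throw new ArgumentException("byte length must be a multiple of 8", nameof(raw));
    var ids = new int[raw.Length / sizeof(long)];
    for (int i = 0; i < ids.Length; i++)
        ids[i] = checked((int)BinaryPrimitives.ReadInt64LittleEndian(
            raw.Slice(i * sizeof(long), sizeof(long))));
    return ids;
}

// Greedy pick: argmax over the vocab axis at the last sequence position of
// batch 0, for row-major logits of shape [batch, seqLen, vocab].
static int ArgmaxLastPosition(ReadOnlySpan<float> logits, int seqLen, int vocab)
{
    var last = logits.Slice((seqLen - 1) * vocab, vocab);
    int best = 0;
    for (int i = 1; i < vocab; i++)
        if (last[i] > last[best]) best = i;
    return best;
}

// seqLen = 2, vocab = 3: the last row is { 0.2, 0.1, 0.9 }.
float[] logits = { 0.5f, 0.1f, 0.0f, 0.2f, 0.1f, 0.9f };
Console.WriteLine(ArgmaxLastPosition(logits, seqLen: 2, vocab: 3)); // prints 2
```

Because each call starts from an empty KV cache, a guest driving its own loop would append the chosen token to its input_ids and resend the whole sequence on the next call.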

Configuration

Env var	Default	Description
`WACS_WASINN_GENAI_DIR`	—	Root containing GenAI model subdirectories
`WACS_WASINN_GENAI_MAX_LENGTH`	512	Hard cap on prompt+response token count
`WACS_WASINN_GENAI_DO_SAMPLE`	0	`1` enables sampling (temperature / top_p / top_k); `0` decodes greedily
`WACS_WASINN_GENAI_TEMPERATURE`	1.0	Sampling temperature (when `DO_SAMPLE=1`)
`WACS_WASINN_GENAI_TOP_P`	1.0	Nucleus sampling cutoff
`WACS_WASINN_GENAI_TOP_K`	50	Top-k truncation
`WACS_WASINN_GENAI_INCLUDE_PROMPT`	0	`1` returns prompt+response; default returns response only
`WACS_WASINN_GENAI_EP`	`cpu`	Execution provider: `auto` / `cpu` / `coreml` / `cuda` / `dml` / `rocm`
`WACS_WASINN_GENAI_CUDA_DEVICE`	0	CUDA device index (when `EP=cuda`)
`WACS_WASINN_GENAI_DML_DEVICE`	0	DirectML device index (when `EP=dml`)
`WACS_WASINN_GENAI_ROCM_DEVICE`	0	ROCm device index (when `EP=rocm`)

Library embedders pass an OnnxGenAIBackendOptions to the OnnxGenAIBackend constructor instead.
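
As a rough picture of how the generation variables in the table map to typed values, here is a hypothetical reader. It is not the package's parser: `ReadSettings` and the tuple it returns are made up for illustration, and treating unset or unparsable values as "use the documented default" is an assumption.

```csharp
#nullable enable
using System;
using System.Globalization;

// Hypothetical reader, not the package's parser. `env` looks up a variable
// and returns null when it is unset.
static (int MaxLength, bool DoSample, double Temperature, double TopP, int TopK, bool IncludePrompt)
    ReadSettings(Func<string, string?> env)
{
    // Assumption: unset or unparsable values fall back to the table's default.
    int Int(string key, int fallback) =>
        int.TryParse(env(key), NumberStyles.Integer, CultureInfo.InvariantCulture, out var v) ? v : fallback;
    double Dbl(string key, double fallback) =>
        double.TryParse(env(key), NumberStyles.Float, CultureInfo.InvariantCulture, out var v) ? v : fallback;
    bool Flag(string key) => env(key) == "1";

    return (Int("WACS_WASINN_GENAI_MAX_LENGTH", 512),
            Flag("WACS_WASINN_GENAI_DO_SAMPLE"),
            Dbl("WACS_WASINN_GENAI_TEMPERATURE", 1.0),
            Dbl("WACS_WASINN_GENAI_TOP_P", 1.0),
            Int("WACS_WASINN_GENAI_TOP_K", 50),
            Flag("WACS_WASINN_GENAI_INCLUDE_PROMPT"));
}

// Read from the real process environment and print the effective settings.
var settings = ReadSettings(Environment.GetEnvironmentVariable);
Console.WriteLine(settings);
```

Parsing with the invariant culture keeps a value such as `0.9` from being misread on machines whose locale uses a decimal comma.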

Hardware acceleration

The default is CPU; hardware acceleration is opt-in via WACS_WASINN_GENAI_EP or OnnxGenAIBackendOptions.ExecutionProvider. The CPU default is based on measurement: on osx-arm64 with gemma-3-270m-it-genai, CoreML produces correct output but runs 3-5× slower than CPU, because kernel compilation and Metal command-buffer overhead dominate the actual compute for small models. CoreML typically starts to pay off at around 1B parameters and above.

OS	`WACS_WASINN_GENAI_EP=auto` resolves to	Notes
macOS (arm64/x64)	CoreML	osx-arm64 GenAI dylib links `CoreML.framework` — no NuGet swap
Windows	DirectML	Substitute `Microsoft.ML.OnnxRuntimeGenAI.DirectML` for full coverage
Linux	CUDA then ROCm	Substitute `.Cuda` / `.Rocm` variants for native deps
Other	CPU

If appending the execution provider fails, the backend silently falls back to CPU (FallbackToCpu = true by default). Embedders that want strict behavior can set it to false, so that an EP misconfiguration surfaces as WasiNNException(RuntimeError).
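
The `auto` table and the fallback rule can be read together as a small selection procedure. The sketch below is a hypothetical reconstruction, not the package's code: `AutoCandidates` and `SelectProvider` are invented names, `tryAppend` stands in for the real "append this EP" step, and InvalidOperationException stands in for WasiNNException(RuntimeError).

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical helpers mirroring the table and fallback rule above; the
// package's real types are not used here.

// Preference order that "auto" tries on each OS, per the table.
static string[] AutoCandidates(OSPlatform os)
{
    if (os == OSPlatform.OSX) return new[] { "coreml" };
    if (os == OSPlatform.Windows) return new[] { "dml" };
    if (os == OSPlatform.Linux) return new[] { "cuda", "rocm" };
    return new[] { "cpu" };
}

// tryAppend returns false when the provider cannot be appended.
static string SelectProvider(IEnumerable<string> candidates, Func<string, bool> tryAppend, bool fallbackToCpu)
{
    foreach (var ep in candidates)
        if (ep == "cpu" || tryAppend(ep)) return ep;   // CPU is always available
    if (fallbackToCpu) return "cpu";
    // Strict mode: surface the misconfiguration instead of hiding it.
    throw new InvalidOperationException("no requested execution provider could be appended");
}

// A Linux machine without CUDA but with ROCm:
Console.WriteLine(SelectProvider(AutoCandidates(OSPlatform.Linux),
                                 ep => ep == "rocm", fallbackToCpu: true)); // prints "rocm"
```

With the default fallback the caller never sees which provider failed, which is why strict mode is useful while debugging a GPU setup.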

Enable via environment:

# Platform-best pick (CoreML on macOS, DirectML on Windows, CUDA on Linux)
WACS_WASINN_GENAI_EP=auto wacs run my.wasm --wasip2 --bind <...>

# Force a specific provider
WACS_WASINN_GENAI_EP=coreml wacs run my.wasm --wasip2 --bind <...>
WACS_WASINN_GENAI_EP=cuda WACS_WASINN_GENAI_CUDA_DEVICE=1 \
    wacs run my.wasm --wasip2 --bind <...>

# Explicitly stay on CPU (the default; leaving the variable unset also selects CPU)
WACS_WASINN_GENAI_EP=cpu wacs run my.wasm --wasip2 --bind <...>

Library embedder (typed config):

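// Resolver: maps a load-by-name name to a model directory, or returns null
// when no such directory exists.
var backend = new OnnxGenAIBackend(
    name => Directory.Exists($"./models/{name}") ? $"./models/{name}" : null,
    new OnnxGenAIBackendOptions
    {
        // Request CoreML explicitly instead of the CPU default.
        ExecutionProvider = OnnxGenAIExecutionProvider.CoreML,
        // true (the default) falls back to CPU if CoreML cannot be appended.
        FallbackToCpu = true,
    });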

Composes with WACS.WASI.NN.OnnxRuntime

Both can be loaded in the same process. This package registers only as LoadByNameBackend, leaving Backends[ONNX] for the regular OnnxBackend:

Guest call graph.load(bytes, ONNX) → WACS.WASI.NN.OnnxRuntime
Guest call graph.load-by-name("gemma-3-270m-it") → this package

wacs run --wasip2 --wasi-nn \
         --bind Wacs.WASI.NN.OnnxRuntimeGenAI.dll \
         my.wasm

Backend choice

Use case	Package
Image classification, embeddings, encoder-only LLMs (byte-loaded ONNX)	`WACS.WASI.NN.OnnxRuntime`
Generative LLMs in ONNX/GenAI format (Gemma 3, Llama 3, Qwen 2.5, Phi 4)	`WACS.WASI.NN.OnnxRuntimeGenAI` (this)
Generative LLMs in GGUF format (llama.cpp models — Metal on Apple Silicon works out of the box)	`WACS.WASI.NN.LlamaSharp`
TorchScript modules (`.pt` / `.ts`, PyTorch ecosystem)	`WACS.WASI.NN.TorchSharp`

Documentation

docs/WASI_NN_USAGE.md — unified usage guide (CLI flags, env vars, programmatic embedding, worked examples)
docs/COMPONENT_CHAINING.md
Wacs.WASI/Wacs.WASI.NN/README.md — backend matrix + package layout

License

Apache-2.0

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.


net8.0
- Microsoft.ML.OnnxRuntime (>= 1.26.0)
- Microsoft.ML.OnnxRuntimeGenAI (>= 0.13.2)
- WACS.WASI.NN (>= 0.4.0)


Version	Downloads	Last Updated
0.1.5	90	5/12/2026
0.1.4	105	5/11/2026
0.1.3	91	5/11/2026