DistSharp 0.1.1


DistSharp


Roslyn-aware synthetic-data pipelines for .NET. DistSharp parses your .sln/.csproj with the real C# compiler, samples symbols by complexity, and turns them into instruction-tuning datasets via OpenAI / Anthropic / Gemini / Ollama / LM Studio — then optionally judges quality, deduplicates, and exports to JSONL, Parquet, CSV, Alpaca, ShareGPT, or pushes to Hugging Face Hub.

Inspired by distilabel and built for the .NET ecosystem. Ships as a NuGet global tool — zero install via dnx on any machine with .NET 10+.

dnx DistSharp inspect ./MyApp.sln
dnx DistSharp models --provider openai            # list available models
dnx DistSharp generate ./MyApp.sln --max-rows 5000 --provider openai
dnx DistSharp pipeline run ./distsharp.yaml
dnx DistSharp export ./distsharp-out --format alpaca --hf-repo myorg/dataset


Why DistSharp

Generic synthetic-data tools treat C# as plain text. DistSharp uses Roslyn to extract semantic information that's then composed into prompts:

  • Fully-qualified type and method names (MyApp.Services.OrderService<T>.PlaceOrderAsync)
  • Method body, signature, and existing XML documentation
  • Cyclomatic complexity (real, computed via a CSharpSyntaxWalker)
  • Containing type, namespace, file path
  • Symbol kind: method / property / class / interface / record / record-struct / enum / constructor
  • Filters: include/exclude tests, skip generated files, namespace exclusions, min-complexity gate

With that, prompts can reference real type names and real edge cases instead of relying on text heuristics, which keeps the generated rows grounded in the codebase they came from.
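To make the complexity metric concrete, here is a minimal sketch of the usual definition: one plus the number of decision points. It walks Python source with the standard `ast` module purely as an analogue. DistSharp itself uses a CSharpSyntaxWalker over C# syntax, and exactly which node kinds it counts is not stated here, so the node list below is an assumption.

```python
import ast

# Node types that each add one independent path through the code
# (an illustrative choice, not DistSharp's actual rule set).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                 ast.ExceptHandler, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Return 1 + the number of decision points found in `source`."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def place_order(order, stock):
    if order is None or not order.items:
        raise ValueError("empty order")
    for item in order.items:
        if stock.get(item, 0) <= 0:
            return False
    return True
'''

print(cyclomatic_complexity(SAMPLE))  # 5
```

A method scoring 5 would pass the default min-complexity gate of 3, while a straight-line method scores 1 and would be skipped.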


Install

The dnx runner is built into the .NET 10 SDK. It downloads and caches the latest DistSharp package on first use.

# Latest
dnx DistSharp generate ./MyApp.sln --out-dir ./datasets

# Pin a specific version
dnx DistSharp@0.1.0 generate ./MyApp.sln --out-dir ./datasets

Install as a global tool

dotnet tool install -g DistSharp
distsharp generate ./MyApp.sln --out-dir ./datasets

Install as a local (per-repo) tool

dotnet new tool-manifest      # if you don't already have a .config/dotnet-tools.json
dotnet tool install DistSharp
dotnet distsharp generate ./MyApp.sln --out-dir ./datasets

Requirements

| Requirement | Minimum |
|---|---|
| .NET SDK | 10.0+ |
| Memory | 4 GB RAM (8 GB recommended for solutions with 1k+ classes) |
| Disk | ~500 MB for the .NET 10 SDK cache + your dataset output |

Quick start

# 1. Preview what DistSharp will analyse (no LLM cost)
dnx DistSharp inspect ./MyApp.sln --report ./inspect.json

# 2. Set your provider key
export OPENAI_API_KEY="sk-..."

# 3. Generate an explanation dataset
dnx DistSharp generate ./MyApp.sln \
  --provider openai \
  --model gpt-4.1-mini \
  --dataset-type explanation \
  --max-rows 5000 \
  --workers 4 \
  --out-dir ./training-data

# 4. Export to Alpaca format for Axolotl
dnx DistSharp export ./training-data \
  --format alpaca \
  --out-dir ./alpaca-data

For richer pipelines (LLM-as-judge filtering, MinHash deduplication, multiple branches) use a YAML config — see Pipeline configuration.


Commands

inspect

Analyses a solution and reports what DistSharp can see — no LLM calls.

dnx DistSharp inspect <solution> [--report <path>] [--show-files] [--show-symbols] [--include-tests]

Output: a Spectre.Console table with file count, symbol counts per kind, and average method complexity. --report writes the same data plus every symbol as JSON.

generate

The primary command. Builds a default extract → sample → llm pipeline and runs it.

dnx DistSharp generate <solution> [options]

| Option | Default | Description |
|---|---|---|
| --out-dir <path> | ./distsharp-out | Where to write the dataset |
| --max-rows <n> | 50000 | Maximum total rows to generate |
| --dataset-type <type> | mixed | explanation / completion / bug-fix / unit-test / docstring / refactor / architecture-qa / mixed |
| --format <fmt> | jsonl | jsonl / parquet / csv |
| --provider <name> | openai | See LLM providers |
| --model <name> | provider default (see below) | e.g. gpt-4.1-mini, claude-sonnet-4-6, gemini-2.5-flash, qwen2.5-coder:32b. Run distsharp models --provider <name> if unsure. |
| --include-tests | false | Include test projects in analysis |
| --include-generated | false | Include *.g.cs and similar |
| --min-complexity <n> | 3 | Skip methods below this cyclomatic complexity |
| --exclude-namespaces <ns> | | Comma-separated namespace prefixes to skip |
| --workers <n> | 4 | Parallel LLM workers |
| --seed <n> | | Deterministic sampling |
| --dry-run | false | Analyse and sample only; skip LLM calls. The sampled symbols are written to disk so you can preview coverage before spending tokens. |

models

Lists model IDs available from a provider's discovery endpoint. Useful when you don't remember the exact model name (the LLM providers change models often).

dnx DistSharp models --provider openai
dnx DistSharp models --provider anthropic --filter haiku
dnx DistSharp models --provider gemini
dnx DistSharp models --provider ollama

--filter <text> is a case-insensitive substring filter on the model ID.

Supported providers: openai, anthropic, gemini, azure-openai, ollama, lmstudio, openai-compatible. Each hits its native models endpoint (/v1/models, /v1beta/models, etc.) using the same auth as completion calls.

init

Scaffolds a distsharp.yaml from a template.

dnx DistSharp init [--name <name>] [--template <name>] [--output <path>]

Templates: dotnet-mixed, dotnet-explanation, dotnet-unit-test, custom.

pipeline run

Executes a pipeline defined in YAML.

dnx DistSharp pipeline run <config.yaml> [--out-dir <path>] [--max-rows <n>]

Use this when you want LLM-as-judge filtering, MinHash deduplication, multiple branches, or anything beyond the default pipeline.

export

Converts a JSONL dataset to another format and/or uploads to Hugging Face Hub.

dnx DistSharp export <dataset-dir> [options]

| Option | Description |
|---|---|
| --format <fmt> | jsonl / alpaca / sharegpt |
| --out-dir <path> | Local output directory |
| --hf-repo <repo> | Hugging Face dataset repo (org/name); created if missing |
| --hf-token <token> | HF API token (or set HF_TOKEN) |
| --split <name> | Dataset split name (default train) |
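
As a rough illustration of what an Alpaca conversion involves, the sketch below maps a row with instruction, context, and response fields onto the standard Alpaca keys (instruction, input, output). The mapping of context to input is an assumption made for this example; the converter that DistSharp ships may handle fields differently.

```python
import json


def to_alpaca(row: dict) -> dict:
    """Map one row onto Alpaca keys (context -> input is an assumed mapping)."""
    return {
        "instruction": row["instruction"],
        "input": row.get("context", ""),
        "output": row["response"],
    }


def convert_lines(jsonl_text: str) -> list[dict]:
    """Convert JSONL text, one JSON object per non-empty line."""
    return [
        to_alpaca(json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


# Hypothetical row, shaped like the `explanation` dataset type.
example = json.dumps({
    "instruction": "Explain what this method does.",
    "context": "public Task PlaceOrderAsync(Order order) { ... }",
    "response": "It validates the order and persists it.",
})

print(convert_lines(example)[0]["input"])
```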

Dataset types

Each type has a dedicated prompt builder under src/DistSharp.Core/Prompts/.

| Type | What it produces | Key fields |
|---|---|---|
| explanation | Free-text explanations of methods/types | instruction, context, response |
| completion | Half-of-body prompt → completion | prompt, completion |
| bug-fix | LLM-introduced bug + corrected version + explanation | buggy_code, fixed_code, explanation |
| unit-test | xUnit test grounded in the real signature | instruction, context, response |
| docstring | XML /// documentation comments | instruction, code, response |
| refactor | Refactored higher-complexity method + rationale | original_code, refactored_code, explanation |
| architecture-qa | Question/answer pair about a type's role | question, answer |
| mixed | Currently selects explanation; fan-out across all types is on the roadmap | |

JSON-output types (bug-fix, refactor, architecture-qa) parse the LLM response and drop rows where parsing fails. Other types pass the raw response through.
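
The parse-or-drop behaviour for JSON-output types can be sketched as follows. The field names come from the table of dataset types; checking that every expected field is present is an assumption of this sketch, since the text only says that rows are dropped when parsing fails.

```python
import json

# Expected keys per JSON-output dataset type.
REQUIRED_FIELDS = {
    "bug-fix": ("buggy_code", "fixed_code", "explanation"),
    "refactor": ("original_code", "refactored_code", "explanation"),
    "architecture-qa": ("question", "answer"),
}


def parse_row(dataset_type: str, raw_response: str):
    """Return a dict of the expected fields, or None if the row should be dropped."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # not valid JSON: drop the row
    if not isinstance(data, dict):
        return None
    fields = REQUIRED_FIELDS[dataset_type]
    if any(name not in data for name in fields):
        return None  # assumed extra check: a required field is missing
    return {name: data[name] for name in fields}


responses = ['{"question": "What does OrderService do?", "answer": "It places orders."}',
             "Sure! Here is your JSON: {oops"]
rows = [r for r in (parse_row("architecture-qa", text) for text in responses) if r]
print(len(rows))  # 1
```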


LLM providers

Set with --provider / --model (or in the YAML config).

| Provider | --provider value | Auth |
|---|---|---|
| OpenAI | openai | OPENAI_API_KEY |
| Anthropic | anthropic | ANTHROPIC_API_KEY |
| Azure OpenAI | azure-openai | AZURE_OPENAI_ENDPOINT + AZURE_OPENAI_API_KEY |
| Google Gemini | gemini | GEMINI_API_KEY |
| Ollama (local) | ollama | OLLAMA_BASE_URL (default http://localhost:11434) |
| LM Studio (local) | lmstudio | LMSTUDIO_BASE_URL (default http://localhost:1234) |
| Any OpenAI-compatible endpoint | openai-compatible | OPENAI_COMPATIBLE_BASE_URL + OPENAI_COMPATIBLE_API_KEY |

Default models when --model is omitted:

| Provider | Default model |
|---|---|
| openai | gpt-4.1-mini |
| anthropic | claude-haiku-4-5 |
| gemini | gemini-2.5-flash |
| azure-openai, ollama, lmstudio, openai-compatible | no default; pass --model (deployment name on Azure, tag on Ollama, etc.) |

All providers share a single retry helper that handles HTTP 408/429/500/502/503/504 with exponential backoff, honouring Retry-After when present. Authentication, request shape, and response parsing live in provider-specific classes under src/DistSharp.Providers/.
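
The retry policy described above might look roughly like this. It is a sketch, not the DistSharp helper: the base delay, the cap, the attempt limit, and the `send` callback shape are all assumptions, and only the numeric-seconds form of Retry-After is handled (the header may also carry an HTTP date).

```python
import time

# Status codes the text lists as retryable.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0):
    """Seconds to wait before the next attempt (attempt is 0-based)."""
    if retry_after is not None:
        return float(retry_after)         # the server's hint wins
    return min(cap, base * 2 ** attempt)  # 1, 2, 4, 8, ... capped


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUS or last_attempt:
            return status, body
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. A production version would normally add random jitter to the computed delay.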

When the API returns a model-not-found error (HTTP 404 with a known signal), DistSharp surfaces the provider's own message and appends a hint:

HTTP 404 from openai: The model `gpt5.4` does not exist or you do not have access to it.
  — run 'distsharp models --provider openai' to list available models.

Pipeline configuration

distsharp.yaml is bound into a strongly-typed PipelineConfig. Snake-case keys are mapped to PascalCase properties automatically, so the natural YAML style works.

name: my-dotnet-pipeline
version: "1"

solution:
  path: ./MyApp.sln
  include_tests: false
  include_generated: false
  min_complexity: 3
  exclude_namespaces:
    - MyApp.Migrations

steps:
  - name: extract_symbols
    type: RoslynSymbolExtractor
    config:
      symbol_kinds: [method]
      min_complexity: 3

  - name: sample
    type: StratifiedSampler
    depends_on: [extract_symbols]
    config:
      max_rows: 5000
      strategy: complexity_weighted
      seed: 42

  - name: generate
    type: LlmStep
    depends_on: [sample]
    config:
      provider: openai
      model: gpt-4.1-mini
      dataset_type: explanation
      workers: 4
      temperature: 0.7

  - name: judge
    type: LlmJudge
    depends_on: [generate]
    config:
      provider: anthropic
      model: claude-opus-4-7
      min_score: 3.5
      rubric: helpfulness_and_correctness

  - name: dedup
    type: MinHashDeduplicator
    depends_on: [judge]
    config:
      field: response
      threshold: 0.85

output:
  dir: ./training-data
  format: jsonl
  write_metadata: true
  checkpoint_every: 1000

Available step types: RoslynSymbolExtractor, StratifiedSampler, LlmStep, LlmJudge, MinHashDeduplicator.

Sampling strategies: uniform (Fisher–Yates), complexity_weighted (Efraimidis–Spirakis A-Res weighted reservoir).
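
Both strategies are standard algorithms, and a compact sketch helps show how they differ. The version below is a generic Python rendering under stated assumptions: the function names and the weight callback are hypothetical, and how DistSharp derives weights from complexity is not described here.

```python
import heapq
import random


def fisher_yates_sample(items, k, rng):
    """Uniform sample of k items via a partial Fisher-Yates shuffle."""
    pool = list(items)
    k = min(k, len(pool))
    for i in range(k):
        j = rng.randrange(i, len(pool))  # pick from the not-yet-fixed tail
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]


def weighted_reservoir_sample(items, weight, k, rng):
    """Efraimidis-Spirakis A-Res: keep the k items with the largest u ** (1 / w)."""
    if k <= 0:
        return []
    heap = []  # min-heap of (key, index, item); index breaks ties
    for index, item in enumerate(items):
        w = weight(item)
        if w <= 0:
            continue  # zero-weight items can never be selected
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, index, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, index, item))
    return [item for _, _, item in heap]


# Hypothetical (method name, cyclomatic complexity) pairs.
methods = [("Parse", 12), ("ToString", 1), ("Validate", 7), ("Dispose", 0)]
picked = weighted_reservoir_sample(methods, lambda m: m[1], 2, random.Random(42))
print(len(picked))  # 2
```

Because the reservoir only needs a single pass and a heap of size k, the weighted strategy works on a stream of symbols without knowing the total weight in advance. Passing a seeded `random.Random` makes a run repeatable, which is the role `seed` plays in the configuration.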

Judge rubrics: helpfulness_and_correctness (default), code_quality. Or pass a full prompt via prompt:.
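
For the dedup step, the idea behind MinHash with a similarity threshold can be sketched as below. The shingle size, the number of hash functions, and the pairwise comparison loop are assumptions chosen for brevity; the real MinHashDeduplicator may tokenise differently or use LSH banding to avoid comparing every pair.

```python
import hashlib


def shingles(text, size=3):
    """Set of overlapping word n-grams (lower-cased)."""
    tokens = text.lower().split()
    if len(tokens) < size:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def minhash_signature(text, num_hashes=128):
    """For each salted hash function, keep the minimum hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash
        signature.append(min(
            int.from_bytes(hashlib.blake2b(b, digest_size=8, salt=salt).digest(), "big")
            for b in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(rows, field="response", threshold=0.85):
    """Keep a row only if it is not too similar to any row already kept."""
    kept, signatures = [], []
    for row in rows:
        sig = minhash_signature(row[field])
        if any(estimated_jaccard(sig, seen) >= threshold for seen in signatures):
            continue
        kept.append(row)
        signatures.append(sig)
    return kept
```

With `field: response` and `threshold: 0.85`, two rows whose responses share roughly 85% or more of their shingles would collapse to the first one seen.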


Architecture

┌─────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│ DistSharp.Cli   │      │ DistSharp.Core   │      │ DistSharp.Roslyn │
│                 │      │                  │      │                  │
│ Program.cs      │      │ Row              │      │ RoslynSolution-  │
│ HostBuilder     │      │ IStep            │◄─────│   Analyzer       │
│ PipelineBuilder │─────►│ IPipelineExec    │      │ SymbolExtractor  │
│ Yaml provider   │      │ PipelineExecutor │      │ RoslynSymbol-    │
│ Spectre live UI │      │ IDatasetWriter   │      │   ExtractorStep  │
│ 6 commands      │      │ ICheckpointStore │      │ StratifiedSampler│
└─────────────────┘      │ LlmStep          │      └──────────────────┘
        │                │ LlmJudge         │
        │                │ MinHashDedup     │      ┌──────────────────┐
        │                │ Writers (JSONL,  │      │ DistSharp.       │
        │                │  CSV, Parquet)   │      │   Providers      │
        ├───────────────►│ Export converters│◄─────│                  │
        │                │ HuggingFaceClient│      │ ILlmProvider     │
        │                │ 7 prompt builders│      │ OpenAI / Azure / │
        │                └──────────────────┘      │  Gemini / Ollama │
        │                                          │  / LM Studio /   │
        │                                          │  OpenAI-compat / │
        │                                          │  Anthropic       │
        │                                          │ LlmProviderFactory│
        └─────────────────────────────────────────►└──────────────────┘

Pipeline execution model

Steps communicate over System.Threading.Channels. PipelineExecutor:

  1. Topological sort — DFS with cycle detection. Throws InvalidOperationException on cycles.
  2. Channel creation — bounded Channel<Row> (capacity 1000) per step output. Source steps get a pre-completed empty input channel.
  3. Fan-out — a single "distributor" task per upstream step reads its output and copies each row to every consumer's input channel.
  4. Fan-in — multi-writer bounded channel; a per-channel countdown latch (Interlocked.Decrement on fanInCounters[]) completes the writer when all distributors are done.
  5. Sink — the terminal step's distributor drains directly to IDatasetWriter.WriteAsync, then FlushAsync is called once Task.WhenAll resolves.
  6. Backpressure — bounded channels propagate slowness upstream automatically: a slow writer slows the last step, which slows earlier steps.
  7. Workers within LlmStep — N concurrent tasks reading the same input channel and writing the same output channel.
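
Step 1 of the list above, a depth-first topological sort with cycle detection, can be sketched like this. It is a generic Python rendering, not the PipelineExecutor code: the function name is hypothetical, and it raises ValueError where the C# implementation is described as throwing InvalidOperationException.

```python
def topological_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Order steps so each one comes after everything it depends on."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {name: UNVISITED for name in depends_on}
    order = []

    def visit(name):
        if state[name] == DONE:
            return
        if state[name] == IN_PROGRESS:
            # Reaching a step that is still on the DFS stack means a cycle.
            raise ValueError(f"cycle detected at step '{name}'")
        state[name] = IN_PROGRESS
        for dependency in depends_on[name]:
            visit(dependency)
        state[name] = DONE
        order.append(name)  # appended only after all dependencies

    for name in depends_on:
        visit(name)
    return order


# The depends_on graph from the example distsharp.yaml, listed out of order.
steps = {
    "dedup": ["judge"],
    "judge": ["generate"],
    "generate": ["sample"],
    "sample": ["extract_symbols"],
    "extract_symbols": [],
}
print(topological_order(steps))
# ['extract_symbols', 'sample', 'generate', 'judge', 'dedup']
```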

Solution layout

DistSharp/
├── src/
│   ├── DistSharp.Core/        # Pipeline engine, abstractions, step library, writers, prompts
│   ├── DistSharp.Roslyn/      # Roslyn symbol extraction (isolates Microsoft.CodeAnalysis.*)
│   ├── DistSharp.Providers/   # All LLM providers (OpenAI, Anthropic, Gemini, Ollama, ...)
│   └── DistSharp.Cli/         # CLI entry point, command handlers, Spectre.Console UI
├── tests/                     # xUnit + NSubstitute + FluentAssertions per project
├── docs/                      # Design specs and implementation plans
└── .comparison/               # Direct comparison harness vs distilabel

Dependency rules: Core never references Roslyn or any LLM SDK. Roslyn depends on Core. Providers depends on Core. Cli depends on all three. This keeps abstractions clean and forces the Roslyn/LLM SDK weight to live behind interfaces.


Comparison with distilabel

DistSharp and distilabel are not direct competitors — distilabel is a general-purpose synthetic-data framework; DistSharp is a purpose-built .NET code-dataset generator.

A direct comparison was run on c:\Development\ai-roi\AiRoi.sln with both pipelines processing the same 10 methods through the same gpt-4.1-mini model with identical prompts:

| Test | DistSharp | distilabel |
|---|---|---|
| Output equivalence | avg 651 chars | avg 640 chars (within 2%) |
| LLM throughput (10 rows, 4 workers/batch=4) | ~45 s | 66 s |
| Pipeline definition size | 31 LoC (YAML) | 97 LoC (Python) |
| Roslyn / cyclomatic-complexity analysis | native | ❌ not supported |
| LLM provider coverage | 7 providers | 7+ providers |
| LLM-as-judge | LlmJudge step | UltraFeedback + variants |
| MinHash dedup | built-in | built-in |
| Output formats | JSONL / CSV / Parquet / Alpaca / ShareGPT | JSONL / CSV / Parquet / Alpaca / ShareGPT |
| Hugging Face Hub upload | export --hf-repo | Distiset.push_to_hub |
| Distribution | NuGet global tool | pip / uv |
| Argilla integration | | first-class |
| Multi-modal (image/audio) | | |
| vLLM / TGI / SGLang servers | via openai-compatible | first-class |
| Built-in step library | 5 steps | 50+ steps |

Conclusion: for .NET code-dataset generation specifically, DistSharp matches distilabel's output quality with ~3× less config and adds semantic analysis distilabel architecturally cannot do. For everything else — preference datasets, multi-modal, embeddings, Argilla workflows — use distilabel directly.
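
The headline ratios in that conclusion follow directly from the table; the arithmetic is spelled out here using only the figures reported above (the DistSharp timing is given as approximately 45 s, so the last ratio is approximate too).

```python
# Figures copied from the comparison table.
distsharp_chars, distilabel_chars = 651, 640
distsharp_loc, distilabel_loc = 31, 97
distsharp_seconds, distilabel_seconds = 45, 66

length_gap = abs(distsharp_chars - distilabel_chars) / distilabel_chars
config_ratio = distilabel_loc / distsharp_loc
time_ratio = distilabel_seconds / distsharp_seconds

print(f"{length_gap:.1%}")     # 1.7%
print(f"{config_ratio:.1f}x")  # 3.1x
print(f"{time_ratio:.2f}x")    # 1.47x
```

With only 10 rows in the run, these numbers are best read as indicative rather than as a benchmark.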

Full comparison with raw outputs: see .comparison/COMPARISON.md.


Development

git clone https://github.com/jamesburton/DistSharp.git
cd DistSharp
dotnet build
dotnet test

131 tests across Core, Roslyn, Providers, and CLI test projects. CI runs on every push and PR (.github/workflows/ci.yml). Release publishes to NuGet on every v* git tag (.github/workflows/release.yml) — requires a NUGET_API_KEY secret on the repo.

To cut a release:

git tag v0.1.0
git push origin v0.1.0

The release workflow then runs build → test → pack → push to nuget.org → upload nupkg as a build artifact.


Roadmap

  • mixed dataset type fans out across all seven prompt builders in one run
  • Per-prompt-builder symbol-kind filter (so unit-test only ever sees methods)
  • Incremental mode — only re-generate rows for files changed since the last run
  • Cross-project context (called symbols, implemented interfaces, inheritance chain) populated on ExtractedSymbol
  • Preference datasets — ranked pairs (chosen/rejected) for DPO/ORPO
  • Embedding-based deduplication via local embedding model
  • MCP server mode — expose the running pipeline as a Model Context Protocol server
  • Cost estimation in inspect based on real token-per-symbol measurements

Acknowledgements

DistSharp's pipeline architecture is directly inspired by distilabel by Argilla. If you work in Python, use distilabel directly — it's an excellent framework with deep ecosystem integration. DistSharp exists to bring those ideas to the .NET world with native Roslyn integration.


Licence

MIT


| Version | Downloads | Last updated |
|---|---|---|
| 0.1.1 | 180 | 5/18/2026 |
| 0.1.0 | 101 | 5/17/2026 |