DistSharp 0.1.1

.NET 10.0

dotnet tool install --global DistSharp --version 0.1.1

This package contains a .NET tool you can call from the shell/command line.

dotnet new tool-manifest
                    

                            if you are setting up this repo

dotnet tool install --local DistSharp --version 0.1.1

This package contains a .NET tool you can call from the shell/command line.

#tool dotnet:?package=DistSharp&version=0.1.1

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

nuke :add-package DistSharp --version 0.1.1

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

DistSharp

Roslyn-aware synthetic-data pipelines for .NET. DistSharp parses your .sln/.csproj with the real C# compiler, samples symbols by complexity, and turns them into instruction-tuning datasets via OpenAI / Anthropic / Gemini / Ollama / LM Studio — then optionally judges quality, deduplicates, and exports to JSONL, Parquet, CSV, Alpaca, ShareGPT, or pushes to Hugging Face Hub.

Inspired by distilabel and built for the .NET ecosystem. Ships as a NuGet global tool — zero install via dnx on any machine with .NET 10+.

dnx DistSharp inspect ./MyApp.sln
dnx DistSharp models --provider openai            # list available models
dnx DistSharp generate ./MyApp.sln --max-rows 5000 --provider openai
dnx DistSharp pipeline run ./distsharp.yaml
dnx DistSharp export ./distsharp-out --format alpaca --hf-repo myorg/dataset

Why DistSharp
Install
Quick start
Commands
Dataset types
LLM providers
Pipeline configuration
Architecture
Comparison with distilabel
Development
Roadmap
Acknowledgements
Licence

Why DistSharp

Generic synthetic-data tools treat C# as plain text. DistSharp uses Roslyn to extract semantic information that's then composed into prompts:

Fully-qualified type and method names (MyApp.Services.OrderService<T>.PlaceOrderAsync)
Method body, signature, and existing XML documentation
Cyclomatic complexity (real, computed via a CSharpSyntaxWalker)
Containing type, namespace, file path
Symbol kind: method / property / class / interface / record / record-struct / enum / constructor
Filters: include/exclude tests, skip generated files, namespace exclusions, min-complexity gate

With that, prompts can reference real type names and real edge cases instead of text heuristics. The resulting datasets are noticeably more grounded in the codebase they came from.

Install

Zero-install via `dnx` (recommended)

The dnx runner is built into the .NET 10 SDK. It downloads and caches the latest DistSharp package on first use.

# Latest
dnx DistSharp generate ./MyApp.sln --out-dir ./datasets

# Pin a specific version
dnx DistSharp@0.1.0 generate ./MyApp.sln --out-dir ./datasets

Install as a global tool

dotnet tool install -g DistSharp
distsharp generate ./MyApp.sln --out-dir ./datasets

Install as a local (per-repo) tool

dotnet new tool-manifest      # if you don't already have a .config/dotnet-tools.json
dotnet tool install DistSharp
dotnet distsharp generate ./MyApp.sln --out-dir ./datasets

Requirements

Requirement	Minimum
.NET SDK	10.0+
Memory	4 GB RAM (8 GB recommended for solutions with 1k+ classes)
Disk	~500 MB for the .NET 10 SDK cache + your dataset output

Quick start

# 1. Preview what DistSharp will analyse (no LLM cost)
dnx DistSharp inspect ./MyApp.sln --report ./inspect.json

# 2. Set your provider key
export OPENAI_API_KEY="sk-..."

# 3. Generate an explanation dataset
dnx DistSharp generate ./MyApp.sln \
  --provider openai \
  --model gpt-4.1-mini \
  --dataset-type explanation \
  --max-rows 5000 \
  --workers 4 \
  --out-dir ./training-data

# 4. Export to Alpaca format for Axolotl
dnx DistSharp export ./training-data \
  --format alpaca \
  --out-dir ./alpaca-data

For richer pipelines (LLM-as-judge filtering, MinHash deduplication, multiple branches) use a YAML config — see Pipeline configuration.

Commands

`inspect`

Analyses a solution and reports what DistSharp can see — no LLM calls.

dnx DistSharp inspect <solution> [--report <path>] [--show-files] [--show-symbols] [--include-tests]

Output: a Spectre.Console table with file count, symbol counts per kind, and average method complexity. --report writes the same data plus every symbol as JSON.

`generate`

The primary command. Builds a default extract → sample → llm pipeline and runs it.

dnx DistSharp generate <solution> [options]

Option	Default	Description
`--out-dir <path>`	`./distsharp-out`	Where to write the dataset
`--max-rows <n>`	`50000`	Maximum total rows to generate
`--dataset-type <type>`	`mixed`	`explanation` / `completion` / `bug-fix` / `unit-test` / `docstring` / `refactor` / `architecture-qa` / `mixed`
`--format <fmt>`	`jsonl`	`jsonl` / `parquet` / `csv`
`--provider <name>`	`openai`	See LLM providers
`--model <name>`	provider default (see below)	e.g. `gpt-4.1-mini`, `claude-sonnet-4-6`, `gemini-2.5-flash`, `qwen2.5-coder:32b`. Run `distsharp models --provider <name>` if unsure.
`--include-tests`	`false`	Include test projects in analysis
`--include-generated`	`false`	Include `*.g.cs` and similar
`--min-complexity <n>`	`3`	Skip methods below this cyclomatic complexity
`--exclude-namespaces <ns>`	—	Comma-separated namespace prefixes to skip
`--workers <n>`	`4`	Parallel LLM workers
`--seed <n>`	—	Deterministic sampling
`--dry-run`	`false`	Analyse and sample only; skip LLM calls. The sampled symbols are written to disk so you can preview coverage before spending tokens.

`models`

Lists model IDs available from a provider's discovery endpoint. Useful when you don't remember the exact model name (the LLM providers change models often).

dnx DistSharp models --provider openai
dnx DistSharp models --provider anthropic --filter haiku
dnx DistSharp models --provider gemini
dnx DistSharp models --provider ollama

--filter <text> is a case-insensitive substring filter on the model ID.

Supported per provider: openai, anthropic, gemini, azure-openai, ollama, lmstudio, openai-compatible. Each hits its native models endpoint (/v1/models, /v1beta/models, etc.) using the same auth as completion calls.

`init`

Scaffolds a distsharp.yaml from a template.

dnx DistSharp init [--name <name>] [--template <name>] [--output <path>]

Templates: dotnet-mixed, dotnet-explanation, dotnet-unit-test, custom.

`pipeline run`

Executes a pipeline defined in YAML.

dnx DistSharp pipeline run <config.yaml> [--out-dir <path>] [--max-rows <n>]

Use this when you want LLM-as-judge filtering, MinHash deduplication, multiple branches, or anything beyond the default pipeline.

`export`

Converts a JSONL dataset to another format and/or uploads to Hugging Face Hub.

dnx DistSharp export <dataset-dir> [options]

Option	Description
`--format <fmt>`	`jsonl` / `alpaca` / `sharegpt`
`--out-dir <path>`	Local output directory
`--hf-repo <repo>`	Hugging Face dataset repo (`org/name`) — creates it if missing
`--hf-token <token>`	HF API token (or set `HF_TOKEN`)
`--split <name>`	Dataset split name (default `train`)

Dataset types

Each type has a dedicated prompt builder under src/DistSharp.Core/Prompts/.

Type	What it produces	Key fields
`explanation`	Free-text explanations of methods/types	`instruction`, `context`, `response`
`completion`	Half-of-body prompt → completion	`prompt`, `completion`
`bug-fix`	LLM-introduced bug + corrected version + explanation	`buggy_code`, `fixed_code`, `explanation`
`unit-test`	xUnit test grounded in the real signature	`instruction`, `context`, `response`
`docstring`	XML `///` documentation comments	`instruction`, `code`, `response`
`refactor`	Refactored higher-complexity method + rationale	`original_code`, `refactored_code`, `explanation`
`architecture-qa`	Question/answer pair about a type's role	`question`, `answer`
`mixed`	Currently selects `explanation`; fan-out across all types is on the roadmap	—

JSON-output types (bug-fix, refactor, architecture-qa) parse the LLM response and drop rows where parsing fails. Other types pass the raw response through.

LLM providers

Set with --provider / --model (or in the YAML config).

Provider	`--provider` value	Auth
OpenAI	`openai`	`OPENAI_API_KEY`
Anthropic	`anthropic`	`ANTHROPIC_API_KEY`
Azure OpenAI	`azure-openai`	`AZURE_OPENAI_ENDPOINT` + `AZURE_OPENAI_API_KEY`
Google Gemini	`gemini`	`GEMINI_API_KEY`
Ollama (local)	`ollama`	`OLLAMA_BASE_URL` (default `http://localhost:11434`)
LM Studio (local)	`lmstudio`	`LMSTUDIO_BASE_URL` (default `http://localhost:1234`)
Any OpenAI-compatible endpoint	`openai-compatible`	`OPENAI_COMPATIBLE_BASE_URL` + `OPENAI_COMPATIBLE_API_KEY`

Default models when --model is omitted:

Provider	Default model
`openai`	`gpt-4.1-mini`
`anthropic`	`claude-haiku-4-5`
`gemini`	`gemini-2.5-flash`
`azure-openai`, `ollama`, `lmstudio`, `openai-compatible`	no default — pass `--model` (deployment name on Azure, tag on Ollama, etc.)

All providers share a single retry helper that handles HTTP 408/429/500/502/503/504 with exponential backoff, honouring Retry-After when present. Authentication, request shape, and response parsing live in provider-specific classes under src/DistSharp.Providers/.

When the API returns a model-not-found error (HTTP 404 with a known signal), DistSharp surfaces the provider's own message and appends a hint:

HTTP 404 from openai: The model `gpt5.4` does not exist or you do not have access to it.
  — run 'distsharp models --provider openai' to list available models.

Pipeline configuration

distsharp.yaml is bound into a strongly-typed PipelineConfig. Snake-case keys are mapped to PascalCase properties automatically, so the natural YAML style works.

name: my-dotnet-pipeline
version: "1"

solution:
  path: ./MyApp.sln
  include_tests: false
  include_generated: false
  min_complexity: 3
  exclude_namespaces:
    - MyApp.Migrations

steps:
  - name: extract_symbols
    type: RoslynSymbolExtractor
    config:
      symbol_kinds: [method]
      min_complexity: 3

  - name: sample
    type: StratifiedSampler
    depends_on: [extract_symbols]
    config:
      max_rows: 5000
      strategy: complexity_weighted
      seed: 42

  - name: generate
    type: LlmStep
    depends_on: [sample]
    config:
      provider: openai
      model: gpt-4.1-mini
      dataset_type: explanation
      workers: 4
      temperature: 0.7

  - name: judge
    type: LlmJudge
    depends_on: [generate]
    config:
      provider: anthropic
      model: claude-opus-4-7
      min_score: 3.5
      rubric: helpfulness_and_correctness

  - name: dedup
    type: MinHashDeduplicator
    depends_on: [judge]
    config:
      field: response
      threshold: 0.85

output:
  dir: ./training-data
  format: jsonl
  write_metadata: true
  checkpoint_every: 1000

Available step types: RoslynSymbolExtractor, StratifiedSampler, LlmStep, LlmJudge, MinHashDeduplicator.

Sampling strategies: uniform (Fisher–Yates), complexity_weighted (Efraimidis–Spirakis A-Res weighted reservoir).

Judge rubrics: helpfulness_and_correctness (default), code_quality. Or pass a full prompt via prompt:.

Architecture

┌─────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│ DistSharp.Cli   │      │ DistSharp.Core   │      │ DistSharp.Roslyn │
│                 │      │                  │      │                  │
│ Program.cs      │      │ Row              │      │ RoslynSolution-  │
│ HostBuilder     │      │ IStep            │◄─────│   Analyzer       │
│ PipelineBuilder │─────►│ IPipelineExec    │      │ SymbolExtractor  │
│ Yaml provider   │      │ PipelineExecutor │      │ RoslynSymbol-    │
│ Spectre live UI │      │ IDatasetWriter   │      │   ExtractorStep  │
│ 5 commands      │      │ ICheckpointStore │      │ StratifiedSampler│
└─────────────────┘      │ LlmStep          │      └──────────────────┘
        │                │ LlmJudge         │
        │                │ MinHashDedup     │      ┌──────────────────┐
        │                │ Writers (JSONL,  │      │ DistSharp.       │
        │                │  CSV, Parquet)   │      │   Providers      │
        ├───────────────►│ Export converters│◄─────│                  │
        │                │ HuggingFaceClient│      │ ILlmProvider     │
        │                │ 7 prompt builders│      │ OpenAI / Azure / │
        │                └──────────────────┘      │  Gemini / Ollama │
        │                                          │  / LM Studio /   │
        │                                          │  OpenAI-compat / │
        │                                          │  Anthropic       │
        │                                          │ LlmProviderFactory│
        └─────────────────────────────────────────►└──────────────────┘

Pipeline execution model

Steps communicate over System.Threading.Channels. PipelineExecutor:

Topological sort — DFS with cycle detection. Throws InvalidOperationException on cycles.
Channel creation — bounded Channel<Row> (capacity 1000) per step output. Source steps get a pre-completed empty input channel.
Fan-out — a single "distributor" task per upstream step reads its output and copies each row to every consumer's input channel.
Fan-in — multi-writer bounded channel; a per-channel countdown latch (Interlocked.Decrement on fanInCounters[]) completes the writer when all distributors are done.
Sink — the terminal step's distributor drains directly to IDatasetWriter.WriteAsync, then FlushAsync is called once Task.WhenAll resolves.
Backpressure — bounded channels propagate slowness upstream automatically: a slow writer slows the last step, which slows earlier steps.
Workers within LlmStep — N concurrent tasks reading the same input channel and writing the same output channel.

Solution layout

DistSharp/
├── src/
│   ├── DistSharp.Core/        # Pipeline engine, abstractions, step library, writers, prompts
│   ├── DistSharp.Roslyn/      # Roslyn symbol extraction (isolates Microsoft.CodeAnalysis.*)
│   ├── DistSharp.Providers/   # All LLM providers (OpenAI, Anthropic, Gemini, Ollama, ...)
│   └── DistSharp.Cli/         # CLI entry point, command handlers, Spectre.Console UI
├── tests/                     # xUnit + NSubstitute + FluentAssertions per project
├── docs/                      # Design specs and implementation plans
└── .comparison/               # Direct comparison harness vs distilabel

Dependency rules: Core never references Roslyn or any LLM SDK. Roslyn depends on Core. Providers depends on Core. Cli depends on all three. This keeps abstractions clean and forces the Roslyn/LLM SDK weight to live behind interfaces.

Comparison with distilabel

DistSharp and distilabel are not direct competitors — distilabel is a general-purpose synthetic-data framework; DistSharp is a purpose-built .NET code-dataset generator.

A direct comparison was run on c:\Development\ai-roi\AiRoi.sln with both pipelines processing the same 10 methods through the same gpt-4.1-mini model with identical prompts:

Test	DistSharp	distilabel
Output equivalence	avg 651 chars	avg 640 chars (within 2%)
LLM throughput (10 rows, 4 workers/batch=4)	~45 s	66 s
Pipeline definition size	31 LoC (YAML)	97 LoC (Python)
Roslyn / cyclomatic-complexity analysis	native	❌ not supported
LLM provider coverage	7 providers	7+ providers
LLM-as-judge	`LlmJudge` step	`UltraFeedback` + variants
MinHash dedup	built-in	built-in
Output formats	JSONL / CSV / Parquet / Alpaca / ShareGPT	JSONL / CSV / Parquet / Alpaca / ShareGPT
Hugging Face Hub upload	`export --hf-repo`	`Distiset.push_to_hub`
Distribution	NuGet global tool	pip / uv
Argilla integration	❌	first-class
Multi-modal (image/audio)	❌	✔
vLLM / TGI / SGLang servers	via `openai-compatible`	first-class
Built-in step library	5 steps	50+ steps

Conclusion: for .NET code-dataset generation specifically, DistSharp matches distilabel's output quality with ~3× less config and adds semantic analysis distilabel architecturally cannot do. For everything else — preference datasets, multi-modal, embeddings, Argilla workflows — use distilabel directly.

Full comparison with raw outputs: see .comparison/COMPARISON.md.

Development

git clone https://github.com/jamesburton/DistSharp.git
cd DistSharp
dotnet build
dotnet test

131 tests across Core, Roslyn, Providers, and CLI test projects. CI runs on every push and PR (.github/workflows/ci.yml). Release publishes to NuGet on every v* git tag (.github/workflows/release.yml) — requires a NUGET_API_KEY secret on the repo.

To cut a release:

git tag v0.1.0
git push origin v0.1.0

The release workflow then runs build → test → pack → push to nuget.org → upload nupkg as a build artifact.

Roadmap

mixed dataset type fans out across all seven prompt builders in one run
Per-prompt-builder symbol-kind filter (so unit-test only ever sees methods)
Incremental mode — only re-generate rows for files changed since the last run
Cross-project context (called symbols, implemented interfaces, inheritance chain) populated on ExtractedSymbol
Preference datasets — ranked pairs (chosen/rejected) for DPO/ORPO
Embedding-based deduplication via local embedding model
MCP server mode — expose the running pipeline as a Model Context Protocol server
Cost estimation in inspect based on real token-per-symbol measurements

Acknowledgements

DistSharp's pipeline architecture is directly inspired by distilabel by Argilla. If you work in Python, use distilabel directly — it's an excellent framework with deep ecosystem integration. DistSharp exists to bring those ideas to the .NET world with native Roslyn integration.

Licence

MIT

Product	Compatible and additional computed target framework versions.
.NET	net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

Compatible target framework(s)

Included target framework(s) (in package)

Learn more about Target Frameworks and .NET Standard.

This package has no dependencies.

Version	Downloads	Last Updated
0.1.1	180	5/18/2026
0.1.0	101	5/17/2026