DiffToJson 0.4.0


DiffToJson


A CLI for detecting and serializing Git commit Diffs and commit messages from a local Git repository to a .JSONL file.

This can be useful for preparing git commit diffs and message data for training AI/ML models or similar use cases.

NOTE: The CLI includes a regex-based PII detector that finds and redacts email addresses in commit messages, but redaction of every email address is not guaranteed. If commit messages contain sensitive information, conduct a human review of the output .JSONL file.

Output Formats

Two output formats are available, selected via --format:

  • raw (legacy): PascalCase JSONL with flat fields — Diff, CommitMessage, RepoName, License, RepoUrl.
  • training (default): camelCase JSONL shaped for LLM post-training pipelines. Each record is a Training Example with a ChatML messages array, provenance, legal, and optionally originalAssistantMessage. See Training Example Output below.

Documented Information (raw format)

  • The Git Diff
  • The Git Commit Message associated with the diff
  • The license name, if a LICENSE, LICENSE.md, or LICENSE.txt file is present in the repo directory. An LLM call is required to compute this; if the license cannot be determined, "Unknown" is returned as a fallback.
  • The Git project name — Obtained from the Git Repo Directory name
  • The Git Repo URL if provided by the CLI caller.

Training Example Output

When --format training (default), each line of the JSONL file is a single Training Example in camelCase:

{
  "messages": [
    {"role": "system", "content": "You are a software engineer. Write a commit message for the following diff."},
    {"role": "user", "content": "Write a commit message for the diff in the repository 'my-repo' (MIT, https://github.com/example/my-repo):\n\n<diff text>"},
    {"role": "assistant", "content": "<commit message or LLM-generated response>"}
  ],
  "provenance": {"repoName": "my-repo", "repoUrl": "https://github.com/example/my-repo"},
  "legal": {"license": "MIT"},
  "originalAssistantMessage": "<human-written message, present only with --llm-assistant-output>"
}

| Field | Description |
| --- | --- |
| messages | Array of exactly 3 ChatML messages: system, user, assistant. |
| provenance | Source repository name and URL. Present on every record. |
| legal | License identifier for the record's source code. Present on every record. |
| originalAssistantMessage | The original commit message preserved alongside an LLM-generated assistant message. Present only when --llm-assistant-output is enabled; absent otherwise. |

See CONTEXT.md for canonical definitions of Training Example, Provenance, Legal Metadata, and Original Assistant Message.
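
As a rough illustration of the record shape described above, a downstream consumer might check each JSONL line before using it for training. The sketch below is not part of DiffToJson; the function name `validate_training_example` is hypothetical, and it assumes the three messages always appear in the order system, user, assistant.

```python
import json

# Role order taken from the field table: system, user, assistant.
REQUIRED_ROLES = ["system", "user", "assistant"]


def validate_training_example(line: str) -> dict:
    """Parse one JSONL line and check it against the documented fields."""
    record = json.loads(line)
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != 3:
        raise ValueError("messages must be an array of exactly 3 entries")
    roles = [message.get("role") for message in messages]
    if roles != REQUIRED_ROLES:
        raise ValueError(f"unexpected role order: {roles}")
    # provenance and legal are documented as present on every record.
    for key in ("provenance", "legal"):
        if key not in record:
            raise ValueError(f"missing required field: {key}")
    return record


# A sample line mirroring the example record shown earlier.
sample = json.dumps({
    "messages": [
        {"role": "system", "content": "You are a software engineer."},
        {"role": "user", "content": "Write a commit message for the diff..."},
        {"role": "assistant", "content": "Fix typo in README"},
    ],
    "provenance": {"repoName": "my-repo", "repoUrl": "https://github.com/example/my-repo"},
    "legal": {"license": "MIT"},
})
record = validate_training_example(sample)
print(record["legal"]["license"])
```

Because originalAssistantMessage is optional, the sketch deliberately does not require it.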

Configuration & Requirements

System Requirements

  • Git: The git binary must be installed and available in your system's PATH.
  • Runtime: .NET 10 SDK is required for building and running the CLI. If running the CLI as a dotnet tool, only the .NET runtime is required.

LLM Setup for License Detection

To enable automatic license detection, you must provide an AI model configuration via CLI arguments:

  • --model-id: The ID of the AI model to use (Required if --license is not provided).
  • --endpoint-url: The endpoint URL of the OpenAI-compatible API (Required if --license is not provided).
  • --provider: The AI provider ID (e.g., ollama). If not specified, it defaults to OpenAI compatible provider mode.
  • --api-key: The API key for the provider (Required unless using ollama).

Supported Providers:

| AI Provider | Endpoint Type | Supporting NuGet Package | CLI Provider Id to use | Notes |
| --- | --- | --- | --- | --- |
| Ollama | OpenAI Compatible | OllamaSharp | ollama | Compatible with Ollama Local and Ollama Cloud. Provide the desired Ollama endpoint URL. API Key required for Ollama Cloud. |
| Ollama Cloud | OpenAI Compatible | OllamaSharp | ollama-cloud | API Key is required. |
| OpenAI | OpenAI | Microsoft.Extensions.AI.OpenAI | openai | API Key is required. |
| OpenAI Compatible | OpenAI Compatible | Microsoft.Extensions.AI.OpenAI | N/A | Endpoint URL is required. API Key may be required by the provider. |
| Anthropic | Anthropic Compatible | Anthropic | anthropic | API Key is required. |
| Anthropic Compatible | Anthropic Compatible | Anthropic | anthropic-compatible | Endpoint URL is required. API Key may be required by the provider. |

Alternatively, you can manually provide the license name using the --license flag to skip the LLM call.

Note: Provider Ids are case-insensitive.

Installation

As a .NET Tool

If you have the .NET 10 Runtime installed, you can install the CLI as a dotnet tool from the NuGet Gallery.

To install it, use:

dotnet tool install -g DiffToJson

To update it use:

dotnet tool update -g DiffToJson

To uninstall it, use:

dotnet tool uninstall -g DiffToJson

Quick Start

Without LLM - Specified license Name

diff-to-json --repo-directory "C:\path\to\your\repo" --license "[LICENSE_NAME]" -o "C:\output\folder"

OpenAI Compatible

diff-to-json --repo-directory "C:\path\to\your\repo" --model-id "[MODEL_NAME]" --endpoint-url "[OPENAI_COMPATIBLE_ENDPOINT]" --api-key "your-api-key" -o "C:\output\folder"

Ollama (Local)

You can substitute the model id for any of Ollama's Supported Models

diff-to-json --repo-directory "C:\path\to\your\repo" --model-id "qwen3.5:4b" --endpoint-url "http://localhost:11434" --provider "ollama" -o "C:\output\folder"

Note: The CLI does not automatically pull the AI model; it must exist on your device at the time the CLI calls the Ollama API endpoint.

Ollama Cloud

You can substitute the model id for any of Ollama's Cloud Supported Models

Ollama Cloud Models via Ollama CLI:

diff-to-json --repo-directory "C:\path\to\your\repo" --model-id "gemma4:31b-cloud" --endpoint-url "http://localhost:11434" --api-key "your-api-key" --provider "ollama" -o "C:\output\folder"

Ollama Cloud API:

diff-to-json --repo-directory "C:\path\to\your\repo" --model-id "gemma4:31b-cloud" --api-key "your-api-key" --provider "ollama-cloud" -o "C:\output\folder"

This will analyse the specified repository and create a file named {repo-name}-commits.jsonl inside the specified output folder.

Training Format with Conventional Commits Preset

diff-to-json --repo-directory "C:\path\to\your\repo" --format training --prompt-style conventional -o "C:\output\folder"

With LLM-Generated Assistant Messages

diff-to-json --repo-directory "C:\path\to\your\repo" --format training --llm-assistant-output --model-id "qwen3.5:4b" --endpoint-url "http://localhost:11434" --provider "ollama" -o "C:\output\folder"

Using Custom Prompt Overrides

diff-to-json --repo-directory "C:\path\to\your\repo" --format training --system-prompt "You are an expert Git user." --user-prompt "Summarize this diff for {repoName}: {diff}" -o "C:\output\folder"

Raw (Legacy) Format

diff-to-json --repo-directory "C:\path\to\your\repo" --format raw -o "C:\output\folder"

CLI Parameters

| Parameter Name | Type | Optional/Required | Default | Notes |
| --- | --- | --- | --- | --- |
| --repo-directory | DirectoryInfo | Optional | Current directory | The local git repository directory to analyze. |
| --repo-url | string | Optional | "" | The URL of the git repository to include in the JSONL output. |
| --model-id | string | Conditional | "" | Required if --license is not provided. The ID of the AI model to use. |
| --endpoint-url | string | Optional | "" | Required if --license is not provided, or if the provider is not openai, ollama-cloud, or anthropic. The endpoint URL of the API. |
| --api-key | string | Optional | "" | The API key for the AI provider. |
| --provider | string | Optional | "" | The AI provider ID. See LLM Setup. |
| --license | string | Optional | "" | Manually specify the license name. Skips LLM license detection. |
| --output / -o | string | Optional | {repoDir}/{repoName}-commits.jsonl | The output file path. |
| --format | string | Optional | training | Output format. training for camelCase ChatML JSONL; raw for legacy PascalCase JSONL. |
| --prompt-style | string | Optional | default | Prompt preset name. See Prompt Presets. |
| --system-prompt | string | Optional | "" (uses preset) | Override the system prompt template. Supports placeholders. |
| --user-prompt | string | Optional | "" (uses preset) | Override the user prompt template. Supports placeholders. |
| --llm-assistant-output | bool | Optional | false | Enable LLM-generated assistant messages. Requires --format training. See LLM Override. |
| --llm-override-prompt | string | Optional | "" (uses user prompt) | Override the prompt sent to the LLM when --llm-assistant-output is enabled. Supports placeholders. Requires --llm-assistant-output. |
| --redaction | string | Optional | message | PII redaction tier. See Redaction Tiers. |

Cross-Option Rules

The following validators enforce constraints between flags:

| Condition | Outcome | Message |
| --- | --- | --- |
| --llm-assistant-output + --format raw | Error (incompatible) | Error: --llm-assistant-output is not compatible with --format raw. |
| --llm-override-prompt set without --llm-assistant-output | Error (override prompt requires override enabled) | Error: --llm-override-prompt requires --llm-assistant-output. |
| --redaction none + --llm-assistant-output | Warning (proceeds but may expose PII in LLM output) | Warning: --redaction none combined with --llm-assistant-output may expose PII in LLM output. |

Unknown placeholders in --system-prompt, --user-prompt, or --llm-override-prompt also cause an error before any records are written.
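
The cross-option rules can be pictured as a small validation pass that runs before any records are written. The following is only a sketch of that logic in Python, not the tool's actual validators; the function name `check_options` and its parameter names are hypothetical, while the message strings are the ones listed in the rules above.

```python
def check_options(fmt="training", llm_assistant_output=False,
                  llm_override_prompt="", redaction="message"):
    """Return (errors, warnings) for a combination of CLI options."""
    errors, warnings = [], []
    # Rule 1: LLM-generated assistant output only makes sense for training format.
    if llm_assistant_output and fmt == "raw":
        errors.append("Error: --llm-assistant-output is not compatible with --format raw.")
    # Rule 2: an override prompt requires the override to be enabled.
    if llm_override_prompt and not llm_assistant_output:
        errors.append("Error: --llm-override-prompt requires --llm-assistant-output.")
    # Rule 3: a warning only; processing continues.
    if redaction == "none" and llm_assistant_output:
        warnings.append("Warning: --redaction none combined with --llm-assistant-output "
                        "may expose PII in LLM output.")
    return errors, warnings


errors, warnings = check_options(fmt="raw", llm_assistant_output=True)
print(errors)
```

Returning errors and warnings separately mirrors the distinction in the rules: errors stop the run, whereas the warning lets it proceed.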

Prompt Presets

Available via --prompt-style. Each preset provides a system and user message template. Placeholders (see below) are substituted at serialization time.

| Preset Name | System Prompt | User Prompt |
| --- | --- | --- |
| default | You are a software engineer. Write a commit message for the following diff. | Write a commit message for the diff in the repository '{repoName}' ({license}, {repoUrl}):\n\n{diff} |
| conventional | You are a software engineer. Write a commit message following the Conventional Commits specification. | Write a Conventional Commits-style commit message for the diff in '{repoName}' ({license}, {repoUrl}):\n\n{diff} |

Overrides take precedence over the selected preset: provide --system-prompt or --user-prompt to replace the respective message entirely.

Placeholders

Placeholder tokens in prompt templates are replaced with record-specific data at serialization time. Unknown placeholders cause a CLI error.

| Placeholder | Substituted With |
| --- | --- |
| {diff} | The git diff content |
| {commitMessage} | The commit message |
| {repoName} | The repository name (directory name) |
| {license} | The detected or manually specified license |
| {repoUrl} | The repository URL from --repo-url |
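
To make the substitution behaviour concrete, here is a minimal Python sketch of placeholder rendering with an unknown-placeholder check. It is an illustration only: the `render` function and the token pattern are assumptions, and the real CLI may detect placeholders differently.

```python
import re

# The five placeholders documented above.
KNOWN_PLACEHOLDERS = {"diff", "commitMessage", "repoName", "license", "repoUrl"}
# Assumed token syntax: a name made of letters wrapped in braces.
TOKEN = re.compile(r"\{([A-Za-z]+)\}")


def render(template: str, values: dict) -> str:
    """Replace known placeholders; reject templates that use unknown ones."""
    unknown = set(TOKEN.findall(template)) - KNOWN_PLACEHOLDERS
    if unknown:
        raise ValueError(f"unknown placeholders: {sorted(unknown)}")
    # A single pass, so braces inside substituted values (e.g. in a diff) are not re-scanned.
    return TOKEN.sub(lambda match: str(values[match.group(1)]), template)


values = {
    "diff": "+hello",
    "commitMessage": "Add greeting",
    "repoName": "my-repo",
    "license": "MIT",
    "repoUrl": "https://github.com/example/my-repo",
}
prompt = render("Summarize this diff for {repoName}: {diff}", values)
print(prompt)
```

Validating the whole template before substituting anything matches the documented behaviour that an unknown placeholder fails the run up front rather than partway through the output.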

Redaction Tiers

Available via --redaction. Controls which fields are passed through the PII redactor (regex-based email redaction) before emission.

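| Tier | CLI Value | Commit Message | Diff | LLM Output |
| --- | --- | --- | --- | --- |
| None | none | Not redacted | Not redacted | Not redacted |
| Message (default) | message | Redacted | Not redacted | Not redacted |
| Diff | diff | Redacted | Redacted | Not redacted |
| All | all | Redacted | Redacted | Redacted |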
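
One way to think about the tiers is as a mapping from the CLI value to the set of fields passed through the redactor. The Python sketch below illustrates that idea under stated assumptions: the field keys (`commitMessage`, `diff`, `llmOutput`), the `[REDACTED]` replacement token, and the email pattern are all hypothetical and are not taken from the tool's source.

```python
import re

# Illustrative email pattern; the tool's actual regex is not documented here.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Each tier redacts a superset of the fields covered by the previous tier.
TIER_FIELDS = {
    "none": set(),
    "message": {"commitMessage"},
    "diff": {"commitMessage", "diff"},
    "all": {"commitMessage", "diff", "llmOutput"},
}


def apply_redaction(fields: dict, tier: str) -> dict:
    """Redact email-like text in the fields selected by the given tier."""
    targets = TIER_FIELDS[tier]
    return {
        name: EMAIL.sub("[REDACTED]", text) if name in targets else text
        for name, text in fields.items()
    }


fields = {
    "commitMessage": "Fix bug, thanks a@b.com",
    "diff": "+ contact: c@d.org",
    "llmOutput": "Reported by e@f.io",
}
print(apply_redaction(fields, "message"))
```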

LLM Override

When --llm-assistant-output is enabled, the assistant message of each Training Example is generated by an LLM at extraction time, rather than taken from the original commit message. The original message is preserved in originalAssistantMessage for downstream evaluation.

  • Requires --format training (see Cross-Option Rules).
  • Requires AI provider configuration (--provider, --model-id, --endpoint-url, --api-key).
  • Use --llm-override-prompt to send a different prompt to the LLM than what appears in the user message.
  • On persistent LLM failure, the record is emitted with assistant.content = null and originalAssistantMessage populated.
  • When --redaction all is set, the LLM output is also redacted after generation.
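
Since records with a failed LLM call are still emitted, a consumer of the JSONL file may want to separate them from the successful ones. The following is a small Python sketch of that filtering step; the function name `split_by_llm_outcome` is hypothetical, and it assumes the assistant message is the last entry in messages.

```python
import json


def split_by_llm_outcome(lines):
    """Separate records with a generated assistant message from failed ones."""
    generated, failed = [], []
    for line in lines:
        record = json.loads(line)
        assistant = record["messages"][-1]
        # On persistent LLM failure the assistant content is null (None in Python).
        if assistant.get("content") is None:
            failed.append(record)
        else:
            generated.append(record)
    return generated, failed


lines = [
    json.dumps({
        "messages": [
            {"role": "system", "content": "s"},
            {"role": "user", "content": "u"},
            {"role": "assistant", "content": None},
        ],
        "originalAssistantMessage": "Fix typo",
    }),
    json.dumps({
        "messages": [
            {"role": "system", "content": "s"},
            {"role": "user", "content": "u"},
            {"role": "assistant", "content": "fix: correct typo"},
        ],
        "originalAssistantMessage": "Fix typo",
    }),
]
generated, failed = split_by_llm_outcome(lines)
print(len(generated), len(failed))
```

The failed records keep originalAssistantMessage, so they can be retried later or used with the human-written message instead.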

How to Build

Standard Build

Build the project using the .NET CLI:

dotnet build src/DiffToJsonCli/DiffToJsonCli.csproj

Running the Tool

You can run the tool directly from the source:

dotnet run --project src/DiffToJsonCli/DiffToJsonCli.csproj -- [args]

Publishing (Native AOT)

For high-performance execution and a standalone binary without requiring the .NET runtime, publish as Native AOT:

dotnet publish -c Release -r [runtime-identifier] -p:PublishAot=true

Replace [runtime-identifier] with the appropriate RID for your platform (e.g., win-x64, linux-x64, osx-arm64).

Technical Details

PII Redaction

The tool uses a regex-based approach to detect and redact email addresses within commit messages, to help prevent leaking personally identifiable information (PII). Because regex matching is best-effort, it does not guarantee 100% redaction. If the repository may contain sensitive email addresses, always conduct a human review of the output.
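
To support that human review, a reviewer could scan the training-format output for anything that still looks like an email address. The Python sketch below is one possible aid, not a feature of the tool; the function name `find_email_like` and the deliberately loose pattern are assumptions, and it will produce false positives.

```python
import json
import re

# Deliberately loose: flag anything of the form text@text.text for a human to inspect.
EMAIL_LIKE = re.compile(r"[^\s@<>\"']+@[^\s@<>\"']+\.[^\s@<>\"']+")


def find_email_like(lines):
    """Return (line number, role, match) for each email-like string found."""
    hits = []
    for number, line in enumerate(lines, start=1):
        record = json.loads(line)
        for message in record.get("messages", []):
            # content may be null on LLM failure, so fall back to an empty string.
            content = message.get("content") or ""
            for match in EMAIL_LIKE.findall(content):
                hits.append((number, message.get("role"), match))
    return hits


lines = [json.dumps({
    "messages": [
        {"role": "system", "content": "You are a software engineer."},
        {"role": "user", "content": "diff\n+ mail: dev@example.com"},
        {"role": "assistant", "content": "Add contact address"},
    ]
})]
hits = find_email_like(lines)
print(hits)
```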

License Detection Logic

The tool automatically discovers license information by searching for LICENSE.md, LICENSE.txt, or LICENSE files in the repository root. If found, the content is sent to a configured LLM (via OllamaSharp or Microsoft.Extensions.AI.OpenAI) to extract the license name. If no file is found or the LLM cannot determine the license, it falls back to "Unknown".
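
The discovery-and-fallback flow described above can be sketched in a few lines of Python. This is an illustration rather than the tool's code: `find_license_file`, `detect_license`, and the `ask_llm` callable (standing in for the configured LLM client) are hypothetical names, and the order in which candidate files are tried is an assumption.

```python
from pathlib import Path
from typing import Callable, Optional

# File names taken from the description above; the search order is assumed.
CANDIDATES = ("LICENSE.md", "LICENSE.txt", "LICENSE")


def find_license_file(repo_root: Path) -> Optional[Path]:
    """Return the first license file present in the repository root, if any."""
    for name in CANDIDATES:
        candidate = repo_root / name
        if candidate.is_file():
            return candidate
    return None


def detect_license(repo_root: Path, ask_llm: Callable[[str], str],
                   manual_license: str = "") -> str:
    """Resolve the license name, falling back to "Unknown"."""
    if manual_license:
        # Equivalent to passing --license: no LLM call is made.
        return manual_license
    path = find_license_file(repo_root)
    if path is None:
        return "Unknown"
    name = ask_llm(path.read_text(encoding="utf-8", errors="replace"))
    return name or "Unknown"
```

Passing the LLM call in as a function keeps the sketch independent of any particular provider and makes the fallback paths easy to exercise without network access.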

Merge Commits

Merge commits are omitted from the output. The tool retrieves diffs via git log -p, which by default produces no diff output for merge commits. The parser skips any commit with an empty diff body, so merge commits are excluded regardless of format.
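
The effect of skipping empty diff bodies can be shown with a simplified parser over git log -p text. The Python sketch below is a rough approximation, not the tool's parser; the function names and the rule for spotting a diff body (a line starting with "diff --git ") are assumptions.

```python
import re

# In default git log output each commit starts with "commit <hash>" at column 0;
# message lines are indented, so they cannot match this pattern.
COMMIT_HEADER = re.compile(r"^commit [0-9a-f]{7,64}", re.MULTILINE)


def split_commits(log_text: str) -> list:
    """Split git log -p output into one text chunk per commit."""
    starts = [match.start() for match in COMMIT_HEADER.finditer(log_text)]
    ends = starts[1:] + [len(log_text)]
    return [log_text[start:end] for start, end in zip(starts, ends)]


def commits_with_diffs(log_text: str) -> list:
    """Keep only commits that carry a diff body (merge commits have none by default)."""
    return [chunk for chunk in split_commits(log_text) if "\ndiff --git " in chunk]


SAMPLE_LOG = """commit 1111111111111111111111111111111111111111
Merge: aaaaaaa bbbbbbb
Author: Dev <dev@example.com>
Date:   Mon Jan 1 00:00:00 2024 +0000

    Merge branch 'feature'

commit 2222222222222222222222222222222222222222
Author: Dev <dev@example.com>
Date:   Mon Jan 1 00:00:00 2024 +0000

    Add greeting

diff --git a/hello.txt b/hello.txt
new file mode 100644
--- /dev/null
+++ b/hello.txt
@@ -0,0 +1 @@
+hello
"""

kept = commits_with_diffs(SAMPLE_LOG)
print(len(kept))
```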

Native AOT Compatibility

The application is designed for Native AOT compatibility, ensuring fast startup times and a small deployment footprint.

Roadmap

These are features I'd like to work towards, but they are not guaranteed to appear in future versions.

In no particular order:

  • AWS Bedrock support
  • Support for working with Microsoft.Extensions.Compliance.Redaction to enable support for different implementations and types of PII redaction.

License

This project contains both AI-generated and human-written code. All human-written code in this project is licensed under the Apache 2.0 license.

| Product | Compatible and additional computed target framework versions |
| --- | --- |
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |

This package has no dependencies.

| Version | Downloads | Last Updated |
| --- | --- | --- |
| 0.4.0 | 100 | 6/20/2026 |
| 0.3.0 | 105 | 6/12/2026 |
| 0.2.0 | 111 | 5/28/2026 |
| 0.1.1 | 102 | 5/27/2026 |

#### Additions
- Wired CLI flags, format selection, validators, and dependency injection into `Program.cs`.
- Added the ability to output commit message data as ChatML-formatted JSONL files for easier use with LLMs.
- Added the ability to disable PII redaction or extend it to cover output messages and input diffs.

#### Modifications
- Renamed the output file option from `--output-file` to `--output`.
- Refactored `LicenseFileFinder.FindLicenseFile` to a non-async method with try/catch.
- Updated `Program.cs` imports to use `DiffToJsonLib.Writers` instead of `DiffToJsonLib.Redactors`.
- Updated to `CliInvoke` 2.8.2 using the `CliInvoke.Extensions` package and the `AddCliInvoke` DI extension, fixing a related `CliInvoke` issue.
- Switched PII redaction to implement `Microsoft.Extensions.Compliance.Redaction`'s `Redactor` abstraction.

#### Removals
- Removed the `ChatClientCreator` helper and `CustomOllamaJsonContext` (now provided by the library as `IChatClientFactory`, `ChatClientFactory`, and `OllamaJsonContext`).