DoclingDotNet 1.1.0

There is a newer version of this package available.
See the version list below for details.
dotnet add package DoclingDotNet --version 1.1.0
                    
NuGet\Install-Package DoclingDotNet -Version 1.1.0
                    
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="DoclingDotNet" Version="1.1.0" />
                    
For projects that support PackageReference, copy this XML node into the project file to reference the package.
<PackageVersion Include="DoclingDotNet" Version="1.1.0" />
                    
Directory.Packages.props
<PackageReference Include="DoclingDotNet" />
                    
Project file
For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
paket add DoclingDotNet --version 1.1.0
                    
#r "nuget: DoclingDotNet, 1.1.0"
                    
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
#:package DoclingDotNet@1.1.0
                    
#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.
#addin nuget:?package=DoclingDotNet&version=1.1.0
                    
Install as a Cake Addin
#tool nuget:?package=DoclingDotNet&version=1.1.0
                    
Install as a Cake Tool

DoclingDotNet

Slice0 CI NuGet version License: MIT

A high-performance, pure-managed .NET port of the Docling document parsing library. Blisteringly fast extraction of structured semantic data from PDFs, DOCX, HTML, EPUB, and more.

Overview

DoclingDotNet is a high-fidelity, cross-platform .NET port of IBM's Docling. It extracts highly structured semantic data—including paragraphs, tables, reading order, and layouts—from complex documents.

By replacing heavy Python and C++ spatial algorithms with zero-allocation, pure-managed C# code, DoclingDotNet parses complex PDFs up to 2x faster and structured formats like DOCX/HTML over 10x faster than the upstream implementation while maintaining 100% semantic behavioral parity.

It features a completely managed pipeline with native support for structured formats (DOCX, PPTX, HTML, EPUB, EML), advanced AI extraction for PDFs (via ONNX RT-DETR layout inference and Tesseract OCR), and now native Audio Speech Recognition (ASR) via Whisper.net.

Start Here

The Strangler Fig Porting Method

To safely port the massive Python and C++ ecosystem to .NET without breaking functionality, we utilized the Strangler Fig Pattern.

Instead of a complete rewrite, we established an immutable JSON data contract (SegmentedPdfPageDto) as our absolute source of truth. We then "strangled" the upstream Python codebase slice by slice:

  1. The Roots: We started by wrapping the core C++ docling-parse library with a thin P/Invoke C-ABI bridge, mapping its raw output directly into our JSON contract.
  2. The Trunk: We ripped out the slow, FFI-heavy Python layout and reading-order algorithms, replacing them with pure C# zero-allocation algorithms (custom SpatialIndex, IntervalTree, and UnionFind) that output the exact same JSON.
  3. The Branches: Finally, we bypassed the PDF pipeline entirely to parse semantic formats (DOCX, HTML, EPUB) using robust .NET native libraries (DocumentFormat.OpenXml, HtmlAgilityPack), ensuring they also produce the exact same JSON contract.

By validating every slice with an automated Semantic Parity Harness, we guaranteed that the .NET port remained 100% behaviorally identical to the upstream Python logic at every step, while allowing us to ruthlessly optimize for C# performance.

You can read the full case study here: Strangling the Python Monolith to High-Performance .NET.

Key Features

  • Universal Input: Seamlessly parses .pdf, .docx, .pptx, .xlsx, .html, .epub, .eml, .md, .csv, .wav, .mp3 and raw images.
  • Universal Output: Regardless of the input format, DoclingDotNet emits a unified, highly-structured JSON data model (SegmentedPdfPageDto).
  • Built-in Exporters: Export the unified model directly to Markdown, HTML, Plain Text, or DocTags for immediate ingestion into LLMs or Vector Databases (RAG pipelines).
  • AI-Powered Layouts: Configurable pipeline allows routing complex PDFs through local ONNX RT-DETR models to accurately group tables, figures, and paragraphs.
  • Native Audio Transcription (ASR): Configurable pipeline runs Whisper.net to instantly transcribe .wav or transcodes .mp3/.m4a files locally (requires ffmpeg on path) and hydrates them directly into the standard JSON schema with timeline tracking.
  • Cloud Native & Serverless Ready: 100% System.IO.Stream compatible. Parse documents directly from memory buffers without touching the disk.

Quick Example

using DoclingDotNet.Pipeline;
using DoclingDotNet.Export;

var runner = new DocumentConversionRunner();
var request = new PdfConversionRequest { FilePath = "sample_book.epub" }; // Or .pdf, .docx, etc.

var result = await runner.ExecuteAsync(request);

// Instantly generate RAG-ready Markdown!
var markdown = MarkdownExporter.Export(result);
Console.WriteLine(markdown);

Contributing and Architecture

If you are contributing to the core engine, please review the internal architectural documentation:

  • Project context & overview: docs/archive/Project_Overview.md
  • Current implementation state: docs/archive/Current_State.md
  • Parity Mechanism (How CI blocks drift): docs/operations/Parity_Mechanism.md
  • Main docs index: docs/README.md

Core Validation

If you are developing locally, ensure the Semantic Parity Harness passes against the ground truth corpus before opening a Pull Request:

dotnet test dotnet/DoclingDotNet.slnx --configuration Release
powershell -ExecutionPolicy Bypass -File .\scripts\test-docling-parse-cabi-smoke.ps1 -SkipConfigure
powershell -ExecutionPolicy Bypass -File .\scripts\run-docling-parity-harness.ps1 -SkipConfigure -Output .artifacts/parity/docling-parse-parity-report.json -MaxOcrDrift 0 -TextMismatchSeverity Minor -GeometryMismatchSeverity Minor -ContextMismatchSeverity Minor --max-pages 20
Product Compatible and additional computed target framework versions.
.NET net10.0 is compatible.  net10.0-android was computed.  net10.0-browser was computed.  net10.0-ios was computed.  net10.0-maccatalyst was computed.  net10.0-macos was computed.  net10.0-tvos was computed.  net10.0-windows was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
1.1.1 0 3/14/2026
1.1.0 99 2/28/2026
1.0.0 89 2/27/2026