Kreuzberg 4.3.7


C#

Extract text, tables, images, and metadata from 75+ file formats, including PDF, Office documents, and images. These .NET bindings provide full type safety, async/await support, and compatibility with .NET 10.0+.

Installation

Package Installation

Install via NuGet:

dotnet add package Kreuzberg

Or via NuGet Package Manager:

Install-Package Kreuzberg

System Requirements

  • .NET 10.0+ required
  • Optional: ONNX Runtime version 1.24+ for embeddings support
  • Optional: Tesseract OCR for OCR functionality

Quick Start

Basic Extraction

Extract text, metadata, and structure from any supported document format:

using Kreuzberg;

var config = new ExtractionConfig
{
    UseCache = true,
    EnableQualityProcessing = true
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

Console.WriteLine(result.Content);
Console.WriteLine($"MIME Type: {result.MimeType}");

Common Use Cases

Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

With OCR (for scanned documents):

using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Backend = "tesseract",
        Language = "eng+deu+fra",
        TesseractConfig = new TesseractConfig
        {
            Psm = 3
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
Console.WriteLine(result.Content);

Table Extraction

See Table Extraction Guide for detailed examples.

Processing Multiple Files

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Kreuzberg;

class Program
{
    static async Task Main()
    {
        var config = new ExtractionConfig
        {
            UseCache = true,
            EnableQualityProcessing = true
        };

        var filePaths = new[]
        {
            "document1.pdf",
            "document2.pdf",
            "document3.pdf"
        };

        try
        {
            // Sequential: extract one file at a time, in order.
            var batchResults = new List<ExtractionResult>();

            foreach (var filePath in filePaths)
            {
                var result = await KreuzbergClient.ExtractFileAsync(filePath, config);
                batchResults.Add(result);
                Console.WriteLine($"Processed {filePath}: {result.Content.Length} chars");
            }

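            // Concurrent alternative: start every extraction, then await them together.
            // Shown alongside the loop above for comparison; the files are extracted again.
            var tasks = filePaths.Select(path =>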
                KreuzbergClient.ExtractFileAsync(path, config)
            ).ToArray();

            var results = await Task.WhenAll(tasks);

            var totalChars = results.Sum(r => r.Content.Length);
            Console.WriteLine($"Total extracted: {totalChars} characters");
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Batch processing error: {ex.Message}");
        }
    }
}

Async Processing

For non-blocking document processing:

using System;
using System.Threading.Tasks;
using Kreuzberg;

class Program
{
    static async Task Main()
    {
        try
        {
            var result = await KreuzbergClient.ExtractFileAsync("document.pdf");

            Console.WriteLine($"Content length: {result.Content.Length}");
            Console.WriteLine($"MIME type: {result.MimeType}");

            var tasks = new[]
            {
                KreuzbergClient.ExtractFileAsync("file1.pdf"),
                KreuzbergClient.ExtractFileAsync("file2.pdf"),
                KreuzbergClient.ExtractFileAsync("file3.pdf")
            };

            var results = await Task.WhenAll(tasks);

            foreach (var r in results)
            {
                Console.WriteLine($"Extracted {r.Content.Length} characters");
            }
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Extraction failed: {ex.Message}");
        }
    }
}

Next Steps

Features

Supported File Formats (75+)

75+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

| Category | Formats | Capabilities |
|---|---|---|
| Word Processing | .docx, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .ppt, .ppsx | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |

Images (OCR-Enabled)

| Category | Formats | Features |
|---|---|---|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |

Web & Data

| Category | Formats | Features |
|---|---|---|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .rst, .org, .rtf | CommonMark, GFM, Djot, reStructuredText, Org Mode |

Email & Archives

| Category | Formats | Features |
|---|---|---|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |

Academic & Scientific

| Category | Formats | Features |
|---|---|---|
| Citations | .bib, .biblatex, .ris, .nbib, .enw, .csl | Structured parsing: RIS (structured), PubMed/MEDLINE, EndNote XML (structured), BibTeX, CSL JSON |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |

Complete Format Reference
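
When scanning a folder, it can be convenient to skip files that are obviously out of scope before calling the extractor. The sketch below is an optional pre-check by file extension only. `FormatFilter` and `LooksSupported` are hypothetical names that are not part of Kreuzberg, and the extension set is a deliberately partial sample of the formats listed above. The library performs its own format detection, so treat this as a convenience rather than a replacement.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of the Kreuzberg API).
public static class FormatFilter
{
    // Partial, illustrative subset of the extensions listed in this README.
    private static readonly HashSet<string> Known = new(StringComparer.OrdinalIgnoreCase)
    {
        ".pdf", ".docx", ".odt", ".xlsx", ".pptx",
        ".png", ".jpg", ".jpeg",
        ".html", ".json", ".csv", ".md", ".txt",
        ".eml", ".zip", ".epub"
    };

    // Returns true when the path's extension is in the sample set above.
    // The comparison ignores case, so "REPORT.PDF" matches ".pdf".
    public static bool LooksSupported(string path) =>
        Known.Contains(Path.GetExtension(path));
}
```

For example, `FormatFilter.LooksSupported("REPORT.PDF")` returns true, while a file with no extension returns false.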

Key Capabilities

  • Text Extraction - Extract all text content with position and formatting information

  • Metadata Extraction - Retrieve document properties, creation date, author, etc.

  • Table Extraction - Parse tables with structure and cell content preservation

  • Image Extraction - Extract embedded images and render page previews

  • OCR Support - Integrate multiple OCR backends for scanned documents

  • Async/Await - Non-blocking document processing with concurrent operations

  • Plugin System - Extensible post-processing for custom text transformation

  • Embeddings - Generate vector embeddings using ONNX Runtime models

  • Batch Processing - Efficiently process multiple documents in parallel

  • Memory Efficient - Stream large files without loading entirely into memory

  • Language Detection - Detect and support multiple languages in documents

  • Configuration - Fine-grained control over extraction behavior

Performance Characteristics

| Format | Speed | Memory | Notes |
|---|---|---|---|
| PDF (text) | 10-100 MB/s | ~50MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
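
The speed ranges above can be turned into a rough time estimate by dividing the input size by the throughput. The sketch below does only that arithmetic. `ThroughputEstimate` is a hypothetical helper that is not part of Kreuzberg, and the result is only as reliable as the indicative figures in the table.

```csharp
using System;

// Hypothetical helper (not part of the Kreuzberg API).
public static class ThroughputEstimate
{
    // Converts a size in MB and a throughput range in MB/s into a
    // (best case, worst case) duration pair: time = size / throughput.
    public static (TimeSpan Best, TimeSpan Worst) For(
        double sizeMb, double minMbPerSec, double maxMbPerSec)
    {
        if (sizeMb < 0 || minMbPerSec <= 0 || maxMbPerSec < minMbPerSec)
            throw new ArgumentOutOfRangeException();

        return (
            TimeSpan.FromSeconds(sizeMb / maxMbPerSec),  // fastest rate gives the shortest time
            TimeSpan.FromSeconds(sizeMb / minMbPerSec)); // slowest rate gives the longest time
    }
}
```

With the "PDF (text)" row, `ThroughputEstimate.For(500, 10, 100)` gives a range of 5 to 50 seconds for 500 MB of input.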

OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

  • Tesseract

  • PaddleOCR

OCR Configuration Example

using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Backend = "tesseract",
        Language = "eng+deu+fra",
        TesseractConfig = new TesseractConfig
        {
            Psm = 3
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
Console.WriteLine(result.Content);
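
In the example above, `Language = "eng+deu+fra"` uses Tesseract's convention of joining language codes with `+`, and `Psm = 3` corresponds to Tesseract's default page segmentation mode (fully automatic, without orientation and script detection). If the language list comes from user input or configuration, a small helper can build the string. `OcrLanguages` is a hypothetical name that is not part of Kreuzberg. This is a minimal sketch that only trims, de-duplicates, and joins the codes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of the Kreuzberg API): builds the
// "+"-separated language string passed to OcrConfig.Language.
public static class OcrLanguages
{
    public static string Join(IEnumerable<string> codes)
    {
        var cleaned = codes
            .Select(c => c.Trim())        // tolerate stray whitespace
            .Where(c => c.Length > 0)     // drop empty entries
            .Distinct()                   // keep the first occurrence of each code
            .ToArray();

        if (cleaned.Length == 0)
            throw new ArgumentException("At least one language code is required.", nameof(codes));

        return string.Join("+", cleaned);
    }
}
```

The result can then be assigned to the `Language` property, for example `Language = OcrLanguages.Join(new[] { "eng", "deu", "fra" })`.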

Async Support

This binding provides full async/await support for non-blocking document processing:

using System;
using System.Threading.Tasks;
using Kreuzberg;

class Program
{
    static async Task Main()
    {
        try
        {
            var result = await KreuzbergClient.ExtractFileAsync("document.pdf");

            Console.WriteLine($"Content length: {result.Content.Length}");
            Console.WriteLine($"MIME type: {result.MimeType}");

            var tasks = new[]
            {
                KreuzbergClient.ExtractFileAsync("file1.pdf"),
                KreuzbergClient.ExtractFileAsync("file2.pdf"),
                KreuzbergClient.ExtractFileAsync("file3.pdf")
            };

            var results = await Task.WhenAll(tasks);

            foreach (var r in results)
            {
                Console.WriteLine($"Extracted {r.Content.Length} characters");
            }
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Extraction failed: {ex.Message}");
        }
    }
}
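
In the example above, if any one extraction fails, awaiting `Task.WhenAll` throws and the combined results array is not returned, so a single bad file hides the successful ones. One way to isolate failures is to catch inside each task. The sketch below is generic and takes the extraction call as a delegate. `Outcome<T>` and `SafeBatch` are hypothetical names that are not part of Kreuzberg, and it catches `Exception` broadly instead of `KreuzbergException` to stay self-contained.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical types (not part of the Kreuzberg API).
public sealed record Outcome<T>(string Path, T? Value, Exception? Error)
{
    public bool Succeeded => Error is null;
}

public static class SafeBatch
{
    // Runs `extract` for every path concurrently and records either the
    // result or the exception for each path, so one failure does not
    // discard the other results.
    public static Task<Outcome<T>[]> RunAsync<T>(
        IEnumerable<string> paths,
        Func<string, Task<T>> extract)
    {
        var tasks = paths.Select(async path =>
        {
            try
            {
                return new Outcome<T>(path, await extract(path), null);
            }
            catch (Exception ex)
            {
                return new Outcome<T>(path, default, ex);
            }
        });

        return Task.WhenAll(tasks);
    }
}
```

A call might look like `await SafeBatch.RunAsync(paths, p => KreuzbergClient.ExtractFileAsync(p))`, after which each `Outcome` can be checked with `Succeeded` before reading `Value`.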

Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit Plugin System Guide.

Embeddings Support

Generate vector embeddings for extracted text using the built-in ONNX Runtime support. This requires ONNX Runtime 1.24+ to be installed.

Embeddings Guide

Batch Processing

Process multiple documents efficiently:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Kreuzberg;

class Program
{
    static async Task Main()
    {
        var config = new ExtractionConfig
        {
            UseCache = true,
            EnableQualityProcessing = true
        };

        var filePaths = new[]
        {
            "document1.pdf",
            "document2.pdf",
            "document3.pdf"
        };

        try
        {
            // Sequential: extract one file at a time, in order.
            var batchResults = new List<ExtractionResult>();

            foreach (var filePath in filePaths)
            {
                var result = await KreuzbergClient.ExtractFileAsync(filePath, config);
                batchResults.Add(result);
                Console.WriteLine($"Processed {filePath}: {result.Content.Length} chars");
            }

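            // Concurrent alternative: start every extraction, then await them together.
            // Shown alongside the loop above for comparison; the files are extracted again.
            var tasks = filePaths.Select(path =>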
                KreuzbergClient.ExtractFileAsync(path, config)
            ).ToArray();

            var results = await Task.WhenAll(tasks);

            var totalChars = results.Sum(r => r.Content.Length);
            Console.WriteLine($"Total extracted: {totalChars} characters");
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Batch processing error: {ex.Message}");
        }
    }
}
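
Starting every extraction at once with `Task.WhenAll` works well for a handful of files. For larger batches it may be worth capping how many extractions run concurrently, for example to bound memory use. The sketch below shows one common way to do that with `SemaphoreSlim`. `BoundedBatch` is a hypothetical helper that is not part of Kreuzberg, and it takes the extraction call as a delegate so it makes no assumptions about the library beyond what the examples above show.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of the Kreuzberg API): runs an async
// operation over many inputs with at most maxConcurrency in flight.
public static class BoundedBatch
{
    public static async Task<TResult[]> RunAsync<TInput, TResult>(
        IEnumerable<TInput> inputs,
        Func<TInput, Task<TResult>> operation,
        int maxConcurrency)
    {
        if (maxConcurrency < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrency));

        using var gate = new SemaphoreSlim(maxConcurrency);

        var tasks = inputs.Select(async input =>
        {
            await gate.WaitAsync();          // wait for a free slot
            try
            {
                return await operation(input);
            }
            finally
            {
                gate.Release();              // free the slot, even on failure
            }
        }).ToArray();

        // Results are returned in input order, as with Task.WhenAll.
        return await Task.WhenAll(tasks);
    }
}
```

With the batch example above, the concurrent step could become `await BoundedBatch.RunAsync(filePaths, path => KreuzbergClient.ExtractFileAsync(path, config), maxConcurrency: 4)`. The limit of 4 is an arbitrary illustration and would need tuning for the workload.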

Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

Configuration Guide

Documentation

Contributing

Contributions are welcome! See Contributing Guide.

License

MIT License - see LICENSE file for details.

Support

| Version | Downloads | Last Updated |
|---|---|---|
| 4.3.7 | 0 | 2/20/2026 |
| 4.3.6 | 37 | 2/19/2026 |
| 4.3.5 | 79 | 2/17/2026 |
| 4.3.4 | 88 | 2/16/2026 |
| 4.3.3 | 88 | 2/14/2026 |
| 4.3.2 | 93 | 2/13/2026 |
| 4.3.1 | 103 | 2/12/2026 |
| 4.3.0 | 102 | 2/11/2026 |
| 4.2.15 | 98 | 2/8/2026 |
| 4.2.14 | 89 | 2/7/2026 |
| 4.2.13 | 90 | 2/7/2026 |
| 4.2.12 | 90 | 2/6/2026 |
| 4.2.11 | 101 | 2/6/2026 |
| 4.2.10 | 92 | 2/5/2026 |
| 4.2.9 | 92 | 2/3/2026 |
| 4.2.8 | 99 | 2/2/2026 |
| 4.2.7 | 88 | 2/1/2026 |
| 4.2.6 | 104 | 1/31/2026 |
| 4.2.5 | 99 | 1/30/2026 |
| 4.2.4 | 104 | 1/29/2026 |