FieldCure.DocumentParsers.Ocr 1.2.0

.NET 8.0

FieldCure.DocumentParsers.Ocr

Tesseract OCR fallback for scanned PDFs.

This package plugs into FieldCure.DocumentParsers by registering an OcrPdfParser for .pdf — when PdfPig yields no text layer for a page, the page is rendered at 300 DPI (via PDFium) and recognized with Tesseract.

⚠ Platform: Windows only (x64 and arm64). Native Tesseract binaries are bundled for both architectures. win-x64 ships Tesseract 5.0 / Leptonica 1.82.0, redistributed from the Tesseract 5.2.0 NuGet package. win-arm64 ships Tesseract 5.5.2 / Leptonica 1.87.0, built from source via vcpkg and Authenticode-signed by FieldCure. The DLLs matching the target architecture are selected at consumer build time. The assembly is marked [SupportedOSPlatform("windows")], so cross-platform consumers will see a CA1416 warning at compile time. Linux / macOS support is planned for a future release.

If you only need text from PDFs with an embedded text layer, use the core FieldCure.DocumentParsers package (pure managed, fully cross-platform).

Install

dotnet add package FieldCure.DocumentParsers.Ocr

Quick Start

using FieldCure.DocumentParsers;
using FieldCure.DocumentParsers.Ocr;

// Registers OcrPdfParser with a Tesseract engine. Dispose the engine at shutdown.
using var ocr = DocumentParserFactoryOcrExtensions.AddOcrSupport();

// Use the factory as usual — scanned pages are OCR'd automatically.
var parser = DocumentParserFactory.GetParser(".pdf")!;
var text = parser.ExtractText(File.ReadAllBytes("scanned.pdf"));

Custom engine

using var myEngine = new TesseractOcrEngine(maxPoolSize: 8);
DocumentParserFactoryOcrExtensions.AddOcrSupport(myEngine);

Implement IOcrEngine to use a different OCR backend.

How It Works

1. PdfPig extracts text for each page (same pipeline as the core package).
2. If the extracted text for a page is less than 5% non-whitespace or contains fewer than 10 meaningful characters, the page is rendered at 300 DPI via PDFium.
3. The rendered image is passed to the IOcrEngine.
4. Korean output is post-processed to remove spurious inter-character spaces.
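
The text-layer check and the Korean post-processing described above can be sketched as follows. This is a minimal illustration, not the package's actual code. The class and method names are hypothetical, the 5% figure is assumed to mean non-whitespace characters divided by total characters, "meaningful" is assumed to mean letters or digits, and the Hangul rule (collapse runs of three or more single syllables separated by single spaces) is one plausible reading of "spurious inter-character spaces".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; names and exact rules are assumptions, not the package API.
public static class OcrHeuristicsSketch
{
    // Runs of 3+ isolated Hangul syllables separated by single spaces, e.g. "한 국 어".
    private static readonly Regex SpacedHangulRun = new Regex(
        @"(?<![\uAC00-\uD7A3])[\uAC00-\uD7A3](?: [\uAC00-\uD7A3](?![\uAC00-\uD7A3])){2,}");

    // True when the extracted text looks too sparse to be a real text layer.
    public static bool NeedsOcr(string pageText)
    {
        if (string.IsNullOrEmpty(pageText)) return true;

        int nonWhitespace = pageText.Count(c => !char.IsWhiteSpace(c));
        int meaningful = pageText.Count(c => char.IsLetterOrDigit(c));
        double ratio = (double)nonWhitespace / pageText.Length;

        // Assumed thresholds taken from the description: 5% and 10 characters.
        return ratio < 0.05 || meaningful < 10;
    }

    // Removes the spaces inside runs of isolated Hangul syllables only;
    // ordinary word spacing such as "안녕 하세요" is left untouched.
    public static string CollapseSpacedHangul(string text)
    {
        return SpacedHangulRun.Replace(text, m => m.Value.Replace(" ", ""));
    }
}
```

Under these assumptions, an empty or whitespace-only page and a page containing only a short label such as "Page 3" would both be routed to OCR, while a page with a normal sentence of extracted text would not.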

Included Languages

English (eng.traineddata)
Korean (kor.traineddata)

Languages are auto-discovered from embedded traineddata files.

Thread Safety

TesseractOcrEngine uses an engine pool (default size: min(ProcessorCount, 4)).
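
To illustrate what a bounded engine pool with that default size might look like, here is a small generic sketch. It is not the implementation used by TesseractOcrEngine: the BoundedPool type, its Use method, and the choice of SemaphoreSlim plus ConcurrentBag are all assumptions made for illustration. Only the default of min(ProcessorCount, 4) comes from the description above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical pool; shows the general technique, not the package's internals.
public sealed class BoundedPool<T> : IDisposable where T : class, IDisposable
{
    private readonly ConcurrentBag<T> _idle = new ConcurrentBag<T>();
    private readonly SemaphoreSlim _slots;
    private readonly Func<T> _factory;

    public int MaxSize { get; }

    // Default documented for the engine pool: min(ProcessorCount, 4).
    public static int DefaultSize() => Math.Min(Environment.ProcessorCount, 4);

    public BoundedPool(Func<T> factory, int? maxSize = null)
    {
        _factory = factory ?? throw new ArgumentNullException(nameof(factory));
        MaxSize = maxSize ?? DefaultSize();
        if (MaxSize < 1) throw new ArgumentOutOfRangeException(nameof(maxSize));
        _slots = new SemaphoreSlim(MaxSize, MaxSize);
    }

    // Borrows an instance (creating one lazily), runs the work, then returns it.
    // At most MaxSize callers hold an instance at the same time.
    public TResult Use<TResult>(Func<T, TResult> work)
    {
        _slots.Wait();
        T item = null;
        try
        {
            if (!_idle.TryTake(out item)) item = _factory();
            return work(item);
        }
        finally
        {
            if (item != null) _idle.Add(item);
            _slots.Release();
        }
    }

    public void Dispose()
    {
        while (_idle.TryTake(out var item)) item.Dispose();
        _slots.Dispose();
    }
}
```

Capping the pool at a small number keeps memory use bounded, since each native OCR engine instance holds its own loaded language data; callers beyond the cap simply wait for a free instance.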

Architecture support

RID	Tesseract native	Leptonica native	Source
`win-x64`	5.0 (`tesseract50.dll`)	1.82.0 (`leptonica-1.82.0.dll`)	redistributed from Tesseract 5.2.0 NuGet
`win-arm64`	5.5.2 (`tesseract50.dll`)	1.87.0 (`leptonica-1.82.0.dll`, filename only — internal version is 1.87.0)	built via vcpkg with a slim FieldCure overlay-port (libcurl + libarchive disabled), Authenticode-signed

The Tesseract C ABI is additive across 5.0 → 5.5 (no symbol removals or signature changes), so the wrapper's [DllImport] surface is fully compatible. build/FieldCure.DocumentParsers.Ocr.targets selects the correct-arch DLL at consumer build time by detecting host RID via $(NETCoreSdkRuntimeIdentifier), with $(Platform), $(RuntimeIdentifier), $(PlatformTarget) as override paths for explicit-arch cross builds.

PackAsTool consumers (dnx, dotnet tool)

A single tool nupkg may be fetched and run by dnx on either x64 or ARM64, but it's packed once on a single CI host. From v1.2.0 on, PackAsTool consumers get both native trees inside the tool nupkg:

tools/<tfm>/any/x64/                 -- x64 Tesseract + Leptonica
tools/<tfm>/any/arm64-platform/x64/  -- ARM64 Tesseract + Leptonica

At runtime, NativeLibraryBootstrap (called from TesseractOcrEngine's constructor) inspects RuntimeInformation.ProcessArchitecture and, on ARM64, points the wrapper's LibraryLoader.CustomSearchPath at the arm64-platform/ subfolder. The wrapper's hard-coded <base>\x64\ lookup then resolves the ARM64 binaries. The bootstrap is a silent no-op for library consumers (where the chosen-arch DLL already sits in <base>\x64\ directly).
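
The architecture check described above can be sketched roughly as below. This is an assumption-laden illustration rather than the actual NativeLibraryBootstrap source: the class and method names are hypothetical, and the directory-existence test is just one way to get the "silent no-op" behavior for library consumers. Only RuntimeInformation.ProcessArchitecture and the arm64-platform/x64 layout come from the text.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical sketch of the path-selection step only.
public static class NativeSearchPathSketch
{
    // Returns the directory that should act as the wrapper's search root,
    // or null when the default <base>\x64\ lookup should be left alone.
    public static string Resolve(string baseDirectory, Architecture processArchitecture)
    {
        // x64 (and any other architecture) keeps the default lookup.
        if (processArchitecture != Architecture.Arm64) return null;

        // PackAsTool layout: <base>/arm64-platform/x64/ holds the ARM64 DLLs.
        string candidate = Path.Combine(baseDirectory, "arm64-platform");

        // Library consumers have no such folder, so this becomes a no-op.
        return Directory.Exists(Path.Combine(candidate, "x64")) ? candidate : null;
    }
}
```

A caller would pass AppContext.BaseDirectory and RuntimeInformation.ProcessArchitecture, and, when the result is non-null, hand it to the wrapper's custom search path setting before the first native call.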

FieldCure.DocumentParsers — Core text extraction
FieldCure.DocumentParsers.Imaging — Page image rendering

License

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.


net10.0
- FieldCure.DocumentParsers.Imaging (>= 1.0.0)
- Tesseract (>= 5.2.0)
net8.0
- FieldCure.DocumentParsers.Imaging (>= 1.0.0)
- Tesseract (>= 5.2.0)


Version	Downloads	Last Updated
1.2.0	94	5/7/2026
1.2.0-preview.2	59	5/7/2026
1.2.0-preview.1	53	5/7/2026
1.1.0	94	5/7/2026
1.0.0	128	4/20/2026

v1.2.0: Closes the PackAsTool / dnx ARM64 path. v1.1.0 added ARM64 binaries to the package, but build/*.targets copied only one architecture (that of the pack host) into a PackAsTool consumer's tool nupkg, so a dotnet tool packed on x64 CI threw BadImageFormatException under dnx on ARM64. v1.2.0 packs both the x64 and ARM64 native trees into PackAsTool consumers (tools/<tfm>/any/x64/ + tools/<tfm>/any/arm64-platform/x64/). The new NativeLibraryBootstrap routes the Tesseract.NET wrapper's hard-coded x64\ DLL lookup to the ARM64 tree at runtime via LibraryLoader.CustomSearchPath when the process is ARM64. Library consumers are unchanged; the bootstrap is a silent no-op for non-PackAsTool deployments. Validated via a cross-arch GHA workflow (pack on windows-latest, then install and invoke on windows-11-arm): a scanned English PDF returns 2500+ chars of recognized text on the first ARM64 dnx invocation.