FieldCure.DocumentParsers.Ocr
1.2.0
Tesseract OCR fallback for scanned PDFs.
This package plugs into FieldCure.DocumentParsers
by registering an OcrPdfParser for .pdf — when PdfPig yields no text layer
for a page, the page is rendered at 300 DPI (via PDFium) and recognized with Tesseract.
⚠ Platform: Windows only (x64 + arm64). Native Tesseract binaries are bundled for both architectures: win-x64 ships Tesseract 5.0 / Leptonica 1.82.0 (redistributed from upstream Tesseract 5.2.0), and win-arm64 ships Tesseract 5.5.2 / Leptonica 1.87.0, built from source via vcpkg and Authenticode-signed by FieldCure. The correct-arch DLLs are selected at consumer build time. The assembly is marked [SupportedOSPlatform("windows")], so cross-platform consumers will see a CA1416 warning at compile time. Linux / macOS support is planned for a future release.

If you only need text from PDFs with an embedded text layer, use the core FieldCure.DocumentParsers package (pure managed, fully cross-platform).
Install
dotnet add package FieldCure.DocumentParsers.Ocr
Quick Start
using FieldCure.DocumentParsers;
using FieldCure.DocumentParsers.Ocr;
// Registers OcrPdfParser with a Tesseract engine. Dispose the engine at shutdown.
using var ocr = DocumentParserFactoryOcrExtensions.AddOcrSupport();
// Use the factory as usual — scanned pages are OCR'd automatically.
var parser = DocumentParserFactory.GetParser(".pdf")!;
var text = parser.ExtractText(File.ReadAllBytes("scanned.pdf"));
Custom engine
using var myEngine = new TesseractOcrEngine(maxPoolSize: 8);
DocumentParserFactoryOcrExtensions.AddOcrSupport(myEngine);
Implement IOcrEngine to use a different OCR backend.
How It Works
- PdfPig extracts text for each page (same pipeline as the core package).
- If a page yields less than 5% non-whitespace or fewer than 10 meaningful chars, the page is rendered at 300 DPI via PDFium.
- The rendered image is fed to the IOcrEngine.
- Korean output is post-processed to remove spurious inter-character spaces.
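The fallback decision and the Korean cleanup can be sketched in a few lines. This is only an illustration, not the package's code: the names `NeedsOcr` and `CollapseHangulSpacing` are hypothetical, "meaningful" is assumed to mean letters or digits, the 5% figure is assumed to be the share of non-whitespace characters in the extracted text, and the regex is one plausible heuristic (join runs of three or more single Hangul syllables separated by single spaces).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Thresholds quoted in the list above; how the package applies them is an assumption.
const double MinNonWhitespaceRatio = 0.05;
const int MinMeaningfulChars = 10;

// Returns true when the extracted text layer looks too thin to trust,
// i.e. the page should be rendered and sent to OCR.
bool NeedsOcr(string pageText)
{
    if (string.IsNullOrEmpty(pageText)) return true;
    int nonWhitespace = pageText.Count(c => !char.IsWhiteSpace(c));
    int meaningful = pageText.Count(char.IsLetterOrDigit); // assumed definition of "meaningful"
    double ratio = (double)nonWhitespace / pageText.Length;
    return ratio < MinNonWhitespaceRatio || meaningful < MinMeaningfulChars;
}

// Joins runs of 3+ isolated Hangul syllables ("안 녕 하" style OCR output).
// Normally spaced words such as "한국어 문서" are left untouched.
string CollapseHangulSpacing(string text) =>
    Regex.Replace(
        text,
        @"(?<![\uAC00-\uD7A3])[\uAC00-\uD7A3](?: [\uAC00-\uD7A3](?![\uAC00-\uD7A3])){2,}",
        m => m.Value.Replace(" ", ""));

Console.WriteLine(NeedsOcr("   "));
Console.WriteLine(CollapseHangulSpacing("안 녕 하 세 요 world"));
```

A heuristic like this would also merge genuine sequences of one-syllable words, so a real implementation likely needs more context than a single regex.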
Included Languages
- English (eng.traineddata)
- Korean (kor.traineddata)
Languages are auto-discovered from embedded traineddata files.
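As a rough illustration of what auto-discovery could look like, the sketch below derives language codes from resource names ending in `.traineddata`. The function name `DiscoverLanguages` and the `Demo.tessdata.*` resource names are hypothetical; the package's actual resource naming is not documented here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const string Suffix = ".traineddata";

// Extracts the language code (the segment just before ".traineddata")
// from each matching resource name, e.g. "Demo.tessdata.eng.traineddata" -> "eng".
IReadOnlyList<string> DiscoverLanguages(IEnumerable<string> resourceNames) =>
    resourceNames
        .Where(n => n.EndsWith(Suffix, StringComparison.OrdinalIgnoreCase))
        .Select(n => n[..^Suffix.Length])
        .Select(n => n[(n.LastIndexOf('.') + 1)..])
        .Distinct(StringComparer.OrdinalIgnoreCase)
        .OrderBy(n => n, StringComparer.Ordinal)
        .ToList();

var found = DiscoverLanguages(new[]
{
    "Demo.tessdata.eng.traineddata", // hypothetical embedded resource names
    "Demo.tessdata.kor.traineddata",
    "Demo.readme.txt",
});
Console.WriteLine(string.Join("+", found));
```

In a real assembly the input would typically come from `Assembly.GetManifestResourceNames()`.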
Thread Safety
TesseractOcrEngine uses an engine pool (default size: min(ProcessorCount, 4)).
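To make the pooling idea concrete, here is a minimal sketch of a bounded pool sized with the same default, min(ProcessorCount, 4). It is not the package's implementation: `createEngine` and `WithEngine` are hypothetical names, and a plain `object` stands in for an expensive, non-thread-safe engine instance.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

// Default pool size as described above.
int poolSize = Math.Min(Environment.ProcessorCount, 4);

// Hypothetical stand-in for creating an expensive engine; counts creations.
int created = 0;
Func<object> createEngine = () => { Interlocked.Increment(ref created); return new object(); };

var idle = new ConcurrentQueue<object>();
using var gate = new SemaphoreSlim(poolSize, poolSize);

// Rents an engine (creating one lazily), runs the work, then returns the engine.
T WithEngine<T>(Func<object, T> work)
{
    gate.Wait(); // at most poolSize callers hold an engine at once
    object engine = idle.TryDequeue(out var pooled) ? pooled : createEngine();
    try { return work(engine); }
    finally { idle.Enqueue(engine); gate.Release(); }
}

// 32 parallel "pages" share at most poolSize engines.
int[] results = Enumerable.Range(0, 32).AsParallel()
    .Select(i => WithEngine(_ => i * 2))
    .ToArray();
Console.WriteLine($"engines created: {created}, pool size: {poolSize}");
```

Returning the engine to the queue before releasing the semaphore means a waiting caller always finds an idle engine rather than creating an extra one.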
Architecture support
| RID | Tesseract native | Leptonica native | Source |
|---|---|---|---|
| win-x64 | 5.0 (tesseract50.dll) | 1.82.0 (leptonica-1.82.0.dll) | redistributed from Tesseract 5.2.0 NuGet |
| win-arm64 | 5.5.2 (tesseract50.dll) | 1.87.0 (leptonica-1.82.0.dll, filename only; internal version is 1.87.0) | built via vcpkg with a slim FieldCure overlay-port (libcurl + libarchive disabled), Authenticode-signed |
The Tesseract C ABI is additive across 5.0 → 5.5 (no symbol removals or signature changes), so the wrapper's [DllImport] surface is fully compatible. build/FieldCure.DocumentParsers.Ocr.targets selects the correct-arch DLL at consumer build time by detecting host RID via $(NETCoreSdkRuntimeIdentifier), with $(Platform), $(RuntimeIdentifier), $(PlatformTarget) as override paths for explicit-arch cross builds.
PackAsTool consumers (dnx, dotnet tool)
A single tool nupkg may be fetched and run by dnx on either x64 or ARM64, but it's packed once on a single CI host. From v1.2.0 on, PackAsTool consumers get both native trees inside the tool nupkg:
tools/<tfm>/any/x64/ -- x64 Tesseract + Leptonica
tools/<tfm>/any/arm64-platform/x64/ -- ARM64 Tesseract + Leptonica
At runtime, NativeLibraryBootstrap (called from TesseractOcrEngine's constructor) inspects RuntimeInformation.ProcessArchitecture and, on ARM64, points the wrapper's LibraryLoader.CustomSearchPath at the arm64-platform/ subfolder. The wrapper's hard-coded <base>\x64\ lookup then resolves the ARM64 binaries. The bootstrap is a silent no-op for library consumers (where the chosen-arch DLL already sits in <base>\x64\ directly).
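The runtime decision described above can be sketched as a small pure function. This is an assumption-laden illustration rather than the package's code: `ResolveCustomSearchPath` is a hypothetical name, and the directory-existence check is one guess at how the bootstrap stays a no-op for library consumers. The returned folder is what would be assigned to the wrapper's `LibraryLoader.CustomSearchPath`.

```csharp
#nullable enable
using System;
using System.IO;
using System.Runtime.InteropServices;

// Returns the folder to use as the custom search path, or null to keep the default lookup.
// directoryExists is injected so the decision can be exercised without touching the disk.
string? ResolveCustomSearchPath(Architecture arch, string baseDir, Func<string, bool> directoryExists)
{
    if (arch != Architecture.Arm64) return null; // x64 keeps <base>\x64\
    string candidate = Path.Combine(baseDir, "arm64-platform");
    // The wrapper appends "x64" itself, so only redirect if that subfolder is present.
    return directoryExists(Path.Combine(candidate, "x64")) ? candidate : null;
}

string? chosen = ResolveCustomSearchPath(
    RuntimeInformation.ProcessArchitecture, AppContext.BaseDirectory, Directory.Exists);
Console.WriteLine(chosen ?? "(default lookup)");
```

Because the wrapper always looks in an `x64` subfolder of the search path, pointing it at `arm64-platform` is enough to make it load the ARM64 binaries.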
Related Packages
- FieldCure.DocumentParsers — Core text extraction
- FieldCure.DocumentParsers.Imaging — Page image rendering
License
MIT — Copyright (c) 2026 FieldCure Co., Ltd.
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net8.0 and net10.0 are compatible. net9.0 and the platform-specific variants of net8.0, net9.0, and net10.0 (android, browser, ios, maccatalyst, macos, tvos, windows) were computed. |
- net10.0
  - FieldCure.DocumentParsers.Imaging (>= 1.0.0)
  - Tesseract (>= 5.2.0)
- net8.0
  - FieldCure.DocumentParsers.Imaging (>= 1.0.0)
  - Tesseract (>= 5.2.0)
| Version | Downloads | Last Updated | |
|---|---|---|---|
| 1.2.0 | 94 | 5/7/2026 | |
| 1.2.0-preview.2 | 59 | 5/7/2026 | |
| 1.2.0-preview.1 | 53 | 5/7/2026 | |
| 1.1.0 | 94 | 5/7/2026 | |
| 1.0.0 | 128 | 4/20/2026 | |
v1.2.0: Closes the PackAsTool / dnx ARM64 path. v1.1.0 added ARM64 binaries to the package, but build/*.targets selected only one architecture (that of the pack host) into PackAsTool consumers' tool nupkg, so a dotnet tool packed on x64 CI threw BadImageFormatException under ARM64 dnx. v1.2.0 packs both x64 and ARM64 native trees into PackAsTool consumers (tools/<tfm>/any/x64/ + tools/<tfm>/any/arm64-platform/x64/). The new NativeLibraryBootstrap routes the Tesseract.NET wrapper's hard-coded x64\ DLL lookup to the ARM64 tree at runtime via LibraryLoader.CustomSearchPath when the process is ARM64. Library consumers are unchanged; the bootstrap is a silent no-op for non-PackAsTool deployments. Validated via a cross-arch GHA workflow (pack on windows-latest, then install and invoke on windows-11-arm): a scanned English PDF returns 2500+ chars of recognized text on the first ARM64 dnx invocation.