HPD-TextExtract
Version 0.5.5
.NET-native text and document extraction library with rich PDF structure, diagnostics, OCR planning, and layout-aware outputs.
Install
dotnet add package HPD-TextExtract
Use When
Use this package when an app or library needs to turn files, byte payloads, or URLs into text plus structured extraction metadata.
Supported inputs include:
- PDF documents
- Plain text, Markdown, JSON, and XML
- HTML and web URLs
- Microsoft Word, Excel, and PowerPoint Open XML documents
- Images through an injected OCR engine
For HPD agent middleware, use HPD-Agent.TextExtraction, which builds on this package.
Quick Start
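using System;
using HPD.TextExtract;

// Create the extractor and read text from a file on disk.
using var extractor = new TextExtractionUtility();
var result = await extractor.ExtractTextAsync("document.pdf");

// Surface extraction failures instead of continuing with empty text.
if (!result.IsSuccess)
{
    throw new InvalidOperationException(result.ErrorMessage);
}

Console.WriteLine(result.ExtractedText);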
Binary Payloads
using System.IO;
using HPD.TextExtract;
using HPD.TextExtract.Models;

// Load the payload into memory; any byte array works, not only files.
var bytes = await File.ReadAllBytesAsync("document.pdf");

using var extractor = new TextExtractionUtility();
var result = await extractor.ExtractTextAsync(
    bytes,
    mimeType: MimeTypes.Pdf,
    fileName: "document.pdf");
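When the payload arrives as a stream rather than a file path (an upload, for example), one option is to buffer it into a byte array and pass it to the same overload shown above. This is a minimal sketch: the `File.OpenRead` call only stands in for whatever readable `Stream` the caller already has, and the code assumes nothing about the library beyond the byte-array overload used in this section.

```csharp
using System;
using System.IO;
using HPD.TextExtract;
using HPD.TextExtract.Models;

// Stand-in for any readable Stream (assumption: the caller already has one).
await using var input = File.OpenRead("document.pdf");

// Buffer the stream so it can be handed over as a byte array.
using var buffer = new MemoryStream();
await input.CopyToAsync(buffer);

using var extractor = new TextExtractionUtility();
var result = await extractor.ExtractTextAsync(
    buffer.ToArray(),
    mimeType: MimeTypes.Pdf,
    fileName: "document.pdf");

// Report the outcome using the result properties listed under Output Shape.
if (result.IsSuccess)
{
    Console.WriteLine(result.ExtractedText);
}
else
{
    Console.Error.WriteLine(result.ErrorMessage);
}
```

Buffering keeps the example simple but holds the whole payload in memory, so very large inputs may call for a different approach.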
Dependency Injection
using HPD.TextExtract;
builder.Services.AddTextExtraction();
Register custom decoders or OCR engines when the built-in behavior is not enough:
using HPD.TextExtract;
using HPD.TextExtract.Decoders;

// MyOcrEngine is a placeholder for your own OCR engine type.
builder.Services.AddTextExtractionWithOcr<MyOcrEngine>();
PDF Notes
PDF extraction is powered by PDFium through PDFiumCore.
The PDF pipeline extracts native text, glyph geometry, font metadata, colors, embedded image regions, optional screenshots, and diagnostics. It can also plan OCR for scanned or low-quality pages when an OCR executor is configured.
PDFium is a native dependency. The NuGet dependency brings platform-specific native assets, including macOS arm64 and x64, Windows x64/x86, and Linux x64 through the upstream PDFium packages. Native libraries are deployed as sidecar runtime assets; they are not embedded inside HPD.TextExtract.dll.
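Because the native assets cover a fixed set of platforms, it can help to check at startup whether the current process matches one of them. The sketch below uses only the standard `System.Runtime.InteropServices` APIs. `IsDocumentedPdfiumPlatform` is a hypothetical helper written for this example, and its platform table simply restates the list in the paragraph above; it is not something the package publishes.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper: true when the OS/architecture pair is one named in the PDF notes.
static bool IsDocumentedPdfiumPlatform(OSPlatform os, Architecture arch)
{
    if (os == OSPlatform.OSX) return arch is Architecture.Arm64 or Architecture.X64;
    if (os == OSPlatform.Windows) return arch is Architecture.X64 or Architecture.X86;
    if (os == OSPlatform.Linux) return arch is Architecture.X64;
    return false;
}

// Work out which OS the process is running on.
OSPlatform current =
    RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? OSPlatform.Windows :
    RuntimeInformation.IsOSPlatform(OSPlatform.OSX) ? OSPlatform.OSX :
    RuntimeInformation.IsOSPlatform(OSPlatform.Linux) ? OSPlatform.Linux :
    OSPlatform.Create("other");

Architecture arch = RuntimeInformation.ProcessArchitecture;

Console.WriteLine($"Platform: {current}, architecture: {arch}");
Console.WriteLine(IsDocumentedPdfiumPlatform(current, arch)
    ? "This platform is in the documented PDFium list."
    : "This platform is not in the documented PDFium list; PDF extraction may fail to load native assets.");
```

A negative result does not prove that PDF extraction will fail, only that the platform is outside the list stated here.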
Output Shape
TextExtractionResult gives the simple text view:
- IsSuccess
- ExtractedText
- FileName
- MimeType
- ProcessingTime
- ErrorMessage
For richer callers, TextExtractionResult.Extraction exposes:
- Content.Sections
- Pages
- TextItems
- Assets
- Diagnostics
- Metadata
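The simple view can be read directly off the result. The following sketch prints each of the properties named for TextExtractionResult; it assumes only that they are readable properties and makes no claim about their exact types, which is why every value goes through string interpolation.

```csharp
using System;
using HPD.TextExtract;

using var extractor = new TextExtractionUtility();
var result = await extractor.ExtractTextAsync("document.pdf");

// Summary fields from the simple text view.
Console.WriteLine($"Success: {result.IsSuccess}");
Console.WriteLine($"File:    {result.FileName}");
Console.WriteLine($"MIME:    {result.MimeType}");
Console.WriteLine($"Elapsed: {result.ProcessingTime}");

// Print either the extracted text or the reported error.
if (result.IsSuccess)
{
    Console.WriteLine(result.ExtractedText);
}
else
{
    Console.Error.WriteLine(result.ErrorMessage);
}
```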
Target Frameworks
This package targets the repo-standard modern frameworks:
- net8.0
- net9.0
- net10.0
It is configured for trimming, single-file analysis, and Native AOT analysis.
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies
- net10.0
  - DocumentFormat.OpenXml (>= 3.4.1)
  - HtmlAgilityPack (>= 1.12.4)
  - Microsoft.Extensions.DependencyInjection (>= 10.0.9)
  - Microsoft.Extensions.Logging (>= 10.0.9)
  - PDFiumCore (>= 150.0.7869)
  - UTF.Unknown (>= 2.6.0)
- net8.0
  - DocumentFormat.OpenXml (>= 3.4.1)
  - HtmlAgilityPack (>= 1.12.4)
  - Microsoft.Extensions.DependencyInjection (>= 10.0.9)
  - Microsoft.Extensions.Logging (>= 10.0.9)
  - PDFiumCore (>= 150.0.7869)
  - UTF.Unknown (>= 2.6.0)
- net9.0
  - DocumentFormat.OpenXml (>= 3.4.1)
  - HtmlAgilityPack (>= 1.12.4)
  - Microsoft.Extensions.DependencyInjection (>= 10.0.9)
  - Microsoft.Extensions.Logging (>= 10.0.9)
  - PDFiumCore (>= 150.0.7869)
  - UTF.Unknown (>= 2.6.0)
NuGet packages (1)
Showing the top 1 NuGet packages that depend on HPD-TextExtract:
| Package | Downloads |
|---|---|
| HPD-Agent.TextExtraction: HPD Agent document handling middleware and builder extensions powered by HPD-TextExtract. | |