Daenet.DocumentParser 1.0.4

.NET 9.0

dotnet add package Daenet.DocumentParser --version 1.0.4

NuGet\Install-Package Daenet.DocumentParser -Version 1.0.4

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="Daenet.DocumentParser" Version="1.0.4" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package:

<PackageVersion Include="Daenet.DocumentParser" Version="1.0.4" />

Then reference the package, without a version, in the project file:

<PackageReference Include="Daenet.DocumentParser" />

paket add Daenet.DocumentParser --version 1.0.4

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: Daenet.DocumentParser, 1.0.4"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package Daenet.DocumentParser@1.0.4

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Install as a Cake Addin:

#addin nuget:?package=Daenet.DocumentParser&version=1.0.4

Install as a Cake Tool:

#tool nuget:?package=Daenet.DocumentParser&version=1.0.4

Daenet Document Parser

The Daenet Document Parser software provides a document importer capable of handling various document types, such as .TXT, .PDF, and .DOCX, as well as images. The content from these files is extracted and saved as a .TXT file.

What Is a Parser?

A parser reads the content of a document and outputs it as a structured text file.

Supported Parsers

The Daenet Document Parser includes the following types of parsers:

PDF Parser
Word Parser
TXT Parser
Image Parser

Usage:

All documents must be stored in an input folder specified via command-line arguments.
The document type is automatically detected and processed accordingly.
Unsupported documents will generate an error message.
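
The batch behaviour described above can be pictured with a minimal sketch. It is not the package's own code: `BatchRunSketch` and the `extractText` delegate are hypothetical stand-ins, and it assumes an unsupported file is signalled with a `NotSupportedException`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BatchRunSketch
{
    // extractText is a placeholder for whatever turns one file into text;
    // it is assumed to throw NotSupportedException for unknown file types.
    public static List<string> Run(string inputFolder, string outputFolder,
                                   Func<string, string> extractText)
    {
        var errors = new List<string>();
        Directory.CreateDirectory(outputFolder);

        foreach (string path in Directory.EnumerateFiles(inputFolder))
        {
            try
            {
                string text = extractText(path);
                string target = Path.Combine(
                    outputFolder, Path.GetFileNameWithoutExtension(path) + ".txt");
                File.WriteAllText(target, text);
            }
            catch (NotSupportedException ex)
            {
                // Record the unsupported file and continue with the next one.
                errors.Add($"{Path.GetFileName(path)}: {ex.Message}");
            }
        }
        return errors;
    }
}
```

In this sketch two inputs with the same base name (for example report.pdf and report.docx) would write to the same output file, so a real implementation would need a naming rule for that case.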

Setting Up the Document Parser

To use the DocumentParser, you must also provide an OcrParserConfig, because WordParser and PDFParser rely on OcrParser to read the content of images embedded in a document.

DocumentParserConfig configuration:

The parser uses designated input and output folders. These folders can be specified using command-line arguments:

Use InputFolder to define the location of input files.
Use OutputFolder to define where the output .TXT files will be saved.

Alternatively, both values can be set in an appsettings.json file.
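
How the two settings might be resolved from both sources can be sketched as follows. This is an illustration only: `FolderSettings` and `FolderSettingsLoader` are hypothetical names, the JSON is assumed to hold `InputFolder` and `OutputFolder` as flat top-level strings, and the `--Key=Value` argument form is an assumption rather than the package's documented syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical holder for the two settings named above; not the package's own type.
public sealed record FolderSettings(string InputFolder, string OutputFolder);

public static class FolderSettingsLoader
{
    public static FolderSettings Load(string json, string[] args)
    {
        var values = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        // 1. Start with flat top-level string properties from the JSON text.
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            foreach (JsonProperty p in doc.RootElement.EnumerateObject())
            {
                if (p.Value.ValueKind == JsonValueKind.String)
                    values[p.Name] = p.Value.GetString() ?? string.Empty;
            }
        }

        // 2. Let "--Key=Value" command-line arguments override the file.
        foreach (string arg in args)
        {
            if (!arg.StartsWith("--", StringComparison.Ordinal)) continue;
            int eq = arg.IndexOf('=');
            if (eq > 2) values[arg.Substring(2, eq - 2)] = arg.Substring(eq + 1);
        }

        if (!values.TryGetValue("InputFolder", out string input))
            throw new ArgumentException("InputFolder is not set.");
        if (!values.TryGetValue("OutputFolder", out string output))
            throw new ArgumentException("OutputFolder is not set.");
        return new FolderSettings(input, output);
    }
}
```

Letting command-line values win over the file is a common convention, chosen here for the sketch; the source does not state which source takes precedence.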

OcrParser configuration

The OcrParser uses the Tesseract engine to interpret the content of an image and requires an OcrParserConfig to function.

The config contains the following parameters, which can be supplied either as command-line arguments or from an appsettings.json file:

OcrParserLanguage: the language expected in the image.
TessarectFilePath: the model file for the respective language (see Tessdata provider).
TessarectFilePath: optional Azure Blob Storage location for tessdata (see Tessdata provider).

Note that the Azure blob URL must include a SAS token that grants at least the following permissions:

Listing
Reading
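
The two permissions above can be checked before handing a URL to the parser. The sketch below is a hypothetical helper, not part of the package. It relies on the standard SAS query parameter `sp` (signed permissions), where `r` means read and `l` means list. A SAS bound to a stored access policy may carry no `sp` parameter at all, in which case this check returns false even though the token could be valid.

```csharp
using System;

public static class SasPermissionSketch
{
    // Returns true when the URL's "sp" (signed permissions) query parameter
    // contains both 'r' (read) and 'l' (list).
    public static bool HasReadAndList(string blobUrl)
    {
        var uri = new Uri(blobUrl);
        string query = uri.Query.TrimStart('?');

        foreach (string pair in query.Split('&', StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = pair.Split('=', 2);
            if (parts.Length == 2 && parts[0] == "sp")
            {
                string permissions = Uri.UnescapeDataString(parts[1]);
                return permissions.Contains('r') && permissions.Contains('l');
            }
        }
        return false; // no "sp" parameter found
    }
}
```

The check only inspects the URL text. It does not verify the signature or the expiry time, so a passing result is a sanity check rather than proof that the storage service will accept the token.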

Tessdata provider

Since the OcrParser requires tessdata to function, the TessDataProvider takes care of supplying it. If there is no local tessdata folder containing the required tessdata file, the TessDataProvider downloads the respective file from either an Azure blob storage or the tessdata GitHub repository.

The TessDataProvider is created via the TessDataProviderFactory, which creates either a GitHubTessDataProvider or an AzureBlobStorageTessDataProvider depending on whether an Azure blob storage URL is provided.
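
The selection logic described in this section can be summarised in a short sketch. The enum and class below are illustrative stand-ins, not the package's TessDataProviderFactory, and the order of the checks is an assumption drawn from the two paragraphs above. The only external fact used is that Tesseract model files are named `<language>.traineddata`.

```csharp
using System;
using System.IO;

public enum TessDataSource { LocalFolder, AzureBlobStorage, GitHub }

public static class TessDataSourceSketch
{
    public static TessDataSource Choose(string tessdataFolder, string language,
                                        string azureBlobUrl)
    {
        // Tesseract model files are named "<language>.traineddata", e.g. "eng.traineddata".
        string localFile = Path.Combine(tessdataFolder, language + ".traineddata");
        if (File.Exists(localFile))
            return TessDataSource.LocalFolder;

        // No local file: use Azure when a URL was configured, otherwise GitHub.
        return string.IsNullOrWhiteSpace(azureBlobUrl)
            ? TessDataSource.GitHub
            : TessDataSource.AzureBlobStorage;
    }
}
```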

Content Importer

The Content Importer can process and extract data from:

Supported File Types:

Text files: .TXT, .DOCX, .PDF
Image files: .EMF, .WMF, .JPG, .JPEG, .JFIF, .JPE, .PNG, .BMP, .DIB, .RLE, .GIF, .EMZ, .WMZ, .TIF, .TIFF, .SVG, .ICO
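
One way to picture type detection over the extensions listed above is a lookup by file extension. Whether the package detects types by extension or by inspecting file content is not stated here, so treat this as an assumption; `DocumentKind` and `DocumentKindSketch` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public enum DocumentKind { Text, Word, Pdf, Image }

public static class DocumentKindSketch
{
    // Image extensions copied from the supported file types list.
    private static readonly HashSet<string> ImageExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".emf", ".wmf", ".jpg", ".jpeg", ".jfif", ".jpe", ".png", ".bmp", ".dib",
            ".rle", ".gif", ".emz", ".wmz", ".tif", ".tiff", ".svg", ".ico"
        };

    public static DocumentKind Classify(string path)
    {
        string ext = Path.GetExtension(path);

        if (ext.Equals(".txt", StringComparison.OrdinalIgnoreCase)) return DocumentKind.Text;
        if (ext.Equals(".docx", StringComparison.OrdinalIgnoreCase)) return DocumentKind.Word;
        if (ext.Equals(".pdf", StringComparison.OrdinalIgnoreCase)) return DocumentKind.Pdf;
        if (ImageExtensions.Contains(ext)) return DocumentKind.Image;

        // Anything else is reported as unsupported.
        throw new NotSupportedException($"Unsupported file type: '{ext}'");
    }
}
```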

Detailed Instructions

Word Files:

The importer can process Word files and extract the following content:

Images*
Tables and their content
Text

PDF Files:

The importer can process PDF files and extract the following content:

Images*
Tables and their content
Text

Image Files:

The importer interprets and extracts text from images. Critical guidelines for optimal image processing:

Avoid high-contrast or low-readability color combinations:
- Example: Yellow text on a blue background or brown text on a dark background.
Divide large images:
- Extensive information in a single image may result in misinterpretation. Split large images into smaller, readable sections.
Ensure high image quality:
- The text should be clear, without distortion or blurring effects.

Note on Image Processing:

It is essential to verify the output of image parsing, as not all images may be interpreted accurately.

The quality and accuracy of content extracted from images depend on the clarity and format of the input file.

Product	Compatible and additional computed target framework versions.
.NET	net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

Dependencies (net9.0)
- Azure.Storage.Blobs (>= 12.26.0)
- Azure.Storage.Files.Shares (>= 12.24.0)
- DocumentFormat.OpenXml (>= 3.4.1)
- HtmlAgilityPack (>= 1.12.4)
- Microsoft.Extensions.Configuration.Abstractions (>= 10.0.1)
- Microsoft.Extensions.Logging (>= 10.0.1)
- PdfPig (>= 0.1.13)
- SkiaSharp (>= 3.119.1)
- System.Drawing.Common (>= 10.0.1)
- Tesseract (>= 5.2.0)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
1.0.4	45	2/26/2026
1.0.2	228	9/2/2025
1.0.1	227	6/19/2025
1.0.0	206	6/2/2025