Daenet.DocumentParser
1.0.4
dotnet add package Daenet.DocumentParser --version 1.0.4
Daenet Document Parser
The Daenet Document Parser software provides a document importer capable of handling various document types, such as .TXT, .PDF, and .DOCX, as well as images. The content from these files is extracted and saved as a .TXT file.
What Is a Parser?
A parser reads the content of a document and outputs it as a structured text file.
Supported Parsers
The Daenet Document Parser includes the following types of parsers:
- PDF Parser
- Word Parser
- TXT Parser
- Image Parser
Usage:
- All documents must be stored in an input folder specified via command-line arguments.
- The document type is automatically detected and processed accordingly.
- Unsupported documents will generate an error message.
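The detection step can be pictured as a lookup on the file extension. The following is a minimal sketch under that assumption; the README does not say how detection is actually implemented, and ParserKind and ParserSelector are hypothetical names that are not part of the Daenet.DocumentParser API. The extension lists are taken from the supported file types named in this README.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical types for illustration only; not part of Daenet.DocumentParser.
public enum ParserKind { Txt, Word, Pdf, Image }

public static class ParserSelector
{
    // Image extensions as listed under "Supported File Types" in this README.
    private static readonly HashSet<string> ImageExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".emf", ".wmf", ".jpg", ".jpeg", ".jfif", ".jpe", ".png", ".bmp", ".dib",
            ".rle", ".gif", ".emz", ".wmz", ".tif", ".tiff", ".svg", ".ico"
        };

    // Maps a file path to a parser kind based on its extension (case-insensitive).
    // Throws NotSupportedException for anything outside the supported list,
    // mirroring the "unsupported documents generate an error" behavior.
    public static ParserKind Select(string path)
    {
        string ext = Path.GetExtension(path);

        if (ext.Equals(".txt", StringComparison.OrdinalIgnoreCase)) return ParserKind.Txt;
        if (ext.Equals(".docx", StringComparison.OrdinalIgnoreCase)) return ParserKind.Word;
        if (ext.Equals(".pdf", StringComparison.OrdinalIgnoreCase)) return ParserKind.Pdf;
        if (ImageExtensions.Contains(ext)) return ParserKind.Image;

        throw new NotSupportedException($"Unsupported document type '{ext}': {path}");
    }
}
```

With this sketch, ParserSelector.Select("report.PDF") yields ParserKind.Pdf, while ParserSelector.Select("data.xlsx") throws a NotSupportedException.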
Setting Up the Document Parser
To use the DocumentParser, you must also provide an OcrParserConfig, because WordParser and PDFParser rely on OcrParser to extract and read the content of images embedded in a document.
DocumentParserConfig configuration:
The parser uses designated input and output folders. These folders can be specified using command-line arguments:
- Use InputFolder to define the location of input files.
- Use OutputFolder to define where the output .TXT files will be saved.
Alternatively, they can be set in an appsettings.json file.
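As a rough illustration only, an appsettings.json for the two folder settings might look like the fragment below. The key names InputFolder and OutputFolder come from the description above; the enclosing section name and the example paths are assumptions, so check the package source for the actual layout.

```json
{
  "DocumentParserConfig": {
    "InputFolder": "./input",
    "OutputFolder": "./output"
  }
}
```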
OcrParser configuration
The OcrParser uses the Tesseract engine to interpret the content of an image and requires an OcrParserConfig.
The config contains the following parameters, which can be supplied either via command-line arguments or from an appsettings.json file:
- OcrParserLanguage: the language expected in the image.
- TessarectFilePath: model file for the respective language (see Tessdata provider).
- TessarectFilePath: optional Azure Blob storage for tessdata (see Tessdata provider).
Note that the Azure Blob URL must include a SAS token that authorizes at least:
- Listing
- Reading
Tessdata provider
Because the OcrParser requires tessdata to function, the TessDataProvider supplies it. If no local tessdata folder contains the required tessdata file, the TessDataProvider downloads that file from either an Azure Blob storage or the tessdata GitHub repository.
The TessDataProvider is created by the TessDataProviderFactory, which creates either a GitHubTessDataProvider or an AzureBlobStorageTessDataProvider, depending on whether an Azure Blob storage URL is provided.
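The selection logic described above can be sketched as follows. This is not the package's code: ITessDataSource, GitHubSource, AzureBlobSource, and TessDataSourceFactory are hypothetical stand-ins for the classes named in this section, and no download is performed. The only outside fact used is that Tesseract language models are stored as files named after the language code with a .traineddata suffix (for example eng.traineddata).

```csharp
#nullable enable
using System;
using System.IO;

// Hypothetical stand-in for the provider abstraction; illustration only.
public interface ITessDataSource
{
    string Describe();
}

// Stand-in for the GitHub-backed provider described above.
public sealed class GitHubSource : ITessDataSource
{
    public string Describe() => "tessdata GitHub repository";
}

// Stand-in for the Azure Blob-backed provider described above.
public sealed class AzureBlobSource : ITessDataSource
{
    // URL of the blob storage, including its SAS token.
    public string ContainerUrl { get; }

    public AzureBlobSource(string containerUrl) => ContainerUrl = containerUrl;

    // The URL is deliberately not echoed, since it carries the SAS token.
    public string Describe() => "Azure Blob storage";
}

public static class TessDataSourceFactory
{
    // Picks the Azure source when a URL is supplied, otherwise falls back to GitHub.
    public static ITessDataSource Create(string? azureBlobUrl)
    {
        if (string.IsNullOrWhiteSpace(azureBlobUrl))
        {
            return new GitHubSource();
        }
        return new AzureBlobSource(azureBlobUrl);
    }

    // True when the local tessdata folder lacks the model file for the language.
    public static bool NeedsDownload(string tessdataFolder, string language)
    {
        string modelPath = Path.Combine(tessdataFolder, language + ".traineddata");
        return !File.Exists(modelPath);
    }
}
```

In this sketch, a caller would first check NeedsDownload for the configured language and only then ask the factory for a source, so that an existing local tessdata file is always preferred.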
Content Importer
The Content Importer can process and extract data from:
Supported File Types:
- Text files: .TXT, .DOCX, .PDF
- Image files: .EMF, .WMF, .JPG, .JPEG, .JFIF, .JPE, .PNG, .BMP, .DIB, .RLE, .GIF, .EMZ, .WMZ, .TIF, .TIFF, .SVG, .ICO
Detailed Instructions
Word Files:
The importer can process Word files and extract the following content:
- Images*
- Tables and their content
- Text
PDF Files:
The importer can process PDF files and extract the following content:
- Images*
- Tables and their content
- Text
Image Files:
The importer interprets and extracts text from images. Critical guidelines for optimal image processing:
Avoid high-contrast or low-readability color combinations:
- Example: Yellow text on a blue background or brown text on a dark background.
Divide large images:
- Extensive information in a single image may result in misinterpretation. Split large images into smaller, readable sections.
Ensure high image quality:
- The text should be clear, without distortion or blurring effects.
Note on Image Processing:
It is essential to verify the output of image parsing, as not all images may be interpreted accurately.
The quality and accuracy of extracted images depend on the clarity and format of the input file.
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net9.0
  - Azure.Storage.Blobs (>= 12.26.0)
  - Azure.Storage.Files.Shares (>= 12.24.0)
  - DocumentFormat.OpenXml (>= 3.4.1)
  - HtmlAgilityPack (>= 1.12.4)
  - Microsoft.Extensions.Configuration.Abstractions (>= 10.0.1)
  - Microsoft.Extensions.Logging (>= 10.0.1)
  - PdfPig (>= 0.1.13)
  - SkiaSharp (>= 3.119.1)
  - System.Drawing.Common (>= 10.0.1)
  - Tesseract (>= 5.2.0)