Mythosia.Documents.Pdf
1.1.0
PDF document loader. Parses PDF files into DoclingDocument structured models via PdfPig. Provides font-size based heading detection, bullet/numbered list recognition, and spatial paragraph grouping. Supports encrypted PDFs, metadata extraction, and page number headers.
Installation
dotnet add package Mythosia.Documents.Pdf
Quick Start
using Mythosia.Documents.Pdf;
var loader = new PdfDocumentLoader();
IReadOnlyList<DoclingDocument> docs = await loader.LoadAsync("docs/manual.pdf");
string markdown = docs[0].ToMarkdown();
With RAG Pipeline
var service = new ClaudeService(apiKey, httpClient)
    .WithRag(rag => rag
        .AddDocuments(new PdfDocumentLoader(), "docs/manual.pdf")
    );
// Or auto-select loader by extension:
var service = new ClaudeService(apiKey, httpClient)
    .WithRag(rag => rag.AddDocument("docs/manual.pdf"));
Structured Extraction
The parser analyses font sizes and spatial layout to produce a structured DoclingDocument:
- Headings: text whose font size exceeds the body font size (the mode, i.e. the most frequent size) by ≥15% is classified as heading level 1–3 based on the size ratio.
- Lists: lines starting with bullet characters (•, -, *, etc.) or numbered patterns (1., a), iv.) are emitted as list items.
- Paragraphs: words are grouped into lines by Y-coordinate proximity. Consecutive body-text lines are merged into a single paragraph; vertical gaps larger than 1.4× line height trigger a paragraph break.
- Fallback: if GetWords() returns no results but raw page text exists, the text is preserved as a paragraph.
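The rules above can be sketched roughly as follows. This is an illustration only, not the package's implementation: the `LayoutHeuristics` class, its method names, the list-marker regex, and the 1.3 and 1.6 cutoffs that separate heading levels are all assumptions. Only the 15% heading threshold and the 1.4× paragraph gap come from the description.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Mythosia.Documents.Pdf.
public static class LayoutHeuristics
{
    // Assumed marker pattern: bullets, "1." / "1)", "a." / "a)", lowercase roman "iv.".
    private static readonly Regex ListMarker = new Regex(
        @"^\s*(?:[•\-*]|\d+[.)]|[A-Za-z][.)]|[ivxlcdm]+[.)])\s+\S");

    // Body font size = the most frequent size, rounded to 0.1 pt.
    // Throws InvalidOperationException if the sequence is empty.
    public static double BodyFontSize(IEnumerable<double> lineFontSizes) =>
        lineFontSizes
            .GroupBy(size => Math.Round(size, 1))
            .OrderByDescending(group => group.Count())
            .First()
            .Key;

    // Returns 0 for body text, otherwise a heading level from 1 to 3.
    // 1.15 is the documented threshold; 1.3 and 1.6 are made-up cutoffs.
    public static int HeadingLevel(double fontSize, double bodySize)
    {
        double ratio = fontSize / bodySize;
        if (ratio < 1.15) return 0;
        if (ratio >= 1.6) return 1;
        if (ratio >= 1.3) return 2;
        return 3;
    }

    // True when the line begins with one of the assumed list markers.
    public static bool IsListItem(string line) => ListMarker.IsMatch(line);

    // A vertical gap larger than 1.4x the line height starts a new paragraph.
    public static bool IsParagraphBreak(double verticalGap, double lineHeight) =>
        verticalGap > 1.4 * lineHeight;
}
```

Under these assumed cutoffs, a 14 pt line over a 12 pt body (ratio about 1.17) would be a level 3 heading and a 24 pt line a level 1 heading. The simple regex will also misread some ordinary text as a list marker, so a real parser would need stricter rules.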
Parser Options
using Mythosia.Documents.Pdf;
var options = new PdfParserOptions
{
    Password = null,            // For encrypted PDFs
    IncludeMetadata = true,     // Extract title, author, page count
    IncludePageNumbers = false, // Add page number headers
    NormalizeWhitespace = true, // Collapse excessive whitespace (preserves newlines)
};
var loader = new PdfDocumentLoader(options: options);
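To make the NormalizeWhitespace comment concrete, here is a minimal sketch of one way to collapse whitespace while preserving newlines. The `WhitespaceSketch` class and its `Normalize` method are hypothetical and only illustrate the idea; the package's actual normalization rules may differ.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Mythosia.Documents.Pdf.
public static class WhitespaceSketch
{
    public static string Normalize(string text)
    {
        // Collapse each run of whitespace other than CR/LF into a single space.
        string collapsed = Regex.Replace(text, @"[^\S\r\n]+", " ");

        // Drop spaces left dangling immediately before a line break.
        return Regex.Replace(collapsed, @" +(\r?\n)", "$1");
    }
}
```

Keeping line breaks intact matters for text that will later be split into paragraphs or chunks, since the newline is often the only remaining layout signal once the text has left the PDF.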
Custom Parser
Implement IDocumentParser and pass it to the loader:
var loader = new PdfDocumentLoader(parser: new MyCustomPdfParser());
Related Packages
| Package | Description |
|---|---|
| Mythosia.Documents.Abstractions | Core abstractions (DoclingDocument, IDocumentLoader) |
| Mythosia.Documents.Office | Word / Excel / PowerPoint loaders |
| Mythosia.AI.Rag | RAG pipeline |
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
| .NET Core | netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
| .NET Standard | netstandard2.1 is compatible. |
| MonoAndroid | monoandroid was computed. |
| MonoMac | monomac was computed. |
| MonoTouch | monotouch was computed. |
| Tizen | tizen60 was computed. |
| Xamarin.iOS | xamarinios was computed. |
| Xamarin.Mac | xamarinmac was computed. |
| Xamarin.TVOS | xamarintvos was computed. |
| Xamarin.WatchOS | xamarinwatchos was computed. |
- .NETStandard 2.1
  - Mythosia.Documents.Abstractions (>= 1.0.0)
  - PdfPig (>= 0.1.14)
NuGet packages (1)
Showing the top 1 NuGet packages that depend on Mythosia.Documents.Pdf:
| Package | Downloads |
|---|---|
| Mythosia.AI.Rag: RAG (Retrieval Augmented Generation) orchestration for Mythosia.AI. Implements Mythosia.AI.Rag.Abstractions v5.x. Includes RagPipeline, text splitters, context builder, OpenAI/vLLM embedding providers, hybrid search (BM25 + Vector + RRF), re-ranking (Cohere, LLM, vLLM), search gate, keyword extraction, weighted-blend final selection, progress reporting, DoclingDocument-to-RagDocument conversion, and per-query VectorFilter passthrough (StoreFilter). Depends on Mythosia.AI.Abstractions (IAIService) instead of the full Mythosia.AI implementation. | |
v1.1.0: Structured extraction — font-size heading detection, bullet/numbered list recognition, spatial paragraph grouping. Direct metadata access (reflection removed). NormalizeWhitespace preserves newlines. Fallback for PDFs with no extractable words.