AgentSdk.Pdf
1.0.0
Cyclotron.Maf.AgentSdk.Pdf
PDF processing extensions for Cyclotron.Maf.AgentSdk. Provides PDF image extraction, content analysis, and markdown conversion using PdfPig.
Features
📄 PDF Content Analysis
- Automatic content type detection - Classifies PDFs as TextBased, ImageOnly, or Mixed
- Configurable thresholds - Customizable text/image ratio for classification
- Page-level analysis - Per-page statistics and diagnostics
- Pluggable analyzers - Keyed DI pattern for custom implementations
🖼️ PDF Image Extraction
- Embedded image extraction - Extract XObject images from PDF pages
- Image rendering - Rasterize image-only pages for vision models
- Format conversion - PNG, JPEG output with configurable quality
- Base64 encoding - Ready for Azure OpenAI GPT-4 Vision integration
- Size filtering - Configurable min/max dimensions and file size limits
- Platform support - Cross-platform with libgdiplus on Linux
📝 PDF to Markdown Conversion
- Smart layout detection - Uses DocstrumBoundingBoxes for text block detection
- Reading order preservation - Maintains logical document flow
- Debug output - Optional markdown file saving for troubleshooting
- Content-aware processing - Integrates with content analyzer for optimal results
Installation
dotnet add package AgentSdk.Pdf
Platform Requirements:
- Windows: No additional dependencies
- Linux: Install libgdiplus for image processing
# Ubuntu/Debian
sudo apt-get install libgdiplus
# Alpine (Docker)
apk add libgdiplus
- Environment Variable: Set DOTNET_SYSTEM_DRAWING_ENABLE_UNIX_SUPPORT=1 on Linux
Quick Start
using Cyclotron.Maf.AgentSdk.Options;
using Cyclotron.Maf.AgentSdk.Services;
using Microsoft.Extensions.DependencyInjection;
// Register PDF services
services.AddPdfServices();

// Use PDF content analyzer
var analyzer = serviceProvider.GetRequiredKeyedService<IPdfContentAnalyzer>("pdfpig");
var analysis = await analyzer.AnalyzeAsync("invoice.pdf", cancellationToken);

if (analysis.ContentType == PdfContentType.TextBased)
{
    // Convert to markdown
    var converter = serviceProvider.GetRequiredService<IPdfToMarkdownConverter>();
    var markdown = await converter.ConvertToMarkdownAsync("invoice.pdf", cancellationToken);
}
else if (analysis.ContentType == PdfContentType.ImageOnly)
{
    // Extract images for vision model
    var extractor = serviceProvider.GetRequiredKeyedService<IPdfImageExtractor>("pdfpig");
    var images = await extractor.ExtractImagesAsync("invoice.pdf", cancellationToken);

    // Use with Azure OpenAI GPT-4 Vision
    foreach (var image in images)
    {
        // image.ImageBase64 ready for DataContent
        // image.MimeType for content type
    }
}
Configuration
Configure PDF processing in appsettings.json or agent.config.yaml:
PDF Content Analysis
{
  "PdfContentAnalysis": {
    "Enabled": true,
    "AnalyzerKey": "pdfpig",
    "FailureStrategy": "fallback",
    "TextRatioThreshold": 0.1,
    "MaxPagesToAnalyze": 0,
    "MinCharactersPerPage": 10,
    "LogDetailedResults": false
  }
}
Options:
- Enabled - Enable/disable PDF content analysis
- AnalyzerKey - Keyed service name ("pdfpig" by default)
- FailureStrategy - Skip, Throw, or Fallback on errors
- TextRatioThreshold - Minimum text ratio for TextBased classification (0.0 - 1.0)
- MaxPagesToAnalyze - Limit pages to analyze (0 = all pages)
- MinCharactersPerPage - Minimum characters to consider a page as text
- LogDetailedResults - Enable detailed logging
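The package's classification logic is not documented here, so the following is only a sketch of one plausible reading of how TextRatioThreshold, MinCharactersPerPage, and MaxPagesToAnalyze could interact. The type and method names (SketchContentType, ContentClassifierSketch, Classify) are hypothetical and not part of AgentSdk.Pdf, and the rules for separating ImageOnly from Mixed are assumptions.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for PdfContentType; not the package's enum.
public enum SketchContentType { TextBased, ImageOnly, Mixed }

public static class ContentClassifierSketch
{
    // charsPerPage holds the number of extracted characters for each page.
    public static SketchContentType Classify(
        IReadOnlyList<int> charsPerPage,
        double textRatioThreshold = 0.1,
        int minCharactersPerPage = 10,
        int maxPagesToAnalyze = 0)
    {
        // 0 means "analyze every page", mirroring MaxPagesToAnalyze.
        var pages = maxPagesToAnalyze > 0
            ? charsPerPage.Take(maxPagesToAnalyze).ToList()
            : charsPerPage.ToList();

        if (pages.Count == 0)
            return SketchContentType.ImageOnly;

        // A page counts as text when it reaches MinCharactersPerPage.
        int textPages = pages.Count(c => c >= minCharactersPerPage);
        double textRatio = (double)textPages / pages.Count;

        // Assumed rules: every page has text => TextBased; too little text
        // to reach the threshold => ImageOnly; anything in between => Mixed.
        if (textPages == pages.Count) return SketchContentType.TextBased;
        if (textRatio < textRatioThreshold) return SketchContentType.ImageOnly;
        return SketchContentType.Mixed;
    }
}
```

Under these assumed rules, a four-page document with text on only one page has a text ratio of 0.25 and is treated as Mixed with the default threshold of 0.1.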
PDF Image Extraction
{
  "PdfImageExtraction": {
    "Enabled": true,
    "ExtractorKey": "pdfpig",
    "MaxPagesToProcess": 0,
    "MaxImageSizeBytes": 5242880,
    "PreferredFormat": "jpeg",
    "JpegQuality": 85,
    "EncodeAsBase64": true,
    "MinImageWidth": 50,
    "MinImageHeight": 50,
    "SkipTextOnlyPages": true,
    "LogDetailedResults": false
  }
}
Options:
- Enabled - Enable/disable image extraction
- ExtractorKey - Keyed service name ("pdfpig" by default)
- MaxPagesToProcess - Limit pages to process (0 = all pages)
- MaxImageSizeBytes - Maximum image file size (5MB default)
- PreferredFormat - "jpeg" or "png" (PdfPig always outputs PNG)
- JpegQuality - JPEG compression quality (1-100)
- EncodeAsBase64 - Encode images as base64 strings
- MinImageWidth / MinImageHeight - Minimum dimensions to extract
- SkipTextOnlyPages - Skip pages with text but no images
- LogDetailedResults - Enable detailed logging
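To make the size-filtering and base64 options concrete, here is a minimal sketch of the kind of check and encoding they describe. ImageFilterSketch, PassesFilters, and ToDataUri are hypothetical helpers, not package APIs, and treating the limits as inclusive is an assumption.

```csharp
using System;

// Hypothetical helpers; they only illustrate the options listed above.
public static class ImageFilterSketch
{
    // Defaults mirror MinImageWidth, MinImageHeight, and MaxImageSizeBytes
    // from the sample configuration. Inclusive bounds are an assumption.
    public static bool PassesFilters(
        int width, int height, long sizeBytes,
        int minWidth = 50, int minHeight = 50, long maxSizeBytes = 5_242_880)
        => width >= minWidth && height >= minHeight && sizeBytes <= maxSizeBytes;

    // Builds a data URI from raw image bytes, the usual shape for sending
    // an inline image to a vision model.
    public static string ToDataUri(byte[] imageBytes, string mimeType)
        => $"data:{mimeType};base64,{Convert.ToBase64String(imageBytes)}";
}
```

For example, `ImageFilterSketch.PassesFilters(40, 200, 1000)` returns false because the width is below the 50-pixel minimum.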
PDF to Markdown Conversion
{
  "PdfConversion": {
    "Enabled": true,
    "SaveMarkdownForDebug": false,
    "OutputDirectory": "./output",
    "IncludePageNumbers": true,
    "PreserveParagraphStructure": true,
    "IncludeTimestampInFilename": false,
    "MarkdownFileExtension": ".md"
  }
}
Options:
- Enabled - Enable/disable markdown conversion
- SaveMarkdownForDebug - Save markdown files for debugging
- OutputDirectory - Directory for debug markdown files
- IncludePageNumbers - Add page number markers in markdown
- PreserveParagraphStructure - Maintain paragraph breaks
- IncludeTimestampInFilename - Add timestamp to debug filenames
- MarkdownFileExtension - File extension for markdown files
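The debug-output options suggest a file name derived from the source PDF, an optional timestamp, and the configured extension. The sketch below shows one way such a path could be composed. DebugFileNameSketch and the timestamp format are assumptions; the package's actual naming scheme may differ.

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical helper; the real naming scheme is not documented here.
public static class DebugFileNameSketch
{
    public static string BuildPath(
        string pdfPath,
        string outputDirectory,
        bool includeTimestamp,
        string extension,
        DateTime nowUtc)
    {
        // "invoice.pdf" becomes "invoice".
        var baseName = Path.GetFileNameWithoutExtension(pdfPath);

        // Assumed timestamp format: yyyyMMdd_HHmmss.
        if (includeTimestamp)
        {
            baseName += "_" + nowUtc.ToString("yyyyMMdd_HHmmss", CultureInfo.InvariantCulture);
        }

        return Path.Combine(outputDirectory, baseName + extension);
    }
}
```

Passing the clock in as a parameter keeps the helper deterministic, which makes it straightforward to unit test.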
Advanced Usage
Workflow Pattern
// 1. Analyze PDF content
var analyzer = serviceProvider.GetRequiredKeyedService<IPdfContentAnalyzer>("pdfpig");
var analysis = await analyzer.AnalyzeAsync(pdfPath, cancellationToken);

// Resolve both services up front so every case below can use them
var converter = serviceProvider.GetRequiredService<IPdfToMarkdownConverter>();
var extractor = serviceProvider.GetRequiredKeyedService<IPdfImageExtractor>("pdfpig");

// 2. Route based on content type
switch (analysis.ContentType)
{
    case PdfContentType.TextBased:
        // Text extraction workflow
        var markdown = await converter.ConvertToMarkdownAsync(pdfPath, cancellationToken);
        // Process markdown with LLM
        break;
    case PdfContentType.ImageOnly:
        // Vision model workflow
        var images = await extractor.ExtractImagesAsync(pdfPath, cancellationToken);
        // Process images with GPT-4 Vision
        break;
    case PdfContentType.Mixed:
        // Hybrid workflow - use both
        var markdownContent = await converter.ConvertToMarkdownAsync(pdfPath, cancellationToken);
        var extractedImages = await extractor.ExtractImagesAsync(pdfPath, cancellationToken);
        // Combine text and image processing
        break;
}
Streaming Image Extraction
For large PDFs, use streaming extraction to process images one at a time:
var extractor = serviceProvider.GetRequiredKeyedService<IPdfImageExtractor>("pdfpig");
var imageCount = await extractor.ExtractImagesStreamAsync(
    pdfStream,
    "large-document.pdf",
    async (image) =>
    {
        // Process each image as it's extracted
        await ProcessImageWithVisionModelAsync(image);
        // Return false to stop extraction
        return true;
    },
    cancellationToken);
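The callback contract used here (return true to continue, false to stop) can be illustrated independently of the package. The following is a generic sketch of that pattern. StreamingSketch and ForEachUntilAsync are hypothetical names, and counting the item that triggered the stop is an assumption about what the returned count means.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration of a "continue while the callback returns true" loop.
public static class StreamingSketch
{
    public static async Task<int> ForEachUntilAsync<T>(
        IEnumerable<T> source,
        Func<T, Task<bool>> onItem,
        CancellationToken cancellationToken = default)
    {
        int delivered = 0;
        foreach (var item in source)
        {
            // Honor cancellation between items.
            cancellationToken.ThrowIfCancellationRequested();

            delivered++;

            // The callback returns false to request an early stop.
            if (!await onItem(item))
            {
                break;
            }
        }

        // Number of items handed to the callback, including the last one.
        return delivered;
    }
}
```

Because items are handed over one at a time, the caller never needs to hold the full set of images in memory, which is the point of the streaming overload for large PDFs.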
Extract from Specific Pages
var extractor = serviceProvider.GetRequiredKeyedService<IPdfImageExtractor>("pdfpig");
var pageNumbers = new[] { 1, 3, 5 }; // Extract from pages 1, 3, and 5
var images = await extractor.ExtractImagesAsync(pdfPath, pageNumbers, cancellationToken);
Custom Analyzer Implementation
Register a custom analyzer alongside the default PdfPig implementation:
services.AddKeyedSingleton<IPdfContentAnalyzer>(
    "custom",
    (sp, _) => new MyCustomAnalyzer(
        sp.GetRequiredService<ILogger<MyCustomAnalyzer>>(),
        sp.GetRequiredService<IOptions<PdfContentAnalysisOptions>>()));

// Configure to use custom analyzer
services.Configure<PdfContentAnalysisOptions>(options =>
{
    options.AnalyzerKey = "custom";
});
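For readers new to keyed services, this self-contained sketch shows the underlying Microsoft.Extensions.DependencyInjection mechanism (available from version 8.0) with stand-in types. ISketchAnalyzer, DefaultSketchAnalyzer, CustomSketchAnalyzer, and KeyedDiSketch are hypothetical and do not exist in the package; only AddKeyedSingleton and GetRequiredKeyedService are real DI APIs.

```csharp
using Microsoft.Extensions.DependencyInjection;

// Hypothetical stand-ins for IPdfContentAnalyzer and its implementations.
public interface ISketchAnalyzer { string Name { get; } }
public sealed class DefaultSketchAnalyzer : ISketchAnalyzer { public string Name => "pdfpig"; }
public sealed class CustomSketchAnalyzer : ISketchAnalyzer { public string Name => "custom"; }

public static class KeyedDiSketch
{
    // Registers two implementations under different keys and resolves
    // the one selected by analyzerKey (for example, a configured AnalyzerKey).
    public static ISketchAnalyzer Resolve(string analyzerKey)
    {
        var services = new ServiceCollection();
        services.AddKeyedSingleton<ISketchAnalyzer, DefaultSketchAnalyzer>("pdfpig");
        services.AddKeyedSingleton<ISketchAnalyzer, CustomSketchAnalyzer>("custom");

        using var provider = services.BuildServiceProvider();
        return provider.GetRequiredKeyedService<ISketchAnalyzer>(analyzerKey);
    }
}
```

Both registrations coexist, so switching implementations is a matter of changing the key rather than replacing a registration.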
Integration with AgentSdk
PDF services can be injected directly into AgentSdk workflows through constructor injection:
// In your workflow executor
public class InvoiceExtractionWorkflow : IInvoiceExtractionWorkflow
{
    private readonly IPdfContentAnalyzer _analyzer;
    private readonly IPdfToMarkdownConverter _converter;
    private readonly IPdfImageExtractor _extractor;
    private readonly IAgentFactory _agentFactory;

    public InvoiceExtractionWorkflow(
        [FromKeyedServices("pdfpig")] IPdfContentAnalyzer analyzer,
        IPdfToMarkdownConverter converter,
        [FromKeyedServices("pdfpig")] IPdfImageExtractor extractor,
        [FromKeyedServices("extraction")] IAgentFactory agentFactory)
    {
        _analyzer = analyzer;
        _converter = converter;
        _extractor = extractor;
        _agentFactory = agentFactory;
    }

    public async Task<WorkflowResult<InvoiceData>> ExecuteAsync(
        WorkflowInput input,
        CancellationToken cancellationToken = default)
    {
        // Analyze content
        var analysis = await _analyzer.AnalyzeAsync(input.FilePath, cancellationToken);

        // Extract data based on content type
        var context = analysis.ContentType == PdfContentType.TextBased
            ? await _converter.ConvertToMarkdownAsync(input.FilePath, cancellationToken)
            : string.Empty;
        var images = analysis.ContentType != PdfContentType.TextBased
            ? await _extractor.ExtractImagesAsync(input.FilePath, cancellationToken)
            : Array.Empty<ExtractedPdfImage>();

        // Process with agent
        await _agentFactory.CreateAgentAsync(vectorStoreId: null, cancellationToken);
        var response = await _agentFactory.RunAgentWithPollingAsync(
            messages: BuildMessages(context, images),
            cancellationToken: cancellationToken);

        return ParseInvoiceData(response);
    }
}
See SpamDetection sample for complete workflow examples.
Performance Tips
- Set MaxPagesToProcess - Limit pages for large documents
- Use SkipTextOnlyPages - Skip text pages during image extraction
- Configure MaxImageSizeBytes - Limit memory usage for large images
- Use streaming extraction - Process large PDFs efficiently
- Cache content analysis results - Analyze each PDF once and reuse the result for routing
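No caching option appears in the configuration shown in this README, so "analyze once" is presumably left to the caller. A minimal sketch of an application-level cache follows. AnalysisCacheSketch is a hypothetical class, and the analyze delegate stands in for a call such as AnalyzeAsync.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Hypothetical application-level cache; not part of AgentSdk.Pdf.
public sealed class AnalysisCacheSketch<TResult>
{
    private readonly ConcurrentDictionary<string, Lazy<Task<TResult>>> _cache = new();
    private readonly Func<string, Task<TResult>> _analyze;

    public AnalysisCacheSketch(Func<string, Task<TResult>> analyze)
    {
        _analyze = analyze;
    }

    // Returns the cached analysis for a path, running the analyzer at most
    // once per distinct full path. Lazy ensures concurrent callers share
    // a single in-flight task.
    public Task<TResult> GetOrAnalyzeAsync(string pdfPath)
    {
        var key = Path.GetFullPath(pdfPath);
        return _cache.GetOrAdd(
            key,
            k => new Lazy<Task<TResult>>(() => _analyze(k))).Value;
    }
}
```

Note that this sketch also caches failed tasks and never evicts entries, so a production version would likely need invalidation when a file changes or an analysis fails.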
Troubleshooting
libgdiplus not found on Linux
# Install libgdiplus
sudo apt-get update && sudo apt-get install -y libgdiplus
# Set environment variable
export DOTNET_SYSTEM_DRAWING_ENABLE_UNIX_SUPPORT=1
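A startup check can surface a missing environment variable before the first extraction fails. The sketch below only inspects the variable named in this README. LinuxPrereqSketch is a hypothetical helper, it does not verify that libgdiplus itself is installed, and treating only the value "1" as enabled is an assumption.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical startup check; not part of AgentSdk.Pdf.
public static class LinuxPrereqSketch
{
    private const string VariableName = "DOTNET_SYSTEM_DRAWING_ENABLE_UNIX_SUPPORT";

    // Assumption: only the literal value "1" counts as enabled.
    public static bool IsFlagEnabled(string? value) => value == "1";

    // Returns a warning message on Linux when the variable is not set to 1,
    // or null when nothing needs attention.
    public static string? Check()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Linux))
        {
            return null;
        }

        var value = Environment.GetEnvironmentVariable(VariableName);
        return IsFlagEnabled(value)
            ? null
            : $"{VariableName} is not set to 1; image processing may fail on Linux.";
    }
}
```

Logging the result of `LinuxPrereqSketch.Check()` during application startup would make the misconfiguration visible in container logs.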
Images not extracting
- Check Enabled is true in the PdfImageExtraction configuration
- Verify the MinImageWidth and MinImageHeight thresholds
- Enable LogDetailedResults to see detailed extraction logs
- Ensure the PDF contains actual images (not scanned text rendered as images)
Poor markdown quality
- Enable PreserveParagraphStructure for better formatting
- Use IncludePageNumbers for document navigation
- Consider using a vision model for image-heavy documents instead
Dependencies
- PdfPig v0.1.13 - PDF parsing and extraction
- System.Drawing.Common (>= 10.0.3 in the 1.0.0 package metadata) - Image processing (requires libgdiplus on Linux)
- Microsoft.Extensions.* - DI, configuration, and logging abstractions
Related Packages
- AgentSdk - Core SDK with agent factories, workflow orchestration, and vector stores
- AgentSdk.HOA (planned) - Domain-specific workflows for HOA document processing
Contributing
See the main repository for contribution guidelines.
License
MIT License - see LICENSE for details.
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net8.0
  - Microsoft.Extensions.Logging (>= 10.0.3)
  - Microsoft.Extensions.Options (>= 10.0.3)
  - Microsoft.Extensions.Options.ConfigurationExtensions (>= 10.0.3)
  - Microsoft.Extensions.Options.DataAnnotations (>= 10.0.3)
  - PdfPig (>= 0.1.13)
  - System.Drawing.Common (>= 10.0.3)
| Version | Downloads | Last Updated |
|---|---|---|
| 1.0.0 | 34 | 2/24/2026 |