FieldCure.DocumentParsers
0.2.0
Lightweight document text extraction for .NET — DOCX, HWPX, XLSX, PPTX, and more. Tables are converted to markdown for LLM / RAG consumption.
Features
- DOCX: Paragraphs, tables (including nested tables), and multi-run text via the OpenXML SDK
- HWPX: Korean standard format (KS X 6101 / OWPML). Paragraphs, tables, multi-section support
- XLSX: Spreadsheet sheets as markdown tables with SharedString resolution
- PPTX: Slide text, tables, and speaker notes extraction
- Markdown tables: All document tables are converted to markdown with pipe escaping
- Factory pattern: `DocumentParserFactory.GetParser(".docx")` returns the right parser
- Zero platform dependency: Targets `net8.0`, no Windows-specific APIs
- Extensible: Implement `IDocumentParser` and call `DocumentParserFactory.Register()`
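To make the factory and extensibility bullets concrete, here is a minimal, self-contained sketch of an extension-keyed parser registry. It does not reproduce the package's code: `ITextParserSketch`, `PlainTextParserSketch`, and `ParserRegistrySketch` are hypothetical names, and the real `IDocumentParser` and `DocumentParserFactory.Register()` may have different members and signatures. The only details taken from this page are the `ExtractText(bytes)` call shape, the `null` result for unknown extensions, and the use of `ConcurrentDictionary` mentioned in the release notes.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the package's parser interface.
public interface ITextParserSketch
{
    string ExtractText(byte[] data);
}

// Hypothetical custom parser: treats the input as UTF-8 plain text.
public sealed class PlainTextParserSketch : ITextParserSketch
{
    public string ExtractText(byte[] data) => Encoding.UTF8.GetString(data);
}

// Hypothetical registry illustrating extension-based lookup.
public static class ParserRegistrySketch
{
    // Case-insensitive keys so ".TXT" and ".txt" resolve to the same parser.
    // ConcurrentDictionary allows registration from multiple threads.
    private static readonly ConcurrentDictionary<string, ITextParserSketch> Parsers =
        new(StringComparer.OrdinalIgnoreCase);

    // Accept both "txt" and ".txt" (an assumption of this sketch).
    private static string Normalize(string extension) =>
        extension.StartsWith('.') ? extension : "." + extension;

    // Adds a parser, replacing any parser already registered for the extension.
    public static void Register(string extension, ITextParserSketch parser)
    {
        ArgumentNullException.ThrowIfNull(parser);
        Parsers[Normalize(extension)] = parser;
    }

    // Returns null when no parser is registered for the extension.
    public static ITextParserSketch? GetParser(string extension) =>
        Parsers.TryGetValue(Normalize(extension), out var parser) ? parser : null;

    public static IEnumerable<string> SupportedExtensions => Parsers.Keys;
}
```

In this sketch, calling `ParserRegistrySketch.Register("txt", new PlainTextParserSketch())` makes `ParserRegistrySketch.GetParser(".TXT")` return that parser, while an unregistered extension yields `null`, the case the Quick Start's `is not null` check guards against.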
Install
```shell
dotnet add package FieldCure.DocumentParsers
```
Quick Start
```csharp
using System;
using System.IO;
using FieldCure.DocumentParsers;

// Auto-detect parser by extension
var parser = DocumentParserFactory.GetParser(".docx");
if (parser is not null)
{
    var bytes = File.ReadAllBytes("report.docx");
    var text = parser.ExtractText(bytes);
    Console.WriteLine(text);
}

// Check all supported extensions
foreach (var ext in DocumentParserFactory.SupportedExtensions)
    Console.WriteLine(ext); // .docx, .hwpx, .xlsx, .pptx
```
Output Format
Paragraphs are separated by newlines. Tables are rendered as markdown:
```
2025 Business Plan
Please refer to the table below for details.
| Category | Q1 | Q2 |
| --- | --- | --- |
| Revenue | 100 | 150 |
| Cost | 80 | 90 |
End of report.
```
Pipe characters inside cells are escaped as \| to preserve table structure.
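The escaping rule can be illustrated with a short, self-contained sketch. This is not the package's implementation: `MarkdownTableSketch` and its methods are hypothetical names, and treating the first row as the header and padding short rows with empty cells are assumptions made for the example.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper showing one way to render rows as a markdown table.
public static class MarkdownTableSketch
{
    // Escape pipes so cell text cannot be mistaken for a column separator.
    public static string EscapeCell(string? cell) =>
        (cell ?? string.Empty).Replace("|", "\\|");

    // Renders rows as a markdown table; the first row is used as the header.
    public static string Render(IReadOnlyList<IReadOnlyList<string>> rows)
    {
        if (rows.Count == 0) return string.Empty;

        // Use the widest row so ragged input still yields a rectangular table.
        int columns = rows.Max(r => r.Count);

        var lines = new List<string> { FormatRow(rows[0], columns) };
        lines.Add("| " + string.Join(" | ", Enumerable.Repeat("---", columns)) + " |");
        lines.AddRange(rows.Skip(1).Select(r => FormatRow(r, columns)));
        return string.Join("\n", lines);
    }

    // Pads missing cells with empty strings and escapes each cell.
    private static string FormatRow(IReadOnlyList<string> row, int columns)
    {
        var cells = Enumerable.Range(0, columns)
            .Select(i => i < row.Count ? EscapeCell(row[i]) : string.Empty);
        return "| " + string.Join(" | ", cells) + " |";
    }
}
```

With this sketch, a cell containing `A|B` is emitted as `A\|B`, so a markdown renderer or an LLM reading the table still sees the row's original number of columns.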
Supported Formats
| Format | Extension | Parser | Description |
|---|---|---|---|
| Word | `.docx` | `DocxParser` | OpenXML (Office 2007+) |
| Hangul | `.hwpx` | `HwpxParser` | OWPML (Hancom Office) |
| Excel | `.xlsx` | `XlsxParser` | OpenXML spreadsheets |
| PowerPoint | `.pptx` | `PptxParser` | OpenXML presentations |
PDF Support
PDF requires native libraries (PDFium). Install the separate package:
```shell
dotnet add package FieldCure.DocumentParsers.Pdf
```
See FieldCure.DocumentParsers.Pdf for details.
Related Packages
- FieldCure.DocumentParsers.Pdf — PDF text extraction and page rendering
- FieldCure.AssistStudio.Core — AI provider client library that uses this package for document attachments
License
MIT — Copyright (c) 2026 FieldCure Co., Ltd.
Dependencies (net8.0)
- DocumentFormat.OpenXml (>= 3.5.1)
NuGet packages (4)
Showing the top 4 NuGet packages that depend on FieldCure.DocumentParsers:
| Package | Description |
|---|---|
| FieldCure.Ai.Providers | AI provider clients for Claude, OpenAI, Gemini, Ollama, and Groq. Shared models and streaming support. |
| FieldCure.DocumentParsers.Audio | Audio transcription parser for FieldCure.DocumentParsers. Converts MP3, WAV, M4A, OGG, FLAC, and WebM audio into timestamped Markdown transcripts via Whisper.net. |
| FieldCure.DocumentParsers.Pdf | PDF text extraction and page image rendering for FieldCure.DocumentParsers |
| FieldCure.DocumentParsers.Imaging | PDF page image rendering for FieldCure.DocumentParsers via PDFtoImage (PDFium). Adds IMediaDocumentParser capability to the core PDF parser. |
# Release Notes — FieldCure.DocumentParsers
## [0.2.0] - 2026-03-25
### Added
- `XlsxParser` — XLSX spreadsheet extraction as markdown tables with SharedString resolution
- `PptxParser` — PPTX slide text, tables, and speaker notes extraction
- `IMediaDocumentParser` interface for parsers with image extraction capability
- `DocumentImage` record for extracted images with label and index
- `DocumentParserFactory.Register()` method for external parser registration (e.g., PDF)
### Changed
- `DocumentParserFactory` now uses `ConcurrentDictionary` for thread-safe dynamic registration
- `SupportedExtensions` now includes `.xlsx` and `.pptx`
## [0.1.0] - 2026-03-22
### Added
- `IDocumentParser` interface for extensible document text extraction
- `DocxParser` — DOCX text extraction with markdown table conversion (via OpenXML SDK)
- `HwpxParser` — HWPX (Korean OWPML) text extraction with markdown table conversion
- `DocumentParserFactory` — extension-based parser resolution with `SupportedExtensions` discovery
- Markdown table output with pipe escaping for LLM / RAG consumption
- NuGet package README with quick start guide