DocumentAtom.DataIngestion
1.1.0
dotnet add package DocumentAtom.DataIngestion --version 1.1.0
<img src="https://raw.githubusercontent.com/jchristn/DocumentAtom/refs/heads/main/assets/icon.png" width="256" height="256">
DocumentAtom
DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.
DocumentAtom requires that Tesseract v5.0 be installed on the host. This is required as certain document types can have embedded images which are parsed using OCR via Tesseract.
New in v1.2.x
- Data Ingestion Module (DocumentAtom.DataIngestion) for RAG/AI pipeline integration
- Unified document reader with automatic type detection
- Intelligent chunking with hierarchy preservation
- Configurable options for RAG, summarization, and large context windows
- Dependency injection support with Microsoft.Extensions.DependencyInjection
- Preserves full metadata from atoms for rich filtering in vector databases
New in v1.1.x
- Hierarchical atomization (see BuildHierarchy in settings) - heading-based for markdown/HTML/Word, page-based for PowerPoint
- Support for CSV, JSON, and XML documents
- MCP server (DocumentAtom.McpServer) for exposing DocumentAtom operations via Model Context Protocol to AI assistants
- Dependency updates and fixes
Motivation
Parsing documents and extracting constituent parts is one part science and one part black magic. If you find ways to improve processing and extraction that are broadly useful, I would love your feedback on how to make this library more accurate, more useful, faster, and overall better. My goal in building this library is to make it easier to analyze input data assets and make them more consumable by other systems, including analytics and artificial intelligence.
Bugs, Quality, Feedback, or Enhancement Requests
Please feel free to file issues, enhancement requests, or start discussions about use of the library, improvements, or fixes.
Types Supported
DocumentAtom supports the following input file types:
- CSV
- HTML
- JSON
- Markdown
- Microsoft Word (.docx)
- Microsoft Excel (.xlsx)
- Microsoft PowerPoint (.pptx)
- PDF
- PNG images (requires Tesseract on the host)
- Rich text (.rtf)
- Text
- XML
Simple Example
Refer to the various Test projects for working examples.
The following example shows processing a markdown (.md) file.
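using DocumentAtom.Core.Atoms;
using DocumentAtom.Markdown;
// Path to the markdown file to process (placeholder; substitute your own)
string filename = "sample.md";
MarkdownProcessorSettings settings = new MarkdownProcessorSettings();
// Pass the settings instance created above
MarkdownProcessor processor = new MarkdownProcessor(settings);
// Print each atom extracted from the file
foreach (Atom atom in processor.Extract(filename))
    Console.WriteLine(atom.ToString());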
Atom Types
DocumentAtom parses input data assets into a variety of Atom objects. Each Atom includes top-level metadata including:
- ParentGUID - globally-unique identifier of the parent atom, or null
- GUID - globally-unique identifier
- Type - including Text, Image, Binary, Table, and List
- PageNumber - where available; some document types do not explicitly indicate page numbers, and page numbers are inferred when rendered
- Position - the ordinal position of the Atom, relative to others
- Length - the length of the Atom's content
- MD5Hash - the MD5 hash of the Atom content
- SHA1Hash - the SHA1 hash of the Atom content
- SHA256Hash - the SHA256 hash of the Atom content
- Quarks - sub-atomic particles created from the Atom content, for instance, when chunking text
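To make the hash fields concrete, here is a minimal sketch of computing MD5, SHA1, and SHA256 digests over a piece of text content with the .NET standard library. It is an illustration only: the UTF-8 encoding, the lowercase hex formatting, and the Hex helper are assumptions of this sketch, and the library may hash different bytes or format its digests differently.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Format a digest as lowercase hex (formatting is an assumption of this sketch)
static string Hex(byte[] digest) => Convert.ToHexString(digest).ToLowerInvariant();

// Hash the UTF-8 bytes of the content (encoding is an assumption of this sketch)
byte[] contentBytes = Encoding.UTF8.GetBytes("hello");

string md5Hash = Hex(MD5.HashData(contentBytes));
string sha1Hash = Hex(SHA1.HashData(contentBytes));
string sha256Hash = Hex(SHA256.HashData(contentBytes));

Console.WriteLine($"MD5:    {md5Hash}");
Console.WriteLine($"SHA1:   {sha1Hash}");
Console.WriteLine($"SHA256: {sha256Hash}");
```

Identical content always yields identical digests, which is what makes these fields usable for change detection and deduplication downstream.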
The AtomBase class provides the aforementioned metadata, and several type-specific Atoms are returned from the various processors, including:
- BinaryAtom - includes a Bytes property
- DocxAtom - includes Text, HeaderLevel, UnorderedList, OrderedList, Table, and Binary properties
- ImageAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
- MarkdownAtom - includes Formatting, Text, UnorderedList, OrderedList, and Table properties
- PdfAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
- PptxAtom - includes Title, Subtitle, Text, UnorderedList, OrderedList, Table, and Binary properties
- TableAtom - includes Rows, Columns, Irregular, and Table properties
- TextAtom - includes Text
- XlsxAtom - includes SheetName, CellIdentifier, Text, Table, and Binary properties
Table objects inside of Atom objects are always presented as SerializableDataTable objects (see SerializableDataTable for more information) to provide simple serialization and conversion to native System.Data.DataTable objects.
Underlying Libraries
DocumentAtom is built on the shoulders of several libraries, without which, this work would not be possible.
- CsvHelper
- DocumentFormat.OpenXml
- HTML Agility Pack
- PdfPig
- RtfPipe
- SixLabors.ImageSharp
- Tabula
- Tesseract
Each of these libraries was integrated as a NuGet package, and no source was included or modified from these packages.
My libraries used within DocumentAtom:
Data Ingestion for RAG/AI Pipelines
The DocumentAtom.DataIngestion package provides a high-level API for processing documents and producing chunks ready for embedding and vector storage. It's designed to integrate seamlessly with RAG (Retrieval-Augmented Generation) applications and AI pipelines.
Basic Usage
using DocumentAtom.DataIngestion;
using DocumentAtom.DataIngestion.Processors;
// Create processor with RAG-optimized settings
AtomDocumentProcessorOptions options = AtomDocumentProcessorOptions.ForRag();
using AtomDocumentProcessor processor = new AtomDocumentProcessor(options);
// Process a document and get chunks
await foreach (IngestionChunk chunk in processor.ProcessAsync("document.pdf"))
{
    // Preview at most the first 100 characters; Substring(0, 100) would throw on shorter content
    string preview = chunk.Content.Length > 100 ? chunk.Content.Substring(0, 100) + "..." : chunk.Content;
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {preview}");
    // Access metadata for filtering
    if (chunk.Metadata.TryGetValue("atom:page_number", out object? page))
        Console.WriteLine($"  Page: {page}");
}
Dependency Injection
using DocumentAtom.DataIngestion.Extensions;
// In your service configuration
services.AddDocumentAtomIngestionForRag();
// Or with custom options
services.AddDocumentAtomIngestion(
    reader => {
        reader.EnableOcr = true;
        reader.BuildHierarchy = true;
    },
    chunker => {
        chunker.MaxChunkSize = 500;
        chunker.ChunkOverlap = 50;
    });
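To give a feel for how a maximum chunk size and an overlap interact, the following is a minimal sliding-window sketch. It is not the library's chunker: ChunkWithOverlap is a hypothetical helper, it measures sizes in characters (the unit used by MaxChunkSize and ChunkOverlap is not stated here), and it ignores the paragraph and header boundaries that the real chunker is described as preserving.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: split text into windows of at most maxChunkSize characters,
// where each window starts chunkOverlap characters before the previous one ended.
static List<string> ChunkWithOverlap(string text, int maxChunkSize, int chunkOverlap)
{
    if (maxChunkSize <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
    if (chunkOverlap < 0 || chunkOverlap >= maxChunkSize)
        throw new ArgumentOutOfRangeException(nameof(chunkOverlap));

    List<string> chunks = new List<string>();
    int step = maxChunkSize - chunkOverlap; // how far the window advances each time

    for (int start = 0; start < text.Length; start += step)
    {
        int length = Math.Min(maxChunkSize, text.Length - start);
        chunks.Add(text.Substring(start, length));

        // Stop once a window reaches the end of the text
        if (start + length >= text.Length) break;
    }

    return chunks;
}

// 1200 characters with the values used above (500 and 50)
string sample = new string('a', 1200);
List<string> result = ChunkWithOverlap(sample, 500, 50);
Console.WriteLine($"{result.Count} chunks, last has {result[result.Count - 1].Length} characters");
```

A larger overlap repeats more context between neighbouring chunks, at the cost of more chunks to embed and store.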
Key Features
- Automatic Type Detection: Automatically detects document type from content
- Intelligent Chunking: Preserves paragraph boundaries and header context
- Hierarchy-Aware: Maintains document structure in chunks for better retrieval
- Metadata Preservation: All atom metadata is preserved for rich filtering
- Duplicate Removal: Optional deduplication based on content hash
- Multiple Presets: Optimized configurations for RAG, summarization, and large context windows
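Deduplication based on a content hash can be sketched as follows. This is an illustration under assumptions, not the package's implementation: RemoveDuplicates is a hypothetical helper, and the choice of SHA256 over UTF-8 bytes is made for this sketch only.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: keep the first occurrence of each distinct chunk,
// comparing chunks by the SHA256 hash of their UTF-8 bytes.
static List<string> RemoveDuplicates(IEnumerable<string> chunks)
{
    HashSet<string> seenHashes = new HashSet<string>();
    List<string> unique = new List<string>();

    foreach (string chunk in chunks)
    {
        string hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(chunk)));

        // HashSet.Add returns false when the hash was already present
        if (seenHashes.Add(hash)) unique.Add(chunk);
    }

    return unique;
}

List<string> deduped = RemoveDuplicates(new[] { "alpha", "beta", "alpha" });
Console.WriteLine(string.Join(", ", deduped)); // prints "alpha, beta"
```

Hash comparison only catches exact duplicates; chunks that differ by whitespace or casing would need normalization first.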
Processing Options
| Method | Best For |
|---|---|
| AtomDocumentProcessorOptions.ForRag() | Vector database ingestion, semantic search |
| AtomDocumentProcessorOptions.ForSummarization() | Document summarization, analysis |
| AtomChunkerOptions.ForLargeContext() | Large context window models |
RESTful API and Docker
Run the DocumentAtom.Server project to start a RESTful server listening on localhost:8000. Modify the documentatom.json file to change the webserver, logging, or Tesseract settings. Alternatively, you can pull jchristn77/documentatom from Docker Hub. Refer to the Docker directory in the project for assets for running in Docker.
Refer to the Postman collection for examples exercising the APIs.
Running Locally
cd src/DocumentAtom.Server
dotnet run
Running with Docker
- Pull the image from Docker Hub:
docker pull jchristn77/documentatom:v1.1.0
- Create a documentatom.json configuration file (see Docker/documentatom.json for an example)
- Run the container:
# Windows
docker run -p 8000:8000 -v .\documentatom.json:/app/documentatom.json -v .\logs\:/app/logs/ jchristn77/documentatom:v1.1.0
# Linux/macOS
docker run -p 8000:8000 -v ./documentatom.json:/app/documentatom.json -v ./logs/:/app/logs/ jchristn77/documentatom:v1.1.0
Alternatively, use the provided scripts in the Docker directory:
# Windows
Dockerrun.bat v1.1.0
# Linux/macOS
IMG_TAG=v1.1.0 ./Dockerrun.sh
MCP Server and Docker
The DocumentAtom.McpServer project provides a Model Context Protocol (MCP) server that exposes DocumentAtom operations to AI assistants and LLM-based tools. The MCP server acts as a front-end to the DocumentAtom.Server RESTful API, enabling AI agents to process documents via standardized MCP tool calls.
The MCP server supports three transport protocols:
- HTTP: JSON-RPC over HTTP at /rpc (default port 8200)
- TCP: Raw TCP socket connection (default port 8201)
- WebSocket: WebSocket connection at /mcp (default port 8202)
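As a rough sketch of what a call over the HTTP transport could look like, the snippet below builds a JSON-RPC 2.0 request and, optionally, posts it to the /rpc path. Several things here are assumptions rather than documented behavior: tools/list is the method the MCP specification defines for enumerating tools, but whether this server answers it without a prior initialization exchange is not confirmed, and MCP_DEMO_URL, BuildRequest, and SendAsync are hypothetical names used only for this example.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Build a JSON-RPC 2.0 request body for the given method
static string BuildRequest(int id, string method) =>
    JsonSerializer.Serialize(new { jsonrpc = "2.0", id, method });

// POST the body as JSON and return the raw response text
static async Task<string> SendAsync(HttpClient client, string url, string body)
{
    using StringContent content = new StringContent(body, Encoding.UTF8, "application/json");
    using HttpResponseMessage response = await client.PostAsync(url, content);
    response.EnsureSuccessStatusCode();
    return await response.Content.ReadAsStringAsync();
}

string request = BuildRequest(1, "tools/list");
Console.WriteLine(request);

// Only contact a server when MCP_DEMO_URL is set, for example http://localhost:8200/rpc
// (MCP_DEMO_URL is a hypothetical variable for this sketch, not one the server reads)
string? demoUrl = Environment.GetEnvironmentVariable("MCP_DEMO_URL");
if (!string.IsNullOrEmpty(demoUrl))
{
    using HttpClient client = new HttpClient();
    Console.WriteLine(await SendAsync(client, demoUrl, request));
}
```

The TCP and WebSocket transports would presumably carry the same JSON-RPC payload over a different connection type.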
Prerequisites
The MCP server requires a running DocumentAtom.Server instance. Configure the endpoint in documentatom.json:
{
"DocumentAtom": {
"Endpoint": "http://localhost:8000",
"AccessKey": null
}
}
Running Locally
cd src/DocumentAtom.McpServer
dotnet run
Command-line options:
- --config=<file> - Specify settings file path (default: ./documentatom.json)
- --showconfig - Display configuration and exit
- --help, -h - Show help message
Running with Docker
- Pull the image from Docker Hub:
docker pull jchristn77/documentatom-mcp:v1.1.0
- Create a documentatom.json configuration file with MCP server settings:
{
"Logging": {
"LogDirectory": "./logs/",
"LogFilename": "documentatom-mcp.log",
"ConsoleLogging": true,
"EnableColors": true,
"MinimumSeverity": 0
},
"DocumentAtom": {
"Endpoint": "http://host.docker.internal:8000",
"AccessKey": null
},
"Http": {
"Hostname": "0.0.0.0",
"Port": 8200
},
"Tcp": {
"Address": "0.0.0.0",
"Port": 8201
},
"WebSocket": {
"Hostname": "0.0.0.0",
"Port": 8202
},
"Storage": {
"BackupsDirectory": "./backups/",
"TempDirectory": "./temp/"
}
}
- Run the container:
# Windows
docker run -p 8200:8200 -p 8201:8201 -p 8202:8202 ^
-v .\documentatom.json:/app/documentatom.json ^
-v .\logs\:/app/logs/ ^
-v .\temp\:/app/temp/ ^
-v .\backups\:/app/backups/ ^
jchristn77/documentatom-mcp:v1.1.0
# Linux/macOS
docker run -p 8200:8200 -p 8201:8201 -p 8202:8202 \
-v ./documentatom.json:/app/documentatom.json \
-v ./logs/:/app/logs/ \
-v ./temp/:/app/temp/ \
-v ./backups/:/app/backups/ \
jchristn77/documentatom-mcp:v1.1.0
Alternatively, use the provided scripts in src/DocumentAtom.McpServer:
# Windows
Dockerrun.bat v1.1.0
# Linux/macOS
IMG_TAG=v1.1.0 ./Dockerrun.sh
Environment Variables
The MCP server supports the following environment variables to override configuration:
| Variable | Description |
|---|---|
| DOCUMENTATOM_ENDPOINT | DocumentAtom server endpoint URL |
| DOCUMENTATOM_ACCESS_KEY | Access key for authentication |
| MCP_HTTP_HOSTNAME | HTTP server hostname |
| MCP_HTTP_PORT | HTTP server port |
| MCP_TCP_ADDRESS | TCP server address |
| MCP_TCP_PORT | TCP server port |
| MCP_WEBSOCKET_HOSTNAME | WebSocket server hostname |
| MCP_WEBSOCKET_PORT | WebSocket server port |
| CONSOLE_LOGGING | Enable console logging (1 or 0) |
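The general pattern of an environment variable overriding a configured value can be sketched like this. It illustrates the idea only: Resolve and ResolvePort are hypothetical helpers, and how the MCP server itself treats empty or malformed values is not described here.

```csharp
using System;

// Hypothetical helper: prefer the environment variable when it is set and non-empty,
// otherwise fall back to the value loaded from the configuration file.
static string Resolve(string variable, string configured)
{
    string? fromEnvironment = Environment.GetEnvironmentVariable(variable);
    return string.IsNullOrEmpty(fromEnvironment) ? configured : fromEnvironment;
}

// Hypothetical helper: same idea for ports; values that do not parse as an integer are ignored
static int ResolvePort(string variable, int configured)
{
    string? fromEnvironment = Environment.GetEnvironmentVariable(variable);
    return int.TryParse(fromEnvironment, out int port) ? port : configured;
}

// The fallback values mirror the sample configuration shown earlier
string endpoint = Resolve("DOCUMENTATOM_ENDPOINT", "http://localhost:8000");
int httpPort = ResolvePort("MCP_HTTP_PORT", 8200);

Console.WriteLine($"Endpoint: {endpoint}, HTTP port: {httpPort}");
```

With Docker, such variables would typically be supplied with the -e flag of docker run.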
Building Docker Images
To build the Docker images locally:
# Build DocumentAtom.Server image
cd Docker
Dockerbuild.bat v1.1.0 0 # 0 = don't push, 1 = push to Docker Hub
# Build DocumentAtom.McpServer image (from src directory)
cd src
docker buildx build -f DocumentAtom.McpServer/Dockerfile --platform linux/amd64,linux/arm64/v8 --tag jchristn77/documentatom-mcp:v1.1.0 --push .
Version History
Please refer to CHANGELOG.md for version history.
Thanks
Special thanks to iconduck.com and the content authors for producing this icon.
net10.0
- DocumentAtom (>= 1.1.0)
- DocumentAtom.Csv (>= 1.1.0)
- DocumentAtom.Excel (>= 1.1.0)
- DocumentAtom.Html (>= 1.1.1)
- DocumentAtom.Image (>= 1.1.0)
- DocumentAtom.Json (>= 1.1.0)
- DocumentAtom.Markdown (>= 1.1.0)
- DocumentAtom.Pdf (>= 1.0.36)
- DocumentAtom.PowerPoint (>= 1.1.1)
- DocumentAtom.RichText (>= 1.0.38)
- DocumentAtom.Text (>= 1.0.36)
- DocumentAtom.TypeDetection (>= 1.0.37)
- DocumentAtom.Word (>= 1.1.0)
- DocumentAtom.Xml (>= 1.1.0)
- Microsoft.Extensions.AI (>= 10.2.0)
- Microsoft.Extensions.AI.Abstractions (>= 10.2.0)
- Microsoft.Extensions.DependencyInjection.Abstractions (>= 10.0.2)
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.2)
- Microsoft.Extensions.Options (>= 10.0.2)
net8.0
- DocumentAtom (>= 1.1.0)
- DocumentAtom.Csv (>= 1.1.0)
- DocumentAtom.Excel (>= 1.1.0)
- DocumentAtom.Html (>= 1.1.1)
- DocumentAtom.Image (>= 1.1.0)
- DocumentAtom.Json (>= 1.1.0)
- DocumentAtom.Markdown (>= 1.1.0)
- DocumentAtom.Pdf (>= 1.0.36)
- DocumentAtom.PowerPoint (>= 1.1.1)
- DocumentAtom.RichText (>= 1.0.38)
- DocumentAtom.Text (>= 1.0.36)
- DocumentAtom.TypeDetection (>= 1.0.37)
- DocumentAtom.Word (>= 1.1.0)
- DocumentAtom.Xml (>= 1.1.0)
- Microsoft.Extensions.AI (>= 10.2.0)
- Microsoft.Extensions.AI.Abstractions (>= 10.2.0)
- Microsoft.Extensions.DependencyInjection.Abstractions (>= 10.0.2)
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.2)
- Microsoft.Extensions.Options (>= 10.0.2)
| Version | Downloads | Last Updated |
|---|---|---|
| 1.1.0 | 81 | 1/21/2026 |
Initial release