FluxCurator 0.5.0
FluxCurator
Clean, protect, and chunk your text for RAG pipelines — no dependencies required.
Overview
FluxCurator is a text preprocessing library for RAG (Retrieval-Augmented Generation) pipelines. It provides multilingual PII masking, content filtering, and intelligent text chunking with support for 14 languages and 13 countries' national IDs.
Zero Dependencies Philosophy: Core functionality (FluxCurator.Core) works standalone with no external dependencies. The main package (FluxCurator) adds optional LocalEmbedder integration for semantic chunking.
Features
- Text Refinement - Clean noisy text by removing blank lines, duplicates, empty list markers, and custom patterns
- Multilingual PII Masking - Auto-detect and mask emails, phones, national IDs, credit cards across 14 languages
- Content Filtering - Filter harmful content with customizable rules and blocklists
- Smart Chunking - Rule-based chunking (sentence, paragraph, token)
- Semantic Chunking - Embedding-based chunking for semantic boundaries
- Hierarchical Chunking - Document structure-aware chunking with parent-child relationships
- Multi-Language Support - 14 languages including Korean, English, Japanese, Chinese, Vietnamese, Thai
- National ID Validation - Checksum validation for 13 countries including SSN (US), RRN (Korea), Aadhaar (India), SIN (Canada)
- Streaming Support - Memory-efficient streaming chunk generation via ChunkStreamAsync
- Pipeline Processing - Combine filtering, masking, and chunking in one call
- Dependency Injection - Full DI support with IServiceCollection extensions
- FileFlux Integration - Seamless integration with FileFlux document processing
Installation
# Main package (includes LocalEmbedder for semantic chunking)
dotnet add package FluxCurator
# Core package only (zero dependencies)
dotnet add package FluxCurator.Core
Quick Start
Basic Chunking
using FluxCurator;
using FluxCurator.Core.Domain;
// Create curator with default options
var curator = new FluxCurator();
// Chunk text using sentence strategy
var chunks = await curator.ChunkAsync(text);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.Index + 1}/{chunk.TotalChunks}:");
    Console.WriteLine(chunk.Content);
    Console.WriteLine($"Tokens: ~{chunk.Metadata.EstimatedTokenCount}");
}
Streaming Chunks
// Memory-efficient streaming for large texts
var curator = new FluxCurator();
await foreach (var chunk in curator.ChunkStreamAsync(largeText))
{
    // Process chunks as they are generated
    Console.WriteLine($"Chunk {chunk.Index}: {chunk.Content.Length} chars");
    await ProcessChunkAsync(chunk);
}
Dependency Injection
// Program.cs or Startup.cs
services.AddFluxCurator(options =>
{
    options.DefaultChunkOptions = ChunkOptions.ForRAG;
    options.EnablePIIMasking = true;
    options.EnableContentFiltering = true;
});
// Or with LocalEmbedder for semantic chunking
services.AddFluxCuratorWithLocalEmbedder(options =>
{
    options.DefaultChunkOptions = new ChunkOptions
    {
        Strategy = ChunkingStrategy.Semantic,
        TargetChunkSize = 512
    };
});
Using IChunkerFactory
// Inject IChunkerFactory for flexible chunker creation
public class MyService
{
    private readonly IChunkerFactory _chunkerFactory;

    public MyService(IChunkerFactory chunkerFactory)
    {
        _chunkerFactory = chunkerFactory;
    }

    public async Task<IReadOnlyList<DocumentChunk>> ProcessAsync(string text)
    {
        // Create specific chunker
        var chunker = _chunkerFactory.CreateChunker(ChunkingStrategy.Hierarchical);
        return await chunker.ChunkAsync(text, ChunkOptions.Default);
    }
}
Text Refinement
// Clean noisy text before processing
var curator = new FluxCurator()
.WithTextRefinement(TextRefineOptions.Standard);
var result = await curator.PreprocessAsync(rawText);
// Pipeline: Refine → Filter → Mask → Chunk
// Use presets for specific content types
TextRefineOptions.Light // Minimal: empty list markers, trim, collapse blanks
TextRefineOptions.Standard // Default: + remove duplicates
TextRefineOptions.ForWebContent // Web-optimized: aggressive cleaning
TextRefineOptions.ForKorean // Korean: removes 댓글 sections, copyright
TextRefineOptions.ForPdfContent // PDF: removes page numbers
// Custom patterns
var options = new TextRefineOptions
{
    RemoveBlankLines = true,
    RemoveDuplicateLines = true,
    RemoveEmptyListItems = true, // Supports Korean markers: ㅇ, ○, ●, □, ■
    TrimLines = true,
    RemovePatterns = [@"^#\s*댓글\s*$", @"^\[광고\].*$"]
};
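To make the effect of these options concrete, here is a minimal standalone sketch of line-based refinement. It is not FluxCurator's implementation: `RefineSketch` and `Refine` are hypothetical names, the order of the steps is an assumption, and the empty-list-marker regex is only a guess at what "empty list item" covers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of FluxCurator.
public static class RefineSketch
{
    // A line consisting solely of a list marker (ASCII or Korean-style) counts as empty.
    private static readonly Regex EmptyListItem = new Regex(@"^[-*•ㅇ○●□■]\s*$");

    public static string Refine(string text, IEnumerable<string> removePatterns)
    {
        var patterns = removePatterns.Select(p => new Regex(p)).ToList();
        var seen = new HashSet<string>();
        var kept = new List<string>();

        foreach (var raw in text.Split('\n'))
        {
            var line = raw.Trim();                             // TrimLines
            if (line.Length == 0) continue;                    // RemoveBlankLines
            if (EmptyListItem.IsMatch(line)) continue;         // RemoveEmptyListItems
            if (patterns.Any(p => p.IsMatch(line))) continue;  // RemovePatterns
            if (!seen.Add(line)) continue;                     // RemoveDuplicateLines
            kept.Add(line);
        }
        return string.Join("\n", kept);
    }
}
```

Calling `RefineSketch.Refine(rawText, new[] { @"^#\s*댓글\s*$", @"^\[광고\].*$" })` drops comment-section headers and ad lines along with blanks, empty markers, and repeated lines. Each pattern is matched against one trimmed line at a time, so `^` and `$` anchor to that line.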
PII Masking
// Enable PII masking
var curator = new FluxCurator()
.WithPIIMasking();
// Mask PII in text
var result = curator.MaskPII("Contact: 010-1234-5678, Email: test@example.com");
Console.WriteLine(result.MaskedText);
// Output: "Contact: [PHONE], Email: [EMAIL]"
Multilingual National ID Detection
// Auto-detect PII for all supported languages
var curator = new FluxCurator()
.WithPIIMasking(PIIMaskingOptions.Default);
var result = curator.MaskPII("SSN: 123-45-6789, RRN: 901231-1234567");
// Output: "SSN: [NATIONAL_ID], RRN: [NATIONAL_ID]"
// Detect for specific language
var koreanCurator = new FluxCurator()
.WithPIIMasking(PIIMaskingOptions.ForLanguage("ko"));
var krResult = koreanCurator.MaskPII("주민등록번호: 901231-1234567");
// Output: "주민등록번호: [NATIONAL_ID]"
// Validates using Modulo-11 checksum algorithm
// Detect for multiple languages
var multiCurator = new FluxCurator()
.WithPIIMasking(PIIMaskingOptions.ForLanguages("en-US", "ko", "pt-BR"));
Hierarchical Chunking
var curator = new FluxCurator()
    .WithChunkingOptions(opt =>
    {
        opt.Strategy = ChunkingStrategy.Hierarchical;
        opt.MaxChunkSize = 1024;
    });
var chunks = await curator.ChunkAsync(markdownText);
foreach (var chunk in chunks)
{
    // Access hierarchy information
    var level = chunk.Metadata.Custom?["HierarchyLevel"];
    var parentId = chunk.Metadata.Custom?["ParentId"];
    var sectionPath = chunk.Location.SectionPath;
    Console.WriteLine($"[Level {level}] {sectionPath}");
    Console.WriteLine(chunk.Content);
}
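The idea behind a section path can be illustrated without the library. The sketch below walks Markdown headings and records, for each block of body text, the chain of headings that encloses it. `Section` and `HeadingPathSketch` are hypothetical names, the `" > "` separator is an assumption, and the parser ignores fenced code blocks, so treat it as an illustration rather than as how the hierarchical chunker works.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical type for illustration; not FluxCurator's DocumentChunk.
public sealed record Section(int Level, string Path, string Content);

public static class HeadingPathSketch
{
    private static readonly Regex Heading = new Regex(@"^(#{1,6})\s+(.+)$");

    public static List<Section> Parse(string markdown)
    {
        var open = new List<(int Level, string Title)>(); // headings enclosing the current position
        var sections = new List<Section>();
        var body = new StringBuilder();

        // Emits the text collected so far under the currently open headings.
        void Flush()
        {
            var content = body.ToString().Trim();
            if (content.Length > 0)
            {
                int depth = open.Count > 0 ? open[open.Count - 1].Level : 0;
                sections.Add(new Section(depth, string.Join(" > ", open.Select(h => h.Title)), content));
            }
            body.Clear();
        }

        foreach (var raw in markdown.Split('\n'))
        {
            var line = raw.TrimEnd('\r');
            var m = Heading.Match(line);
            if (!m.Success) { body.AppendLine(line); continue; }

            Flush(); // close the section that was being collected
            int level = m.Groups[1].Length;
            // A new heading closes every open heading at the same or a deeper level.
            while (open.Count > 0 && open[open.Count - 1].Level >= level)
                open.RemoveAt(open.Count - 1);
            open.Add((level, m.Groups[2].Value.Trim()));
        }
        Flush();
        return sections;
    }
}
```

In this sketch the parent of a section is simply the path with its last element removed, which is one way to derive a parent-child relationship from document structure.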
Full Pipeline Processing
// Complete preprocessing pipeline
var curator = new FluxCurator()
.WithTextRefinement(TextRefineOptions.Standard)
.WithContentFiltering()
.WithPIIMasking(PIIMaskingOptions.ForLanguages("en", "ko", "ja"))
.WithChunkingOptions(ChunkOptions.ForRAG);
// Process: Refine → Filter → Mask PII → Chunk
var result = await curator.PreprocessAsync(text);
Console.WriteLine(result.GetSummary());
// Output: "Produced 5 chunk(s). Filtered 2 content item(s). Masked 3 PII item(s)."
Semantic Chunking
// Pass an embedder explicitly here; when using DI, LocalEmbedder is auto-loaded
var curator = new FluxCurator()
    .UseEmbedder(myEmbedder)
    .WithChunkingOptions(opt =>
    {
        opt.Strategy = ChunkingStrategy.Semantic;
        opt.SemanticSimilarityThreshold = 0.5f;
    });
var chunks = await curator.ChunkAsync(text);
// Chunks at natural semantic boundaries
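As a rough illustration of what a similarity threshold does, the sketch below groups consecutive sentences and starts a new chunk whenever the cosine similarity between neighbouring sentence embeddings falls below the threshold. It assumes the embeddings are already computed; `SemanticSplitSketch` is a hypothetical name, and the library's actual boundary detection may differ (for example by also enforcing size limits).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of FluxCurator.
public static class SemanticSplitSketch
{
    public static double Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Starts a new chunk when similarity to the previous sentence drops below the threshold.
    public static List<List<string>> Split(
        IReadOnlyList<string> sentences, IReadOnlyList<float[]> embeddings, double threshold)
    {
        if (sentences.Count != embeddings.Count)
            throw new ArgumentException("One embedding per sentence is required.");

        var chunks = new List<List<string>>();
        for (int i = 0; i < sentences.Count; i++)
        {
            bool boundary = i == 0 || Cosine(embeddings[i - 1], embeddings[i]) < threshold;
            if (boundary) chunks.Add(new List<string>());
            chunks[chunks.Count - 1].Add(sentences[i]);
        }
        return chunks;
    }
}
```

With a threshold of 0.5, two sentences whose vectors point in nearly the same direction stay together, while a sentence whose vector is close to orthogonal to its predecessor opens a new chunk. Raising the threshold produces more, smaller chunks.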
Chunking Strategies
| Strategy | Description | Embedder Required | Best For |
|---|---|---|---|
| Auto | Automatically select best strategy | No | General use |
| Sentence | Split by sentence boundaries | No | Conversational text |
| Paragraph | Split by paragraph boundaries | No | Structured documents |
| Token | Split by token count | No | Consistent chunk sizes |
| Semantic | Split by semantic similarity | Yes | RAG applications |
| Hierarchical | Preserve document structure with parent-child relationships | No | Technical docs, Markdown |
Supported Languages
FluxCurator includes language profiles for accurate sentence detection and token estimation:
| Language | Code | Features |
|---|---|---|
| Korean | ko | 습니다체/해요체 endings, Korean sentence markers |
| English | en | Standard sentence boundaries |
| Japanese | ja | Japanese sentence endings (。、!?) |
| Chinese (Simplified) | zh | Chinese punctuation |
| Chinese (Traditional) | zh-TW | Traditional Chinese support |
| Spanish | es | Spanish punctuation |
| French | fr | French punctuation |
| German | de | German punctuation |
| Portuguese | pt | Portuguese punctuation |
| Russian | ru | Cyrillic support |
| Arabic | ar | RTL and Arabic punctuation |
| Hindi | hi | Devanagari script support |
| Vietnamese | vi | Latin with Vietnamese diacritics |
| Thai | th | Thai script (no word spaces) |
PII Types Supported
Global PII Types
| Type | Description | Validation |
|---|---|---|
| Email | Email addresses | TLD validation |
| Phone | Phone numbers (International) | E.164 format validation |
| CreditCard | Credit card numbers | Luhn algorithm |
| BankAccount | Bank account numbers | Format validation |
| IPAddress | IPv4 and IPv6 addresses | Format validation |
| URL | URLs and web addresses | Format validation |
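For readers unfamiliar with the Luhn algorithm named in the table, here is a minimal standalone sketch of the check. `LuhnSketch` is a hypothetical name, and skipping spaces and hyphens is an assumption about input formatting; this is the textbook algorithm, not FluxCurator's own validator.

```csharp
using System;

// Hypothetical helper for illustration only; not part of FluxCurator.
public static class LuhnSketch
{
    public static bool IsValid(string input)
    {
        int sum = 0, count = 0;
        bool doubleIt = false; // every second digit, counted from the right, is doubled

        for (int i = input.Length - 1; i >= 0; i--)
        {
            char c = input[i];
            if (c == ' ' || c == '-') continue;    // tolerate common separators
            if (c < '0' || c > '9') return false;  // anything else is not a card number

            int d = c - '0';
            if (doubleIt)
            {
                d *= 2;
                if (d > 9) d -= 9; // same as adding the two digits of the product
            }
            sum += d;
            doubleIt = !doubleIt;
            count++;
        }
        return count > 1 && sum % 10 == 0;
    }
}
```

The widely used test number `4111 1111 1111 1111` passes this check, and changing its last digit to 2 makes it fail. The same routine applies to any identifier validated with Luhn.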
National ID Types by Country
| Country | Language Code | ID Type | Validation |
|---|---|---|---|
| Korea | ko | Resident Registration Number (RRN) | Modulo-11 checksum |
| USA | en-US | Social Security Number (SSN) | Area/Group validation |
| UK | en-GB | National Insurance Number (NINO) | Prefix/Suffix validation |
| Japan | ja | My Number | Check digit validation |
| China | zh-CN | ID Card Number | ISO 7064 MOD 11-2 |
| Germany | de | Personalausweis / Steuer-ID | Check digit validation |
| France | fr | INSEE Number | Modulo-97 validation |
| Spain | es | DNI / NIE | Check letter validation |
| Brazil | pt-BR | CPF | Dual Modulo-11 |
| Italy | it | Codice Fiscale | Check character validation |
| India | hi | Aadhaar | Verhoeff checksum |
| Canada | en-CA | Social Insurance Number (SIN) | Luhn algorithm |
| Australia | en-AU | Tax File Number (TFN) | Weighted sum mod 11 |
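As one worked example of these checksums, the sketch below implements the commonly documented modulo-11 check for Korean RRNs: the first 12 digits are weighted 2 through 9 and then 2 through 5, and the check digit is (11 - sum mod 11) mod 10. `RrnSketch` is a hypothetical name, and whether FluxCurator uses exactly this formula is an assumption.

```csharp
using System;

// Hypothetical helper for illustration only; not part of FluxCurator.
public static class RrnSketch
{
    private static readonly int[] Weights = { 2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5 };

    public static bool HasValidChecksum(string rrn)
    {
        var digits = rrn.Replace("-", "");
        if (digits.Length != 13) return false;
        foreach (char c in digits)
            if (c < '0' || c > '9') return false;

        int sum = 0;
        for (int i = 0; i < 12; i++)
            sum += (digits[i] - '0') * Weights[i];

        int expected = (11 - sum % 11) % 10; // check digit implied by the first 12 digits
        return expected == digits[12] - '0';
    }
}
```

For the digits `901231-123456` the weighted sum is 151, so the expected check digit is (11 - 151 mod 11) mod 10 = 3. The number `901231-1234563` therefore passes, and the same prefix with any other final digit fails. A detector can still flag a pattern match that fails the checksum at lower confidence, as the custom detector example in this document does for Aadhaar.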
Configuration Options
ChunkOptions
var options = new ChunkOptions
{
    Strategy = ChunkingStrategy.Sentence,
    TargetChunkSize = 512,
    MinChunkSize = 100,
    MaxChunkSize = 1024,
    OverlapSize = 50,
    LanguageCode = "ko", // null = auto-detect
    PreserveSentences = true,
    PreserveParagraphs = true,
    SemanticSimilarityThreshold = 0.5f
};
// Preset configurations
ChunkOptions.Default // General purpose
ChunkOptions.ForRAG // Optimized for RAG (512 target, semantic)
ChunkOptions.FixedSize(256, 32) // Fixed token size with overlap
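To show how a fixed size and an overlap interact, here is a minimal standalone sketch of a sliding window over tokens. It uses whitespace-separated words as a stand-in for real tokens, which is an assumption, and `FixedWindowSketch` is a hypothetical name rather than anything in FluxCurator.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of FluxCurator.
public static class FixedWindowSketch
{
    // Splits text into windows of `size` tokens; consecutive windows share `overlap` tokens.
    public static List<string> Chunk(string text, int size, int overlap)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        if (overlap < 0 || overlap >= size) throw new ArgumentOutOfRangeException(nameof(overlap));

        var tokens = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();
        int step = size - overlap; // how far the window advances each time

        for (int start = 0; start < tokens.Length; start += step)
        {
            int length = Math.Min(size, tokens.Length - start);
            chunks.Add(string.Join(" ", tokens, start, length));
            if (start + length >= tokens.Length) break; // the last window reached the end
        }
        return chunks;
    }
}
```

With a size of 4 and an overlap of 1, the text `a b c d e f g` yields `a b c d` and `d e f g`: the window advances by 3 tokens, so the last token of one chunk is repeated at the start of the next. Larger overlaps preserve more context across boundaries at the cost of more chunks.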
Masking Strategies
| Strategy | Example Output |
|---|---|
| Token | [EMAIL], [PHONE] |
| Asterisk | ****@****.com |
| Redact | [REDACTED] |
| Partial | jo**@ex****.com |
| Hash | [HASH:a1b2c3d4] |
| Remove | (empty) |
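The differences between the strategies are easiest to see side by side. The sketch below applies a simplified version of each one to a single detected value. `MaskStyle` and `MaskSketch` are hypothetical names, and the rules are deliberately generic: this Asterisk and Partial masking do not preserve the `@` or the domain suffix the way the table's examples do, and the hash format (first four bytes of SHA-256, in hex) is an assumption.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical enum mirroring the strategy names in the table; not FluxCurator's own type.
public enum MaskStyle { Token, Asterisk, Redact, Partial, Hash, Remove }

public static class MaskSketch
{
    public static string Apply(string value, string label, MaskStyle style)
    {
        switch (style)
        {
            case MaskStyle.Token:
                return "[" + label + "]";                      // type-specific placeholder
            case MaskStyle.Asterisk:
                return new string('*', value.Length);          // hide every character
            case MaskStyle.Redact:
                return "[REDACTED]";                           // same placeholder for all types
            case MaskStyle.Partial:
                int keep = Math.Min(2, value.Length);          // keep a short readable prefix
                return value.Substring(0, keep) + new string('*', value.Length - keep);
            case MaskStyle.Hash:
                using (var sha = SHA256.Create())
                {
                    var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(value));
                    var hex = BitConverter.ToString(bytes, 0, 4).Replace("-", "").ToLowerInvariant();
                    return "[HASH:" + hex + "]";               // stable pseudonym for the same input
                }
            case MaskStyle.Remove:
                return string.Empty;
            default:
                throw new ArgumentOutOfRangeException(nameof(style));
        }
    }
}
```

Token and Redact discard the value entirely, Partial trades privacy for readability, and Hash keeps equal inputs linkable without revealing them. That trade-off is usually what decides which strategy fits a given pipeline.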
Extensibility
FluxCurator is designed for extensibility. You can add custom PII detectors for your specific needs.
Custom PII Detector
Implement IPIIDetector or extend PIIDetectorBase for pattern-based detection:
using FluxCurator.Core.Core;
using FluxCurator.Core.Domain;
using FluxCurator.Core.Infrastructure.PII;
public class EmployeeIdDetector : PIIDetectorBase
{
    public override PIIType PIIType => PIIType.Custom;
    public override string Name => "Employee ID Detector";

    // Pattern: EMP-123456
    protected override string Pattern => @"EMP-\d{6}";

    protected override bool ValidateMatch(string value, out float confidence)
    {
        confidence = 0.95f;
        return true;
    }
}
// Register and use via PIIMasker
var masker = new PIIMasker(PIIMaskingOptions.Default);
masker.RegisterDetector(new EmployeeIdDetector());
var result = masker.Mask("Contact employee EMP-123456 for details.");
// Output: "Contact employee [PII] for details."
// Or register directly via FluxCurator
var curator = new FluxCurator()
.WithPIIMasking()
.RegisterPIIDetector(new EmployeeIdDetector());
var curatorResult = curator.MaskPII("Contact employee EMP-123456 for details.");
// Output: "Contact employee [PII] for details."
Custom National ID Detector
Extend NationalIdDetectorBase to add support for additional countries:
using System.Linq; // needed for normalized.All(...)
using FluxCurator.Core.Core;
using FluxCurator.Core.Infrastructure.PII.NationalId;

public class IndiaAadhaarDetector : NationalIdDetectorBase
{
    public override string LanguageCode => "hi";
    public override string NationalIdType => "Aadhaar";
    public override string FormatDescription => "12 digits with optional spaces";
    public override string CountryName => "India";
    public override string Name => "India Aadhaar Detector";

    // Pattern: 1234 5678 9012 or 123456789012
    protected override string Pattern => @"\d{4}\s?\d{4}\s?\d{4}";

    protected override bool ValidateMatch(string value, out float confidence)
    {
        var normalized = NormalizeValue(value);
        if (normalized.Length != 12 || !normalized.All(char.IsDigit))
        {
            confidence = 0.0f;
            return false;
        }
        // Implement Verhoeff checksum validation
        if (!ValidateVerhoeffChecksum(normalized))
        {
            confidence = 0.6f;
            return true; // Still flag as PII, but with lower confidence
        }
        confidence = 0.98f;
        return true;
    }

    private static bool ValidateVerhoeffChecksum(string number)
    {
        // Verhoeff algorithm implementation
        // ...
        return true;
    }
}

// Register with the national ID registry
var registry = new NationalIdRegistry();
registry.Register(new IndiaAadhaarDetector());
var masker = new PIIMasker(
    PIIMaskingOptions.ForLanguage("hi"),
    registry);
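The `ValidateVerhoeffChecksum` placeholder above leaves the algorithm itself out. For reference, here is a standalone sketch of the standard Verhoeff check using the usual multiplication and permutation tables. `VerhoeffSketch` is a hypothetical name, and this is the textbook algorithm rather than FluxCurator's own code.

```csharp
using System;

// Hypothetical helper for illustration only; not part of FluxCurator.
public static class VerhoeffSketch
{
    // Multiplication table of the dihedral group D5.
    private static readonly int[,] D =
    {
        {0,1,2,3,4,5,6,7,8,9}, {1,2,3,4,0,6,7,8,9,5}, {2,3,4,0,1,7,8,9,5,6},
        {3,4,0,1,2,8,9,5,6,7}, {4,0,1,2,3,9,5,6,7,8}, {5,9,8,7,6,0,4,3,2,1},
        {6,5,9,8,7,1,0,4,3,2}, {7,6,5,9,8,2,1,0,4,3}, {8,7,6,5,9,3,2,1,0,4},
        {9,8,7,6,5,4,3,2,1,0}
    };

    // Position-dependent permutation table (period 8).
    private static readonly int[,] P =
    {
        {0,1,2,3,4,5,6,7,8,9}, {1,5,7,6,2,8,3,0,9,4}, {5,8,0,3,7,9,6,1,4,2},
        {8,9,1,6,0,4,3,5,2,7}, {9,4,5,3,1,2,6,8,7,0}, {4,2,8,6,5,7,3,9,0,1},
        {2,7,9,3,8,0,6,4,1,5}, {7,0,4,6,9,1,3,2,5,8}
    };

    // Returns true when the digit string, including its trailing check digit, is valid.
    public static bool IsValid(string number)
    {
        if (string.IsNullOrEmpty(number)) return false;

        int c = 0;
        for (int i = 0; i < number.Length; i++)
        {
            char ch = number[number.Length - 1 - i]; // process digits right to left
            if (ch < '0' || ch > '9') return false;
            c = D[c, P[i % 8, ch - '0']];
        }
        return c == 0;
    }
}
```

The standard worked example for this algorithm is `236` with check digit `3`, so `2363` validates and `2364` does not. A detector would call a routine like this on the normalized 12-digit value.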
Dependency Injection with Custom Detectors
// Register custom registry with additional detectors
services.AddSingleton<INationalIdRegistry>(sp =>
{
    var registry = new NationalIdRegistry();
    registry.Register(new IndiaAadhaarDetector());
    registry.Register(new CanadaSINDetector());
    registry.Register(new AustraliaTFNDetector());
    return registry;
});
// Register PIIMasker with custom registry
services.AddScoped<IPIIMasker>(sp =>
{
    var registry = sp.GetRequiredService<INationalIdRegistry>();
    var options = PIIMaskingOptions.ForLanguages("en", "hi");
    return new PIIMasker(options, registry);
});
Extension Points Summary
| Interface | Base Class | Purpose |
|---|---|---|
| IPIIDetector | PIIDetectorBase | General PII detection (email, phone, custom) |
| INationalIdDetector | NationalIdDetectorBase | Country-specific national ID detection |
| INationalIdRegistry | NationalIdRegistry | Manage and lookup national ID detectors |
| IPIIMasker | PIIMasker | Coordinate detection and masking |
Integration with Iyulab Ecosystem
FluxCurator is part of the Iyulab open-source RAG ecosystem:
┌─────────────────────────────────────────────────────────────┐
│                      Foundation Layer                       │
├─────────────────────────────────────────────────────────────┤
│ LocalEmbedder  LocalReranker  FluxCurator    FluxImprover   │
│ (Embeddings)   (Reranking)    (Chunking)     (LLM-based)    │
└───────────┬───────────────────────────┬─────────────────────┘
            │                           │
            ▼                           ▼
┌─────────────────────────────────────────────────────────────┐
│                      Processing Layer                       │
├─────────────────────────────────────────────────────────────┤
│  FileFlux (Document Processing)          WebFlux (Web)      │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Storage Layer                        │
├─────────────────────────────────────────────────────────────┤
│                    FluxIndex (Vector DB)                    │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Application Layer                      │
├─────────────────────────────────────────────────────────────┤
│                         Filer (App)                         │
└─────────────────────────────────────────────────────────────┘
FileFlux Integration
using FileFlux.Infrastructure.Strategies;
using FileFlux.Infrastructure.Adapters;
// Use FluxCurator chunking in FileFlux
var chunkerFactory = new ChunkerFactory(embedder);
var strategy = new FluxCuratorChunkingStrategy(
    chunkerFactory,
    ChunkingStrategy.Hierarchical);
var chunks = await strategy.ChunkAsync(documentContent, options);
// Convert between chunk types
var fileFluxChunks = fluxCuratorChunks.ToFileFluxChunks();
var curatorChunks = fileFluxChunks.ToFluxCuratorChunks();
Project Structure
FluxCurator/
├── src/
│   ├── FluxCurator.Core/              # Zero-dependency core
│   │   ├── Core/                      # Interfaces
│   │   │   ├── IChunker.cs
│   │   │   ├── IChunkerFactory.cs
│   │   │   ├── IEmbedder.cs
│   │   │   └── ILanguageProfile.cs
│   │   ├── Domain/                    # Models
│   │   │   ├── ChunkOptions.cs
│   │   │   ├── DocumentChunk.cs
│   │   │   ├── ChunkingStrategy.cs
│   │   │   └── PIIMaskingOptions.cs
│   │   └── Infrastructure/            # Implementations
│   │       ├── Chunking/
│   │       │   ├── ChunkerBase.cs
│   │       │   ├── SentenceChunker.cs
│   │       │   ├── ParagraphChunker.cs
│   │       │   ├── TokenChunker.cs
│   │       │   ├── SemanticChunker.cs
│   │       │   └── HierarchicalChunker.cs
│   │       └── Languages/
│   │           ├── LanguageProfileRegistry.cs
│   │           ├── KoreanLanguageProfile.cs
│   │           └── EnglishLanguageProfile.cs
│   │
│   └── FluxCurator/                   # Main package
│       ├── Infrastructure/
│       │   └── Chunking/
│       │       └── ChunkerFactory.cs  # Factory with all strategies
│       ├── ServiceCollectionExtensions.cs
│       └── FluxCurator.cs             # Main API
│
└── docs/                              # Documentation
    ├── getting-started.md
    ├── chunking-strategies.md
    ├── di-integration.md
    └── fileflux-integration.md
Documentation
- Getting Started - Installation and basic usage
- Chunking Strategies - Detailed guide for each strategy
- Dependency Injection - DI configuration and patterns
- FileFlux Integration - Integration with FileFlux
Roadmap
- Core chunking strategies (Sentence, Paragraph, Token)
- 11 language profiles for text processing
- Language detection
- Batch processing
- Multilingual PII masking (10 countries)
- Content filtering
- Semantic chunking
- Hierarchical chunking
- Dependency Injection support
- FileFlux integration
- Text refinement with Korean support
- Additional national ID detectors (India Aadhaar, Canada SIN, Australia TFN)
- Additional language profiles (Vietnamese, Thai)
- Custom detector registration via RegisterPIIDetector
- Streaming chunk support via ChunkStreamAsync
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE for details.
Part of the Iyulab Open Source Ecosystem
| Product | Versions (compatible and additional computed target frameworks) |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net10.0
  - FluxCurator.Core (>= 0.5.0)
  - LocalEmbedder (>= 0.3.1)
  - Microsoft.Extensions.DependencyInjection.Abstractions (>= 9.0.0)