FluxCurator 0.5.0

FluxCurator

Clean, protect, and chunk your text for RAG pipelines — no dependencies required.

Overview

FluxCurator is a text preprocessing library for RAG (Retrieval-Augmented Generation) pipelines. It provides multilingual PII masking, content filtering, and intelligent text chunking with support for 14 languages and 13 countries' national IDs.

Zero Dependencies Philosophy: Core functionality (FluxCurator.Core) works standalone with no external dependencies. The main package (FluxCurator) adds optional LocalEmbedder integration for semantic chunking.

Features

  • Text Refinement - Clean noisy text by removing blank lines, duplicates, empty list markers, and custom patterns
  • Multilingual PII Masking - Auto-detect and mask emails, phones, national IDs, credit cards across 14 languages
  • Content Filtering - Filter harmful content with customizable rules and blocklists
  • Smart Chunking - Rule-based chunking (sentence, paragraph, token)
  • Semantic Chunking - Embedding-based chunking for semantic boundaries
  • Hierarchical Chunking - Document structure-aware chunking with parent-child relationships
  • Multi-Language Support - 14 languages including Korean, English, Japanese, Chinese, Vietnamese, Thai
  • National ID Validation - Checksum validation for 13 countries including SSN (US), RRN (Korea), Aadhaar (India), SIN (Canada)
  • Streaming Support - Memory-efficient streaming chunk generation via ChunkStreamAsync
  • Pipeline Processing - Combine filtering, masking, and chunking in one call
  • Dependency Injection - Full DI support with IServiceCollection extensions
  • FileFlux Integration - Seamless integration with FileFlux document processing

Installation

# Main package (includes LocalEmbedder for semantic chunking)
dotnet add package FluxCurator

# Core package only (zero dependencies)
dotnet add package FluxCurator.Core

Quick Start

Basic Chunking

using FluxCurator;
using FluxCurator.Core.Domain;

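// Create curator with default options
var curator = new FluxCurator();

// Sample input; replace with your own document text
var text = "FluxCurator prepares text for RAG pipelines. It cleans, masks, and chunks text.";

// Chunk text using sentence strategy
var chunks = await curator.ChunkAsync(text);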

foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.Index + 1}/{chunk.TotalChunks}:");
    Console.WriteLine(chunk.Content);
    Console.WriteLine($"Tokens: ~{chunk.Metadata.EstimatedTokenCount}");
}

Streaming Chunks

// Memory-efficient streaming for large texts
var curator = new FluxCurator();

await foreach (var chunk in curator.ChunkStreamAsync(largeText))
{
    // Process chunks as they are generated
    Console.WriteLine($"Chunk {chunk.Index}: {chunk.Content.Length} chars");
    await ProcessChunkAsync(chunk);
}
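
To illustrate why streaming keeps memory use low, here is a minimal standalone sketch of an async iterator that yields paragraphs as they are read. It is not FluxCurator's implementation: `StreamParagraphsAsync` is a hypothetical helper, and the only thing it shares with `ChunkStreamAsync` is the standard `IAsyncEnumerable<T>` / `await foreach` pattern used above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper (not a FluxCurator API): yields one paragraph at a time,
// so only the paragraph currently being assembled is held in memory.
async IAsyncEnumerable<string> StreamParagraphsAsync(TextReader reader)
{
    var current = new StringBuilder();
    var line = await reader.ReadLineAsync();
    while (line != null)
    {
        if (line.Trim().Length == 0)
        {
            // A blank line closes the paragraph collected so far.
            if (current.Length > 0)
            {
                yield return current.ToString();
                current.Clear();
            }
        }
        else
        {
            if (current.Length > 0) current.Append(' ');
            current.Append(line.Trim());
        }
        line = await reader.ReadLineAsync();
    }

    // Emit the trailing paragraph when the input does not end with a blank line.
    if (current.Length > 0) yield return current.ToString();
}

var input = new StringReader("First line\nsecond line\n\nNext paragraph");
await foreach (var paragraph in StreamParagraphsAsync(input))
{
    Console.WriteLine(paragraph);
}
```

Because each item is produced on demand, the consumer can start processing before the whole input has been read.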

Dependency Injection

// Program.cs or Startup.cs
services.AddFluxCurator(options =>
{
    options.DefaultChunkOptions = ChunkOptions.ForRAG;
    options.EnablePIIMasking = true;
    options.EnableContentFiltering = true;
});

// Or with LocalEmbedder for semantic chunking
services.AddFluxCuratorWithLocalEmbedder(options =>
{
    options.DefaultChunkOptions = new ChunkOptions
    {
        Strategy = ChunkingStrategy.Semantic,
        TargetChunkSize = 512
    };
});

Using IChunkerFactory

// Inject IChunkerFactory for flexible chunker creation
public class MyService
{
    private readonly IChunkerFactory _chunkerFactory;

    public MyService(IChunkerFactory chunkerFactory)
    {
        _chunkerFactory = chunkerFactory;
    }

    public async Task<IReadOnlyList<DocumentChunk>> ProcessAsync(string text)
    {
        // Create specific chunker
        var chunker = _chunkerFactory.CreateChunker(ChunkingStrategy.Hierarchical);
        return await chunker.ChunkAsync(text, ChunkOptions.Default);
    }
}

Text Refinement

// Clean noisy text before processing
var curator = new FluxCurator()
    .WithTextRefinement(TextRefineOptions.Standard);

var result = await curator.PreprocessAsync(rawText);
// Pipeline: Refine → Filter → Mask → Chunk

// Use presets for specific content types
TextRefineOptions.Light        // Minimal: empty list markers, trim, collapse blanks
TextRefineOptions.Standard     // Default: + remove duplicates
TextRefineOptions.ForWebContent  // Web-optimized: aggressive cleaning
TextRefineOptions.ForKorean    // Korean: removes 댓글 sections, copyright
TextRefineOptions.ForPdfContent  // PDF: removes page numbers

// Custom patterns
var options = new TextRefineOptions
{
    RemoveBlankLines = true,
    RemoveDuplicateLines = true,
    RemoveEmptyListItems = true,  // Supports Korean markers: ㅇ, ○, ●, □, ■
    TrimLines = true,
    RemovePatterns = [@"^#\s*댓글\s*$", @"^\[광고\].*$"]
};
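
The options above map naturally onto a line-by-line filter. The following is a minimal standalone sketch of that idea, assuming each option acts on individual lines; `RefineText` is a hypothetical helper written for illustration and is not how FluxCurator itself is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not a FluxCurator API): trims lines, then drops blank lines,
// lines matching any removal pattern, and lines that were already seen.
string RefineText(string text, IEnumerable<string> removePatterns)
{
    var regexes = removePatterns.Select(p => new Regex(p)).ToList();
    var seen = new HashSet<string>();
    var kept = new List<string>();

    foreach (var rawLine in text.Split('\n'))
    {
        var line = rawLine.Trim();                        // like TrimLines
        if (line.Length == 0) continue;                   // like RemoveBlankLines
        if (regexes.Any(r => r.IsMatch(line))) continue;  // like RemovePatterns
        if (!seen.Add(line)) continue;                    // like RemoveDuplicateLines
        kept.Add(line);
    }
    return string.Join("\n", kept);
}

var raw = "Title\n\n  Body text  \n[광고] Buy now\nBody text\n# 댓글\n";
Console.WriteLine(RefineText(raw, new[] { @"^#\s*댓글\s*$", @"^\[광고\].*$" }));
// Title
// Body text
```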

PII Masking

// Enable PII masking
var curator = new FluxCurator()
    .WithPIIMasking();

// Mask PII in text
var result = curator.MaskPII("Contact: 010-1234-5678, Email: test@example.com");
Console.WriteLine(result.MaskedText);
// Output: "Contact: [PHONE], Email: [EMAIL]"

Multilingual National ID Detection

// Auto-detect PII for all supported languages
var curator = new FluxCurator()
    .WithPIIMasking(PIIMaskingOptions.Default);

var result = curator.MaskPII("SSN: 123-45-6789, RRN: 901231-1234567");
// Output: "SSN: [NATIONAL_ID], RRN: [NATIONAL_ID]"

// Detect for specific language
var koreanCurator = new FluxCurator()
    .WithPIIMasking(PIIMaskingOptions.ForLanguage("ko"));

var krResult = koreanCurator.MaskPII("주민등록번호: 901231-1234567");
// Output: "주민등록번호: [NATIONAL_ID]"
// Validates using Modulo-11 checksum algorithm

// Detect for multiple languages
var multiCurator = new FluxCurator()
    .WithPIIMasking(PIIMaskingOptions.ForLanguages("en-US", "ko", "pt-BR"));

Hierarchical Chunking

var curator = new FluxCurator()
    .WithChunkingOptions(opt =>
    {
        opt.Strategy = ChunkingStrategy.Hierarchical;
        opt.MaxChunkSize = 1024;
    });

var chunks = await curator.ChunkAsync(markdownText);

foreach (var chunk in chunks)
{
    // Access hierarchy information
    var level = chunk.Metadata.Custom?["HierarchyLevel"];
    var parentId = chunk.Metadata.Custom?["ParentId"];
    var sectionPath = chunk.Location.SectionPath;

    Console.WriteLine($"[Level {level}] {sectionPath}");
    Console.WriteLine(chunk.Content);
}
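
As a rough illustration of how a hierarchy level and section path can be derived from Markdown headings, here is a minimal standalone sketch. It is an assumption-laden simplification, not FluxCurator's HierarchicalChunker: `SplitByHeadings` is a hypothetical helper, it treats every line starting with one to six `#` characters and a space as a heading, and it does not special-case headings inside fenced code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: one entry per Markdown section, tagged with its
// heading level and an "A > B" style section path.
List<(int Level, string Path, string Content)> SplitByHeadings(string markdown)
{
    var sections = new List<(int Level, string Path, string Content)>();
    var ancestors = new List<(int Level, string Title)>();
    var body = new List<string>();

    void Flush()
    {
        var content = string.Join("\n", body).Trim();
        body.Clear();
        if (content.Length == 0) return;
        var level = ancestors.Count > 0 ? ancestors[^1].Level : 0;
        sections.Add((level, string.Join(" > ", ancestors.Select(a => a.Title)), content));
    }

    foreach (var line in markdown.Split('\n'))
    {
        var hashes = line.TakeWhile(c => c == '#').Count();
        var isHeading = hashes >= 1 && hashes <= 6 && line.Length > hashes && line[hashes] == ' ';
        if (!isHeading)
        {
            body.Add(line);
            continue;
        }

        Flush();
        // Close sections at the same or a deeper level; what remains are the parents.
        while (ancestors.Count > 0 && ancestors[^1].Level >= hashes)
            ancestors.RemoveAt(ancestors.Count - 1);
        ancestors.Add((hashes, line.Substring(hashes).Trim()));
    }
    Flush();
    return sections;
}

var sample = "# Guide\nIntro\n## Install\nRun the installer.\n# FAQ\nNone yet.";
foreach (var section in SplitByHeadings(sample))
    Console.WriteLine($"[Level {section.Level}] {section.Path}: {section.Content}");
```

The parent of a section is simply the previous entry on the ancestor stack, which is one way to obtain the parent-child relationships described above.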

Full Pipeline Processing

// Complete preprocessing pipeline
var curator = new FluxCurator()
    .WithTextRefinement(TextRefineOptions.Standard)
    .WithContentFiltering()
    .WithPIIMasking(PIIMaskingOptions.ForLanguages("en", "ko", "ja"))
    .WithChunkingOptions(ChunkOptions.ForRAG);

// Process: Refine → Filter → Mask PII → Chunk
var result = await curator.PreprocessAsync(text);

Console.WriteLine(result.GetSummary());
// Output: "Produced 5 chunk(s). Filtered 2 content item(s). Masked 3 PII item(s)."

Semantic Chunking

// Pass an embedder explicitly (myEmbedder is your own embedder instance).
// With DI, AddFluxCuratorWithLocalEmbedder registers LocalEmbedder instead.
var curator = new FluxCurator()
    .UseEmbedder(myEmbedder)
    .WithChunkingOptions(opt =>
    {
        opt.Strategy = ChunkingStrategy.Semantic;
        opt.SemanticSimilarityThreshold = 0.5f;
    });

var chunks = await curator.ChunkAsync(text);
// Chunks at natural semantic boundaries
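
One common way to use a similarity threshold is to start a new chunk whenever adjacent sentences are less similar than the threshold. The sketch below shows that technique with cosine similarity on precomputed vectors. It is an illustration under that assumption only: `CosineSimilarity` and `GroupBySimilarity` are hypothetical helpers, the two-dimensional vectors are made up, and FluxCurator's actual boundary logic may differ.

```csharp
using System;
using System.Collections.Generic;

// Cosine similarity of two equal-length vectors.
float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0f, normA = 0f, normB = 0f;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

// Hypothetical helper: opens a new group when a sentence is less similar
// to its predecessor than the threshold.
List<List<string>> GroupBySimilarity(
    IReadOnlyList<string> sentences, IReadOnlyList<float[]> embeddings, float threshold)
{
    var chunks = new List<List<string>>();
    for (var i = 0; i < sentences.Count; i++)
    {
        if (i == 0 || CosineSimilarity(embeddings[i - 1], embeddings[i]) < threshold)
            chunks.Add(new List<string>());
        chunks[^1].Add(sentences[i]);
    }
    return chunks;
}

var sampleSentences = new[] { "Cats purr.", "Kittens purr too.", "Stocks fell today." };
var sampleEmbeddings = new[] { new[] { 1f, 0f }, new[] { 0.9f, 0.1f }, new[] { 0f, 1f } };
foreach (var group in GroupBySimilarity(sampleSentences, sampleEmbeddings, 0.5f))
    Console.WriteLine(string.Join(" ", group));
// Cats purr. Kittens purr too.
// Stocks fell today.
```

In this sketch, a higher threshold produces more, smaller groups, and a lower threshold merges more sentences together.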

Chunking Strategies

| Strategy | Description | Embedder Required | Best For |
|---|---|---|---|
| Auto | Automatically select best strategy | No | General use |
| Sentence | Split by sentence boundaries | No | Conversational text |
| Paragraph | Split by paragraph boundaries | No | Structured documents |
| Token | Split by token count | No | Consistent chunk sizes |
| Semantic | Split by semantic similarity | Yes | RAG applications |
| Hierarchical | Preserve document structure with parent-child relationships | No | Technical docs, Markdown |

Supported Languages

FluxCurator includes language profiles for accurate sentence detection and token estimation:

| Language | Code | Features |
|---|---|---|
| Korean | ko | 습니다체/해요체 endings, Korean sentence markers |
| English | en | Standard sentence boundaries |
| Japanese | ja | Japanese sentence endings (。、!?) |
| Chinese (Simplified) | zh | Chinese punctuation |
| Chinese (Traditional) | zh-TW | Traditional Chinese support |
| Spanish | es | Spanish punctuation |
| French | fr | French punctuation |
| German | de | German punctuation |
| Portuguese | pt | Portuguese punctuation |
| Russian | ru | Cyrillic support |
| Arabic | ar | RTL and Arabic punctuation |
| Hindi | hi | Devanagari script support |
| Vietnamese | vi | Latin with Vietnamese diacritics |
| Thai | th | Thai script (no word spaces) |

PII Types Supported

Global PII Types

| Type | Description | Validation |
|---|---|---|
| Email | Email addresses | TLD validation |
| Phone | Phone numbers (International) | E.164 format validation |
| CreditCard | Credit card numbers | Luhn algorithm |
| BankAccount | Bank account numbers | Format validation |
| IPAddress | IPv4 and IPv6 addresses | Format validation |
| URL | URLs and web addresses | Format validation |
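
For reference, the Luhn algorithm named in the table can be sketched in a few lines. This is a minimal standalone version for illustration, not FluxCurator's detector: `IsValidLuhn` is a hypothetical helper that ignores every character except ASCII digits.

```csharp
using System;
using System.Linq;

// Hypothetical helper: returns true when the digits in the input pass the Luhn check.
bool IsValidLuhn(string value)
{
    // Keep ASCII digits only, so spaces and dashes between groups are ignored.
    var digits = value.Where(c => c >= '0' && c <= '9').Select(c => c - '0').ToArray();
    if (digits.Length < 2) return false;

    var sum = 0;
    for (var i = 0; i < digits.Length; i++)
    {
        var digit = digits[digits.Length - 1 - i];  // walk right to left
        if (i % 2 == 1)
        {
            digit *= 2;                 // double every second digit
            if (digit > 9) digit -= 9;  // same as adding the two digits of the product
        }
        sum += digit;
    }
    return sum % 10 == 0;
}

Console.WriteLine(IsValidLuhn("4111 1111 1111 1111")); // True (a widely used test number)
Console.WriteLine(IsValidLuhn("4111 1111 1111 1112")); // False
```

A checksum pass only shows that the number is well formed, which is why it works as a false-positive filter rather than as proof that a card exists.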

National ID Types by Country

| Country | Language Code | ID Type | Validation |
|---|---|---|---|
| Korea | ko | Resident Registration Number (RRN) | Modulo-11 checksum |
| USA | en-US | Social Security Number (SSN) | Area/Group validation |
| UK | en-GB | National Insurance Number (NINO) | Prefix/Suffix validation |
| Japan | ja | My Number | Check digit validation |
| China | zh-CN | ID Card Number | ISO 7064 MOD 11-2 |
| Germany | de | Personalausweis / Steuer-ID | Check digit validation |
| France | fr | INSEE Number | Modulo-97 validation |
| Spain | es | DNI / NIE | Check letter validation |
| Brazil | pt-BR | CPF | Dual Modulo-11 |
| Italy | it | Codice Fiscale | Check character validation |
| India | hi | Aadhaar | Verhoeff checksum |
| Canada | en-CA | Social Insurance Number (SIN) | Luhn algorithm |
| Australia | en-AU | Tax File Number (TFN) | Weighted sum mod 11 |
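
As an example of the checksum style listed above, here is a minimal standalone sketch of the commonly documented Modulo-11 check for Korean RRNs: the first 12 digits are weighted 2 to 9 and then 2 to 5, and the 13th digit must equal (11 - sum mod 11) mod 10. `IsValidRrnChecksum` is a hypothetical helper, the sample number is made up, and the sketch does not validate the birth date or gender digit as a full detector presumably would.

```csharp
using System;
using System.Linq;

// Hypothetical helper: checks only the final check digit of a 13-digit RRN.
bool IsValidRrnChecksum(string value)
{
    // Keep ASCII digits only, so the dash in "YYMMDD-NNNNNNN" is ignored.
    var digits = value.Where(c => c >= '0' && c <= '9').Select(c => c - '0').ToArray();
    if (digits.Length != 13) return false;

    int[] weights = { 2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5 };
    var sum = 0;
    for (var i = 0; i < weights.Length; i++)
        sum += digits[i] * weights[i];

    var expected = (11 - sum % 11) % 10;
    return digits[12] == expected;
}

Console.WriteLine(IsValidRrnChecksum("901231-1234563")); // True  (weighted sum 151, check digit 3)
Console.WriteLine(IsValidRrnChecksum("901231-1234564")); // False
```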

Configuration Options

ChunkOptions

var options = new ChunkOptions
{
    Strategy = ChunkingStrategy.Sentence,
    TargetChunkSize = 512,
    MinChunkSize = 100,
    MaxChunkSize = 1024,
    OverlapSize = 50,
    LanguageCode = "ko",  // null = auto-detect
    PreserveSentences = true,
    PreserveParagraphs = true,
    SemanticSimilarityThreshold = 0.5f
};

// Preset configurations
ChunkOptions.Default       // General purpose
ChunkOptions.ForRAG        // Optimized for RAG (512 target, semantic)
ChunkOptions.FixedSize(256, 32)  // Fixed token size with overlap
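
To make the relationship between chunk size and overlap concrete, here is a minimal standalone sketch of fixed-size chunking with overlap. It is an illustration only: `ChunkFixedSize` is a hypothetical helper, and it treats whitespace-separated words as tokens, whereas FluxCurator estimates tokens per language.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: windows of chunkSize tokens, each sharing `overlap`
// tokens with the previous window.
List<string> ChunkFixedSize(string text, int chunkSize, int overlap)
{
    if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
    if (overlap < 0 || overlap >= chunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));

    var tokens = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    var chunks = new List<string>();
    var step = chunkSize - overlap;  // how far each window advances

    for (var start = 0; start < tokens.Length; start += step)
    {
        var length = Math.Min(chunkSize, tokens.Length - start);
        chunks.Add(string.Join(" ", tokens, start, length));
        if (start + length >= tokens.Length) break;  // last window reached the end
    }
    return chunks;
}

foreach (var piece in ChunkFixedSize("a b c d e f g", 3, 1))
    Console.WriteLine(piece);
// a b c
// c d e
// e f g
```

The overlap must stay smaller than the chunk size, otherwise the window would never advance.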

Masking Strategies

| Strategy | Example Output |
|---|---|
| Token | [EMAIL], [PHONE] |
| Asterisk | ****@****.com |
| Redact | [REDACTED] |
| Partial | jo**@ex****.com |
| Hash | [HASH:a1b2c3d4] |
| Remove | (empty) |

Extensibility

FluxCurator is designed for extensibility. You can add custom PII detectors for your specific needs.

Custom PII Detector

Implement IPIIDetector or extend PIIDetectorBase for pattern-based detection:

using FluxCurator.Core.Core;
using FluxCurator.Core.Domain;
using FluxCurator.Core.Infrastructure.PII;

public class EmployeeIdDetector : PIIDetectorBase
{
    public override PIIType PIIType => PIIType.Custom;
    public override string Name => "Employee ID Detector";

    // Pattern: EMP-123456
    protected override string Pattern => @"EMP-\d{6}";

    protected override bool ValidateMatch(string value, out float confidence)
    {
        confidence = 0.95f;
        return true;
    }
}

// Register and use via PIIMasker
var masker = new PIIMasker(PIIMaskingOptions.Default);
masker.RegisterDetector(new EmployeeIdDetector());

var result = masker.Mask("Contact employee EMP-123456 for details.");
// Output: "Contact employee [PII] for details."

// Or register directly via FluxCurator
var curator = new FluxCurator()
    .WithPIIMasking()
    .RegisterPIIDetector(new EmployeeIdDetector());

var curatorResult = curator.MaskPII("Contact employee EMP-123456 for details.");
// Output: "Contact employee [PII] for details."

Custom National ID Detector

Extend NationalIdDetectorBase to add support for additional countries:

using System.Linq; // for normalized.All(...)
using FluxCurator.Core.Core;
using FluxCurator.Core.Infrastructure.PII.NationalId;

public class IndiaAadhaarDetector : NationalIdDetectorBase
{
    public override string LanguageCode => "hi";
    public override string NationalIdType => "Aadhaar";
    public override string FormatDescription => "12 digits with optional spaces";
    public override string CountryName => "India";
    public override string Name => "India Aadhaar Detector";

    // Pattern: 1234 5678 9012 or 123456789012
    protected override string Pattern => @"\d{4}\s?\d{4}\s?\d{4}";

    protected override bool ValidateMatch(string value, out float confidence)
    {
        var normalized = NormalizeValue(value);

        if (normalized.Length != 12 || !normalized.All(char.IsDigit))
        {
            confidence = 0.0f;
            return false;
        }

        // Implement Verhoeff checksum validation
        if (!ValidateVerhoeffChecksum(normalized))
        {
            confidence = 0.6f;
            return true; // Still flag as PII
        }

        confidence = 0.98f;
        return true;
    }

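    private static bool ValidateVerhoeffChecksum(string number)
    {
        // Placeholder: the Verhoeff algorithm is not implemented in this example,
        // so every 12-digit value passes until real validation is added here.
        // ...
        return true;
    }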
}

// Register with the national ID registry
var registry = new NationalIdRegistry();
registry.Register(new IndiaAadhaarDetector());

var masker = new PIIMasker(
    PIIMaskingOptions.ForLanguage("hi"),
    registry);
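
The `ValidateVerhoeffChecksum` method above is a stub. One possible way to fill it in is the standard Verhoeff check shown below, written as a standalone sketch with the usual multiplication and permutation tables. The `ValidateVerhoeff` name is hypothetical, and this is not taken from FluxCurator's source.

```csharp
using System;

// Multiplication table of the dihedral group D5.
int[,] d =
{
    { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 },
    { 1, 2, 3, 4, 0, 6, 7, 8, 9, 5 },
    { 2, 3, 4, 0, 1, 7, 8, 9, 5, 6 },
    { 3, 4, 0, 1, 2, 8, 9, 5, 6, 7 },
    { 4, 0, 1, 2, 3, 9, 5, 6, 7, 8 },
    { 5, 9, 8, 7, 6, 0, 4, 3, 2, 1 },
    { 6, 5, 9, 8, 7, 1, 0, 4, 3, 2 },
    { 7, 6, 5, 9, 8, 2, 1, 0, 4, 3 },
    { 8, 7, 6, 5, 9, 3, 2, 1, 0, 4 },
    { 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 },
};

// Position-dependent permutation table (repeats every 8 positions).
int[,] p =
{
    { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 },
    { 1, 5, 7, 6, 2, 8, 3, 0, 9, 4 },
    { 5, 8, 0, 3, 7, 9, 6, 1, 4, 2 },
    { 8, 9, 1, 6, 0, 4, 3, 5, 2, 7 },
    { 9, 4, 5, 3, 1, 2, 6, 8, 7, 0 },
    { 4, 2, 8, 6, 5, 7, 3, 9, 0, 1 },
    { 2, 7, 9, 3, 8, 0, 6, 4, 1, 5 },
    { 7, 0, 4, 6, 9, 1, 3, 2, 5, 8 },
};

// Hypothetical helper: true when a digit string (check digit last) passes the Verhoeff check.
bool ValidateVerhoeff(string number)
{
    if (number.Length == 0) return false;

    var c = 0;
    for (var i = 0; i < number.Length; i++)
    {
        var ch = number[number.Length - 1 - i];  // process digits right to left
        if (ch < '0' || ch > '9') return false;
        c = d[c, p[i % 8, ch - '0']];
    }
    return c == 0;
}

Console.WriteLine(ValidateVerhoeff("2363")); // True  (236 with check digit 3)
Console.WriteLine(ValidateVerhoeff("2364")); // False
```

Inside the detector, the same logic would run on the normalized 12-digit value; the tables could be stored as static readonly fields of the class.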

Dependency Injection with Custom Detectors

// Register custom registry with additional detectors
services.AddSingleton<INationalIdRegistry>(sp =>
{
    var registry = new NationalIdRegistry();
    registry.Register(new IndiaAadhaarDetector());
    registry.Register(new CanadaSINDetector());
    registry.Register(new AustraliaTFNDetector());
    return registry;
});

// Register PIIMasker with custom registry
services.AddScoped<IPIIMasker>(sp =>
{
    var registry = sp.GetRequiredService<INationalIdRegistry>();
    var options = PIIMaskingOptions.ForLanguages("en", "hi");
    return new PIIMasker(options, registry);
});

Extension Points Summary

| Interface | Base Class | Purpose |
|---|---|---|
| IPIIDetector | PIIDetectorBase | General PII detection (email, phone, custom) |
| INationalIdDetector | NationalIdDetectorBase | Country-specific national ID detection |
| INationalIdRegistry | NationalIdRegistry | Manage and look up national ID detectors |
| IPIIMasker | PIIMasker | Coordinate detection and masking |

Integration with Iyulab Ecosystem

FluxCurator is part of the Iyulab open-source RAG ecosystem:

┌─────────────────────────────────────────────────────────────┐
│                    Foundation Layer                          │
├─────────────────────────────────────────────────────────────┤
│  LocalEmbedder    LocalReranker    FluxCurator  FluxImprover│
│  (Embeddings)     (Reranking)      (Chunking)   (LLM-based) │
└───────────┬───────────────────────────┬─────────────────────┘
            │                           │
            ▼                           ▼
┌───────────────────────────────────────────────────────────────┐
│                    Processing Layer                           │
├───────────────────────────────────────────────────────────────┤
│        FileFlux (Document Processing)    WebFlux (Web)        │
└───────────────────────────┬───────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────────────┐
│                    Storage Layer                              │
├───────────────────────────────────────────────────────────────┤
│                    FluxIndex (Vector DB)                      │
└───────────────────────────┬───────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────────────┐
│                    Application Layer                          │
├───────────────────────────────────────────────────────────────┤
│                        Filer (App)                            │
└───────────────────────────────────────────────────────────────┘

FileFlux Integration

using FileFlux.Infrastructure.Strategies;
using FileFlux.Infrastructure.Adapters;

// Use FluxCurator chunking in FileFlux
var chunkerFactory = new ChunkerFactory(embedder);
var strategy = new FluxCuratorChunkingStrategy(
    chunkerFactory,
    ChunkingStrategy.Hierarchical);

var chunks = await strategy.ChunkAsync(documentContent, options);

// Convert between chunk types
var fileFluxChunks = fluxCuratorChunks.ToFileFluxChunks();
var curatorChunks = fileFluxChunks.ToFluxCuratorChunks();

Project Structure

FluxCurator/
├── src/
│   ├── FluxCurator.Core/              # Zero-dependency core
│   │   ├── Core/                      # Interfaces
│   │   │   ├── IChunker.cs
│   │   │   ├── IChunkerFactory.cs
│   │   │   ├── IEmbedder.cs
│   │   │   └── ILanguageProfile.cs
│   │   ├── Domain/                    # Models
│   │   │   ├── ChunkOptions.cs
│   │   │   ├── DocumentChunk.cs
│   │   │   ├── ChunkingStrategy.cs
│   │   │   └── PIIMaskingOptions.cs
│   │   └── Infrastructure/            # Implementations
│   │       ├── Chunking/
│   │       │   ├── ChunkerBase.cs
│   │       │   ├── SentenceChunker.cs
│   │       │   ├── ParagraphChunker.cs
│   │       │   ├── TokenChunker.cs
│   │       │   ├── SemanticChunker.cs
│   │       │   └── HierarchicalChunker.cs
│   │       └── Languages/
│   │           ├── LanguageProfileRegistry.cs
│   │           ├── KoreanLanguageProfile.cs
│   │           └── EnglishLanguageProfile.cs
│   │
│   └── FluxCurator/                   # Main package
│       ├── Infrastructure/
│       │   └── Chunking/
│       │       └── ChunkerFactory.cs  # Factory with all strategies
│       ├── ServiceCollectionExtensions.cs
│       └── FluxCurator.cs             # Main API
│
└── docs/                              # Documentation
    ├── getting-started.md
    ├── chunking-strategies.md
    ├── di-integration.md
    └── fileflux-integration.md

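Documentation

Guides are kept in the repository's docs/ folder (see Project Structure): getting-started.md, chunking-strategies.md, di-integration.md, and fileflux-integration.md.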

Roadmap

  • Core chunking strategies (Sentence, Paragraph, Token)
  • 11 language profiles for text processing
  • Language detection
  • Batch processing
  • Multilingual PII masking (10 countries)
  • Content filtering
  • Semantic chunking
  • Hierarchical chunking
  • Dependency Injection support
  • FileFlux integration
  • Text refinement with Korean support
  • Additional national ID detectors (India Aadhaar, Canada SIN, Australia TFN)
  • Additional language profiles (Vietnamese, Thai)
  • Custom detector registration via RegisterPIIDetector
  • Streaming chunk support via ChunkStreamAsync

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.


Part of the Iyulab Open Source Ecosystem

Target frameworks: net10.0 is compatible. Computed as compatible: net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows.

NuGet packages (1)

Showing the top 1 NuGet packages that depend on FluxCurator:

Package Downloads
FileFlux

Complete document processing SDK optimized for RAG systems. Transform PDF, DOCX, Excel, PowerPoint, Markdown and other formats into high-quality chunks with intelligent semantic boundary detection. Includes advanced chunking strategies, metadata extraction, and performance optimization.

GitHub repositories

This package is not used by any popular GitHub repositories.

| Version | Downloads | Last Updated |
|---|---|---|
| 0.5.0 | 584 | 12/1/2025 |
| 0.4.0 | 471 | 12/1/2025 |
| 0.3.0 | 463 | 12/1/2025 |
| 0.2.0 | 400 | 12/1/2025 |
| 0.1.0 | 331 | 11/30/2025 |