WebFlux

AI-Optimized Web Content Processing SDK for RAG Systems


🎯 Overview

WebFlux is a .NET 9 RAG preprocessing SDK, powered by its Web Intelligence Engine, that transforms web content into AI-friendly chunks.

🧠 Web Intelligence Engine (Phase 4-5B Complete)

Achieves a 60% crawling-efficiency improvement and AI-driven intelligent chunking through integrated analysis of 15 web metadata standards (a discovery sketch follows the list):

πŸ€– AI-Friendly Standards
  • πŸ€– llms.txt: Site structure guide for AI agents
  • 🧠 ai.txt: AI usage policies and ethical guidelines
  • πŸ“± manifest.json: PWA metadata and app information
  • πŸ€– robots.txt: RFC 9309 compliant crawling rules
πŸ—οΈ Structural Intelligence
  • πŸ—ΊοΈ sitemap.xml: XML/Text/RSS/Atom support with URL pattern analysis
  • πŸ“‹ README.md: Project structure and documentation analysis
  • βš™οΈ _config.yml: Jekyll/Hugo site configuration analysis
  • πŸ“¦ package.json: Node.js project metadata
πŸ”’ Security & Compliance
  • πŸ” security.txt: Security policies and contact information
  • πŸ›‘οΈ .well-known: Standard metadata directory
  • πŸ“Š ads.txt: Advertising policies and partnerships
  • 🏒 humans.txt: Team and contributor information
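
WebFlux discovers these files automatically during crawling. As a rough illustration of what that discovery step amounts to (not the engine's actual internals), a consumer could probe the same well-known locations by hand:

using System.Net.Http;

var client = new HttpClient();
var baseUri = new Uri("https://docs.example.com");   // placeholder target site
string[] wellKnownPaths =
[
    "/llms.txt", "/ai.txt", "/manifest.json", "/robots.txt",
    "/sitemap.xml", "/.well-known/security.txt", "/ads.txt", "/humans.txt"
];

foreach (var path in wellKnownPaths)
{
    // HEAD keeps the probe cheap; a success status means the standard is published.
    using var request = new HttpRequestMessage(HttpMethod.Head, new Uri(baseUri, path));
    using var response = await client.SendAsync(request);
    Console.WriteLine($"{path}: {(response.IsSuccessStatusCode ? "found" : "absent")}");
}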

πŸ—οΈ Architecture Principle: Interface Provider

βœ… What WebFlux Provides:
  • 🧠 Web Intelligence: Integrated analysis of 15 metadata standards
  • πŸ•·οΈ Intelligent Crawling: Metadata-driven prioritization and optimization
  • πŸ“„ Advanced Content Extraction: 70% accuracy improvement using structural intelligence
  • πŸ”Œ AI Interfaces: Clean interface design for provider independence
  • πŸŽ›οΈ Processing Pipeline: Metadata Discovery β†’ Intelligent Crawling β†’ Optimized Chunking
❌ What WebFlux Does NOT Provide:
  • AI Service Implementations: Specific AI provider implementations excluded
  • Vector Generation: Embeddings are the consumer application's responsibility (see the sketch below)
  • Data Storage: Vector DB implementations excluded
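
Because vector generation is deliberately left to the consumer, the application registers its own ITextEmbeddingService. A minimal sketch, assuming a GenerateAsync(text, cancellationToken) shape inferred from the Quick Start usage below (verify against the shipped interface) and delegating to the official OpenAI .NET client purely for illustration:

using OpenAI;
using OpenAI.Embeddings;

public class YourEmbeddingService : ITextEmbeddingService
{
    private readonly OpenAIClient _client;

    public YourEmbeddingService(string apiKey) => _client = new OpenAIClient(apiKey);

    // Signature assumed from the Quick Start usage; check the actual interface.
    public async Task<float[]> GenerateAsync(
        string text,
        CancellationToken cancellationToken = default)
    {
        var embeddingClient = _client.GetEmbeddingClient("text-embedding-3-small");
        var embedding = await embeddingClient.GenerateEmbeddingAsync(
            text, cancellationToken: cancellationToken);
        return embedding.Value.ToFloats().ToArray();
    }
}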

✨ Key Features

  • 🧠 Web Intelligence Engine: Metadata-driven intelligent analysis
  • πŸ€– AI-Driven Auto Chunking: Phase 5B intelligent strategy selection with quality evaluation
  • πŸ“¦ Single NuGet Package: Easy installation with dotnet add package WebFlux
  • 🎯 Ethical AI Crawling: Responsible data collection through ai.txt standards
  • πŸ“± PWA Detection: Web app optimization through manifest.json analysis
  • πŸ•·οΈ RFC-Compliant Crawling: Full support for robots.txt, sitemap.xml
  • πŸ“„ 15 Standards Support: Integrated web metadata analysis
  • πŸŽ›οΈ 7 Chunking Strategies: Auto, Smart, Intelligent, MemoryOptimized, Semantic, Paragraph, FixedSize
  • πŸ–ΌοΈ Multimodal Processing: Text + Image β†’ Unified text conversion
  • ⚑ Parallel Processing: Dynamic scaling with memory backpressure control
  • πŸ“Š Real-time Streaming: Intelligent caching with real-time chunk delivery
  • πŸ” Quality Evaluation: 4-factor quality assessment with intelligent caching
  • πŸ—οΈ Clean Architecture: Dependency inversion with guaranteed extensibility

πŸš€ Quick Start

Installation

NuGet

Package Manager Console:

Install-Package WebFlux

dotnet CLI:

dotnet add package WebFlux

PackageReference (.csproj):

<PackageReference Include="WebFlux" Version="0.1.0" />

Basic Usage

using WebFlux;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Required services (implemented by consumer application)
services.AddScoped<ITextCompletionService, YourLLMService>();        // LLM service
services.AddScoped<ITextEmbeddingService, YourEmbeddingService>();   // Embedding service

// Optional: Image-to-text service (for multimodal processing)
services.AddScoped<IImageToTextService, YourVisionService>();

// Or use OpenAI services (requires API key in environment variables)
// services.AddWebFluxOpenAIServices();

// Or use Mock services for testing
// services.AddWebFluxMockAIServices();

// Consumer application manages vector store
services.AddScoped<IVectorStore, YourVectorStore>();                // Vector storage

// Register WebFlux services (includes parallel processing and streaming engine)
services.AddWebFlux();

var provider = services.BuildServiceProvider();
var processor = provider.GetRequiredService<IWebContentProcessor>();
var embeddingService = provider.GetRequiredService<ITextEmbeddingService>();
var vectorStore = provider.GetRequiredService<IVectorStore>();

// Streaming processing (recommended - memory efficient, parallel optimized)
var crawlOptions = new CrawlOptions
{
    MaxDepth = 3,                    // Maximum crawling depth
    MaxPages = 100,                  // Maximum number of pages
    RespectRobotsTxt = true,         // Respect robots.txt
    DelayBetweenRequests = TimeSpan.FromMilliseconds(500)
};

await foreach (var result in processor.ProcessWithProgressAsync("https://docs.example.com", crawlOptions))
{
    if (result.IsSuccess && result.Result != null)
    {
        foreach (var chunk in result.Result)
        {
            Console.WriteLine($"πŸ“„ URL: {chunk.SourceUrl}");
            Console.WriteLine($"   Chunk {chunk.ChunkIndex}: {chunk.Content.Length} characters");

            // RAG pipeline: Generate embedding β†’ Store in vector database
            var embedding = await embeddingService.GenerateAsync(chunk.Content);
            await vectorStore.StoreAsync(new {
                Id = chunk.Id,
                Content = chunk.Content,
                Metadata = chunk.Metadata,
                Vector = embedding,
                SourceUrl = chunk.SourceUrl
            });
        }
    }
}

Step-by-Step Processing (Advanced Usage)

// Use when you want individual control over each stage

// Stage 1: Web Crawling (Crawler)
var crawlResults = await processor.CrawlAsync("https://docs.example.com", crawlOptions);
Console.WriteLine($"Crawled pages: {crawlResults.Count()}");

// Stage 2: Content Extraction (Extractor)
var extractedContents = new List<RawWebContent>();
foreach (var crawlResult in crawlResults)
{
    var rawContent = await processor.ExtractAsync(crawlResult.Url);
    extractedContents.Add(rawContent);
}

// Stage 3: Structural Analysis (Parser with LLM)
var parsedContents = new List<ParsedWebContent>();
foreach (var rawContent in extractedContents)
{
    var parsedContent = await processor.ParseAsync(rawContent);
    parsedContents.Add(parsedContent);
}

// Stage 4: Chunking (Chunking Strategy)
var allChunks = new List<WebContentChunk>();
foreach (var parsedContent in parsedContents)
{
    var chunks = await processor.ChunkAsync(parsedContent, new ChunkingOptions
    {
        Strategy = "Auto",   // Phase 5B AI-driven optimization (recommended)
        MaxChunkSize = 512,
        OverlapSize = 64
    });
    allChunks.AddRange(chunks);
}

Console.WriteLine($"Total chunks generated: {allChunks.Count}");

// Stage 5: RAG Pipeline (Embedding β†’ Storage)
foreach (var chunk in allChunks)
{
    var embedding = await embeddingService.GenerateAsync(chunk.Content);
    await vectorStore.StoreAsync(new {
        Id = chunk.Id,
        Content = chunk.Content,
        Metadata = chunk.Metadata,
        Vector = embedding,
        SourceUrl = chunk.SourceUrl
    });
}

Supported Content Formats

  • HTML (.html, .htm) - DOM structure analysis and content extraction
  • Markdown (.md) - Structure preservation
  • JSON (.json) - API responses and structured data
  • XML (.xml) - Including RSS/Atom feeds
  • RSS/Atom feeds - News and blog content
  • PDF (web-hosted) - Online document processing

πŸ•·οΈ Crawling Strategy Guide

Crawling Options

var crawlOptions = new CrawlOptions
{
    // Basic settings
    MaxDepth = 3,                                    // Maximum crawling depth
    MaxPages = 100,                                  // Maximum number of pages
    DelayBetweenRequests = TimeSpan.FromSeconds(1),  // Delay between requests

    // Compliance and courtesy
    RespectRobotsTxt = true,                         // Respect robots.txt
    UserAgent = "WebFlux/1.0 (+https://your-site.com/bot)", // User-Agent

    // Filtering
    AllowedDomains = ["docs.example.com", "help.example.com"], // Allowed domains
    ExcludePatterns = ["/admin/", "/private/", "*.pdf"],        // Exclude patterns
    IncludePatterns = ["/docs/", "/help/", "/api/"],            // Include patterns

    // Advanced settings
    MaxConcurrentRequests = 5,                       // Concurrent requests
    Timeout = TimeSpan.FromSeconds(30),              // Request timeout
    RetryCount = 3,                                  // Retry count

    // Content filters
    MinContentLength = 100,                          // Minimum content length
    MaxContentLength = 1000000,                      // Maximum content length
};

Crawling Strategies

| Strategy | Description | Optimal Use Case |
| --- | --- | --- |
| BreadthFirst | Breadth-first search | Need site-wide overview |
| DepthFirst | Depth-first search | Focus on specific sections |
| Intelligent | LLM-based prioritization | High-quality content first |
| Sitemap | sitemap.xml based | Structured sites |
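
Assuming CrawlOptions exposes the strategy by name as a string (the same pattern used in the RAG integration example later in this README), selecting one looks like:

var sitemapFirst = new CrawlOptions
{
    Strategy = "Sitemap",   // drive URL discovery from sitemap.xml
    MaxDepth = 2,
    MaxPages = 200
};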

πŸŽ›οΈ Chunking Strategy Guide

Strategy Selection Guide

| Strategy | Optimal Use Case | Quality Score | Memory Usage | Status |
| --- | --- | --- | --- | --- |
| Auto πŸ€– (recommended) | All web content, AI-driven automatic optimization | ⭐⭐⭐⭐⭐ | 🟑 Medium | βœ… Phase 5B Complete |
| Smart 🧠 | HTML docs, API docs, structured content | ⭐⭐⭐⭐⭐ | 🟑 Medium | βœ… Complete |
| Semantic πŸ” | General web pages, articles, semantic consistency | ⭐⭐⭐⭐⭐ | 🟑 Medium | βœ… Complete |
| Intelligent πŸ’‘ | Blogs, news, knowledge bases | ⭐⭐⭐⭐⭐ | πŸ”΄ High | βœ… Complete |
| MemoryOptimized ⚑ | Large-scale sites, server environments | ⭐⭐⭐⭐⭐ | 🟒 Low (84% reduction) | βœ… Complete |
| Paragraph πŸ“„ | Markdown docs, wikis, paragraph structure preservation | ⭐⭐⭐⭐ | 🟒 Low | βœ… Complete |
| FixedSize πŸ“ | Uniform processing, test environments | ⭐⭐⭐ | 🟒 Low | βœ… Complete |
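
For example, a large-scale crawl on a memory-constrained server might trade Auto for MemoryOptimized, using the same option names as the Quick Start:

var chunkingOptions = new ChunkingOptions
{
    Strategy = "MemoryOptimized",   // low-memory row from the table above
    MaxChunkSize = 512,
    OverlapSize = 64
};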

⚑ Enterprise-Grade Performance Optimization


πŸš€ Parallel Crawling Engine

  • Dynamic per-core scaling: automatically expands to match system resources
  • Memory backpressure control: high-performance async processing built on System.Threading.Channels (a sketch follows this list)
  • Intelligent work distribution: optimal allocation based on page size and complexity
  • Deduplication: automatic duplicate-page filtering via URL hashes
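
As a rough sketch of what Channels-based backpressure means (illustrative only, not the engine's actual internals): a bounded channel makes producers wait when consumers fall behind, which caps memory growth regardless of crawl speed.

using System.Net.Http;
using System.Threading.Channels;

var httpClient = new HttpClient();
string[] urlsToCrawl = ["https://docs.example.com"];   // hypothetical work list

// Bounded capacity: writers await once 100 pages are queued, so crawling
// automatically throttles to the speed of downstream processing.
var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(100)
{
    FullMode = BoundedChannelFullMode.Wait
});

var producer = Task.Run(async () =>
{
    foreach (var url in urlsToCrawl)
    {
        var html = await httpClient.GetStringAsync(url);
        await channel.Writer.WriteAsync(html);   // awaits while the channel is full
    }
    channel.Writer.Complete();
});

// The consumer drains at its own pace; memory stays bounded by channel capacity.
await foreach (var html in channel.Reader.ReadAllAsync())
{
    Console.WriteLine($"processing {html.Length} characters");
}
await producer;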

πŸ“Š 슀트리밍 μ΅œμ ν™”

  • μ‹€μ‹œκ°„ 청크 λ°˜ν™˜: AsyncEnumerable 기반 μ¦‰μ‹œ κ²°κ³Ό 제곡
  • LRU μΊμ‹œ μ‹œμŠ€ν…œ: URL ν•΄μ‹œ 기반 μžλ™ 캐싱 및 만료 관리
  • μΊμ‹œ μš°μ„  검사: 동일 νŽ˜μ΄μ§€ 재처리 μ‹œ μ¦‰μ‹œ λ°˜ν™˜

πŸ“ˆ κ²€μ¦λœ μ„±λŠ₯ μ§€ν‘œ

  • 크둀링 속도: 100νŽ˜μ΄μ§€/λΆ„ (평균 1MB νŽ˜μ΄μ§€ κΈ°μ€€)
  • λ©”λͺ¨λ¦¬ 효율: νŽ˜μ΄μ§€ 크기 1.5λ°° μ΄ν•˜ λ©”λͺ¨λ¦¬ μ‚¬μš©, MemoryOptimized μ „λž΅μœΌλ‘œ 84% μ ˆμ•½
  • ν’ˆμ§ˆ 보μž₯: 청크 완성도 81%, μ»¨ν…μŠ€νŠΈ 보쑴 75%+ 달성
  • AI 기반 μ΅œμ ν™”: Phase 5B Auto μ „λž΅μœΌλ‘œ 4μš”μ†Œ ν’ˆμ§ˆ 평가 및 μ§€λŠ₯ν˜• μ „λž΅ 선택
  • μ§€λŠ₯ν˜• 캐싱: ν’ˆμ§ˆ 기반 μΊμ‹œ 만료 (κ³ ν’ˆμ§ˆ 4μ‹œκ°„, μ €ν’ˆμ§ˆ 1μ‹œκ°„)
  • μ‹€μ‹œκ°„ λͺ¨λ‹ˆν„°λ§: OpenTelemetry 톡합, μ„±λŠ₯ 좔적 및 였λ₯˜ 감지
  • 병렬 ν™•μž₯: CPU μ½”μ–΄ μˆ˜μ— λ”°λ₯Έ μ„ ν˜• μ„±λŠ₯ ν–₯상
  • λΉŒλ“œ μ•ˆμ •μ„±: 38개 였λ₯˜ β†’ 0개 였λ₯˜λ‘œ 100% 컴파일 성곡
  • ν…ŒμŠ€νŠΈ 컀버리지: 90% ν…ŒμŠ€νŠΈ 컀버리지, ν”„λ‘œλ•μ…˜ μ•ˆμ •μ„± 검증

πŸ”§ Advanced Usage

Example LLM Service Implementation (GPT-5-nano)

public class OpenAiTextCompletionService : ITextCompletionService
{
    private readonly OpenAIClient _client;

    public OpenAiTextCompletionService(string apiKey)
    {
        _client = new OpenAIClient(apiKey);
    }

    public async Task<string> CompleteAsync(
        string prompt,
        TextCompletionOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        var chatClient = _client.GetChatClient("gpt-5-nano"); // use the latest model

        var response = await chatClient.CompleteChatAsync(
            [new UserChatMessage(prompt)],
            new ChatCompletionOptions
            {
                MaxOutputTokenCount = options?.MaxTokens ?? 2000,
                Temperature = options?.Temperature ?? 0.3f
            },
            cancellationToken);

        return response.Value.Content[0].Text;
    }
}
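
Registering the implementation against WebFlux's interface then follows the Quick Start pattern; reading the key from an environment variable is an assumption, not a requirement:

services.AddScoped<ITextCompletionService>(_ =>
    new OpenAiTextCompletionService(
        Environment.GetEnvironmentVariable("OPENAI_API_KEY")
            ?? throw new InvalidOperationException("OPENAI_API_KEY is not set")));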

Multimodal Processing: Extracting Text from Web Images

public class OpenAiImageToTextService : IImageToTextService
{
    private readonly OpenAIClient _client;
    private readonly HttpClient _httpClient;

    public OpenAiImageToTextService(string apiKey, HttpClient httpClient)
    {
        _client = new OpenAIClient(apiKey);
        _httpClient = httpClient;
    }

    public async Task<ImageToTextResult> ExtractTextFromWebImageAsync(
        string imageUrl,
        ImageToTextOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // Download the web image
        var imageData = await _httpClient.GetByteArrayAsync(imageUrl, cancellationToken);
        
        var chatClient = _client.GetChatClient("gpt-5-nano");

        var messages = new List<ChatMessage>
        {
            new SystemChatMessage("μ›ΉνŽ˜μ΄μ§€ μ΄λ―Έμ§€μ—μ„œ λͺ¨λ“  ν…μŠ€νŠΈλ₯Ό μ •ν™•νžˆ μΆ”μΆœν•˜μ„Έμš”."),
            new UserChatMessage(ChatMessageContentPart.CreateImagePart(
                BinaryData.FromBytes(imageData), "image/jpeg"))
        };

        var response = await chatClient.CompleteChatAsync(messages, new ChatCompletionOptions
        {
            MaxOutputTokenCount = 1000,
            Temperature = 0.1f
        }, cancellationToken);

        return new ImageToTextResult
        {
            ExtractedText = response.Value.Content[0].Text,
            Confidence = 0.95,
            IsSuccess = true,
            SourceUrl = imageUrl
        };
    }
}

RAG νŒŒμ΄ν”„λΌμΈ 톡합

public class WebRagService
{
    private readonly IWebContentProcessor _processor;
    private readonly ITextEmbeddingService _embeddingService;
    private readonly IVectorStore _vectorStore;

    public WebRagService(
        IWebContentProcessor processor,
        ITextEmbeddingService embeddingService,
        IVectorStore vectorStore)
    {
        _processor = processor;
        _embeddingService = embeddingService;
        _vectorStore = vectorStore;
    }

    public async Task IndexWebsiteAsync(string baseUrl, CrawlOptions? crawlOptions = null)
    {
        crawlOptions ??= new CrawlOptions
        {
            MaxDepth = 3,
            MaxPages = 100,
            Strategy = "Intelligent"
        };

        var chunkingOptions = new ChunkingOptions
        {
            Strategy = "Auto",   // Phase 5B AI 기반 μžλ™ μ΅œμ ν™” (ꢌμž₯)
            MaxChunkSize = 512,
            OverlapSize = 64
        };

        await foreach (var result in _processor.ProcessWithProgressAsync(baseUrl, crawlOptions, chunkingOptions))
        {
            if (result.IsSuccess && result.Result != null)
            {
                foreach (var chunk in result.Result)
                {
                    // Generate the embedding and store it
                    var embedding = await _embeddingService.GenerateAsync(chunk.Content);
                    await _vectorStore.StoreAsync(new VectorDocument
                    {
                        Id = chunk.Id,
                        Content = chunk.Content,
                        Metadata = chunk.Metadata,
                        Vector = embedding,
                        SourceUrl = chunk.SourceUrl,
                        CrawledAt = DateTime.UtcNow
                    });
                }
            }

            // Report progress
            if (result.Progress != null)
            {
                Console.WriteLine($"크둀링 μ§„ν–‰λ₯ : {result.Progress.PagesProcessed}/{result.Progress.TotalPages}");
                Console.WriteLine($"μ²­ν‚Ή μ§„ν–‰λ₯ : {result.Progress.PercentComplete:F1}%");
                if (result.Progress.EstimatedRemainingTime.HasValue)
                {
                    Console.WriteLine($"μ˜ˆμƒ 남은 μ‹œκ°„: {result.Progress.EstimatedRemainingTime.Value:mm\\:ss}");
                }
            }
        }
    }

    public async Task UpdateWebsiteContentAsync(string baseUrl)
    {
        // Incremental update: reprocess only pages changed since the last crawl
        var lastCrawlTime = await _vectorStore.GetLastCrawlTimeAsync(baseUrl);
        
        var crawlOptions = new CrawlOptions
        {
            MaxDepth = 3,
            IfModifiedSince = lastCrawlTime,
            Strategy = "Intelligent"
        };

        await IndexWebsiteAsync(baseUrl, crawlOptions);
    }
}
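
A usage sketch: register the service alongside the Quick Start dependencies and kick off an index run (the registration lines are assumptions continuing that setup):

services.AddScoped<WebRagService>();
var provider = services.BuildServiceProvider();

var ragService = provider.GetRequiredService<WebRagService>();
await ragService.IndexWebsiteAsync("https://docs.example.com");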

μ»€μŠ€ν…€ μ½˜ν…μΈ  μΆ”μΆœκΈ°

public class CustomContentExtractor : IContentExtractor
{
    public string ExtractorType => "CustomExtractor";
    public IEnumerable<string> SupportedContentTypes => ["application/custom", "text/custom"];

    public bool CanExtract(string contentType, string url) =>
        contentType.StartsWith("application/custom") || url.Contains("custom-api");

    public async Task<RawWebContent> ExtractAsync(
        string url, 
        HttpResponseMessage response, 
        CancellationToken cancellationToken = default)
    {
        var content = await response.Content.ReadAsStringAsync(cancellationToken);
        
        // Custom parsing logic
        var parsedContent = ParseCustomFormat(content);
        
        return new RawWebContent
        {
            Url = url,
            Content = parsedContent,
            ContentType = response.Content.Headers.ContentType?.MediaType ?? "application/custom",
            Metadata = new WebContentMetadata
            {
                Title = ExtractTitle(parsedContent),
                Description = ExtractDescription(parsedContent),
                Keywords = ExtractKeywords(parsedContent),
                LastModified = response.Content.Headers.LastModified?.DateTime,
                ContentLength = content.Length,
                Properties = new Dictionary<string, object>
                {
                    ["CustomProperty"] = "CustomValue"
                }
            }
        };
    }

    private string ParseCustomFormat(string content) => content; // implementation required
    private string ExtractTitle(string content) => ""; // implementation required
    private string ExtractDescription(string content) => ""; // implementation required
    private List<string> ExtractKeywords(string content) => new(); // implementation required
}

// Registration
services.AddTransient<IContentExtractor, CustomContentExtractor>();

πŸ“š λ¬Έμ„œ 및 κ°€μ΄λ“œ

πŸ“– μ£Όμš” λ¬Έμ„œ
