Rabbit Hole

Rabbit Hole is a small, deterministic web text scraper for .NET. It discovers links under a root URL and extracts readable text from HTML pages. The output is a Markdown-like string suited to indexing, summarization, or offline processing.

Use cases

  • Build a lightweight search index for a site
  • Feed content into an LLM or summarization pipeline
  • Snapshot documentation pages for offline use (see the sketch after this list)
  • Validate a sitemap against actual in-page links
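
For instance, the snapshot use case needs only the two APIs described below; a minimal sketch, where the output folder and file-naming scheme are illustrative:

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
var outputDir = Directory.CreateDirectory("snapshot").FullName; // illustrative target folder

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com/docs", depth: 2))
{
    var page = await scraper.ScrapeContentAsync(link);

    if (page.Content is null)
    {
        continue; // non-HTML responses carry null content
    }

    // Flatten the URL into a file name; naive, for illustration only.
    var fileName = Uri.EscapeDataString(link) + ".md";
    await File.WriteAllTextAsync(Path.Combine(outputDir, fileName), page.Content);
}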

Features

  • Async breadth-first link discovery with de-duplication
  • Scope control to the root URL prefix
  • Skips common non-HTML assets by extension
  • HTML-only parsing based on Content-Type
  • Metadata extraction: title, meta description, meta keywords
  • Markdown-like content output for headings, paragraphs, and lists
  • Inline links and images preserved in the output
  • Cancellation support for long-running crawls

Requirements

  • .NET 10 (net10.0)

Install

dotnet add package ChatAIze.RabbitHole

Quick start

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 2))
{
    Console.WriteLine(link);
}

var page = await scraper.ScrapeContentAsync("https://example.com");
Console.WriteLine(page.Title);
Console.WriteLine(page.Content);

Usage patterns

Crawl and fetch each page

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3))
{
    var page = await scraper.ScrapeContentAsync(link);
    Console.WriteLine($"{page.Url} -> {page.Title}");
}

Cancel a long crawl

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3, cts.Token))
{
    Console.WriteLine(link);
}

Filter links before fetching

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3))
{
    if (!link.Contains("/docs/"))
    {
        continue;
    }

    var page = await scraper.ScrapeContentAsync(link);
    Console.WriteLine(page.Content);
}

Link discovery details

  • The root URL is always yielded first.
  • The crawl is breadth-first; the root is depth 1.
  • Links discovered on a page are yielded immediately.
  • Pages are only fetched if their depth is strictly less than the depth parameter.
    • Example: depth: 2 fetches the root page and yields its links, but does not fetch those links.
    • Example: depth: 3 fetches the root page and each linked page once, but does not go deeper.
  • URLs are normalized by trimming, lowercasing, and removing query strings and fragments (illustrated in the sketch after this list).
  • Only URLs that start with the root URL prefix are considered in-scope.
  • Root-relative links (starting with /) are resolved against the root host.
  • Relative links without a leading slash are ignored.
  • The crawler ignores mailto:, tel:, and anchor-only (#...) links.
  • Responses are only parsed when the Content-Type is text/html.
  • Non-HTML assets are filtered by extension (see WebsiteScraper for the list).
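
These rules can be made concrete with a short stand-alone sketch. Normalize and IsInScope are hypothetical helpers transcribed from the list above, not the library's internals:

// Hypothetical helpers mirroring the documented rules; not the library's own code.
static string Normalize(string url)
{
    var trimmed = url.Trim().ToLowerInvariant();  // trim and lowercase
    var cut = trimmed.IndexOfAny(['?', '#']);     // strip query string and fragment
    return cut >= 0 ? trimmed[..cut] : trimmed;
}

static bool IsInScope(string root, string url) => Normalize(url).StartsWith(Normalize(root));

Console.WriteLine(Normalize("  https://Example.com/Docs?page=2#intro  ")); // https://example.com/docs
Console.WriteLine(IsInScope("https://example.com/docs", "https://example.com/docs-old")); // True (prefix match)

The last line also shows why /docs-old counts as in-scope, as noted under Limitations.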

Content extraction details

  • Non-HTML responses return a PageDetails instance with null metadata and content (see the sketch after this list).
  • Standard metadata is extracted when available:
    • <title>
    • <meta name="description">
    • <meta name="keywords">
  • Content is selected from article, main, or div.content, falling back to the entire document.
  • Output is a Markdown-like text representation:
    • Headings h1-h6 map to #-style headings
    • Paragraphs become plain text with inline links and images preserved
    • Lists become - or numbered list items
  • Whitespace is collapsed to keep the output readable.
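
Because every metadata field is nullable, guard before use; a minimal sketch:

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
var page = await scraper.ScrapeContentAsync("https://example.com");

// Any field may be null: a missing tag, or a non-HTML response.
Console.WriteLine(page.Title ?? "(no title)");
Console.WriteLine(page.Description ?? "(no description)");
Console.WriteLine(page.Keywords ?? "(no keywords)");
Console.WriteLine(page.Content ?? "(no content)");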

Output format

The output is Markdown-like and optimized for readability, not strict Markdown compliance.

# Welcome

This is a [link](https://example.com/about).

- First item
- Second item

Error handling and resiliency

  • ScrapeLinksAsync performs best-effort crawling and skips pages that fail to load or parse.
  • ScrapeContentAsync throws HttpRequestException for non-success status codes; a defensive wrapper is sketched after this list.
  • Cancellation is honored during link crawling and during content fetches.
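
A crawl that survives individual page failures can catch HttpRequestException per page and keep going; a sketch with an arbitrary two-minute budget:

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3, cts.Token))
{
    try
    {
        var page = await scraper.ScrapeContentAsync(link, cts.Token);
        Console.WriteLine($"{page.Url} -> {page.Title}");
    }
    catch (HttpRequestException ex)
    {
        // Non-success status codes surface here; log and move on.
        Console.Error.WriteLine($"Skipped {link}: {ex.Message}");
    }
}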

Limitations and notes

  • No JavaScript rendering; content must be present in the HTML response.
  • No robots.txt handling or rate limiting is built in. Be mindful when crawling; a politeness sketch follows this list.
  • Lowercasing and query/fragment removal may collapse distinct URLs on case-sensitive servers.
  • In-scope checks use a simple string prefix; paths like /docs and /docs-old are both treated as in-scope.
  • Root-relative URLs are resolved with scheme and host only, which drops non-default ports.
  • Only anchor tags (<a href=...>) are used for link discovery.
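
Two of these gaps are easy to cover at the call site. A sketch that tightens the prefix with a trailing slash and adds a crude delay between fetches (the half-second value is arbitrary):

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
const string root = "https://example.com/docs";

await foreach (var link in scraper.ScrapeLinksAsync(root, depth: 3))
{
    // A trailing slash excludes siblings like /docs-old that the prefix check lets through.
    if (link != root && !link.StartsWith(root + "/"))
    {
        continue;
    }

    var page = await scraper.ScrapeContentAsync(link);
    Console.WriteLine(page.Title);

    await Task.Delay(TimeSpan.FromMilliseconds(500)); // crude politeness delay
}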

API reference

WebsiteScraper

public async IAsyncEnumerable<string> ScrapeLinksAsync(
    string url,
    int depth = 2,
    CancellationToken cancellationToken = default)

public async ValueTask<PageDetails> ScrapeContentAsync(
    string url,
    CancellationToken cancellationToken = default)

PageDetails

public sealed record PageDetails(
    string Url,
    string? Title,
    string? Description,
    string? Keywords,
    string? Content);
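
PageDetails is a positional record, so it can be deconstructed directly:

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
var page = await scraper.ScrapeContentAsync("https://example.com");

// Positional records get a compiler-generated Deconstruct method.
var (url, title, description, keywords, content) = page;
Console.WriteLine($"{url}: {title ?? "(untitled)"}");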

Development

Build the library:

dotnet build

Run the preview app:

dotnet run --project ChatAIze.RabbitHole.Preview

License

GPL-3.0-or-later. See LICENSE.txt.
