Rabbit Hole

Rabbit Hole is a small, deterministic web text scraper for .NET. It discovers links under a root URL and extracts readable text from HTML pages. The output is a Markdown-like string suited to indexing, summarization, or offline processing.

Use cases

  • Build a lightweight search index for a site
  • Feed content into an LLM or summarization pipeline
  • Snapshot documentation pages for offline use (see the sketch after this list)
  • Validate a sitemap against actual in-page links
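
For instance, the snapshot use case needs only the two APIs described below; a minimal sketch, where the output folder and file-naming scheme are illustrative:

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
var outputDir = Directory.CreateDirectory("snapshot").FullName; // illustrative target folder

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com/docs", depth: 2))
{
    var page = await scraper.ScrapeContentAsync(link);

    if (page.Content is null)
    {
        continue; // non-HTML responses carry null content
    }

    // Flatten the URL into a file name; naive, for illustration only.
    var fileName = Uri.EscapeDataString(link) + ".md";
    await File.WriteAllTextAsync(Path.Combine(outputDir, fileName), page.Content);
}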

Features

  • Async breadth-first link discovery with de-duplication
  • Scope control to the root URL prefix
  • Skips common non-HTML assets by extension
  • HTML-only parsing based on Content-Type
  • Metadata extraction: title, meta description, meta keywords
  • Markdown-like content output for headings, paragraphs, and lists
  • Inline links and images preserved in the output
  • Cancellation support for long-running crawls

Requirements

  • .NET 10 (net10.0)

Install

dotnet add package ChatAIze.RabbitHole

Quick start

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 2))
{
    Console.WriteLine(link);
}

var page = await scraper.ScrapeContentAsync("https://example.com");
Console.WriteLine(page.Title);
Console.WriteLine(page.Content);

Usage patterns

Crawl and fetch each page

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3))
{
    var page = await scraper.ScrapeContentAsync(link);
    Console.WriteLine($"{page.Url} -> {page.Title}");
}

Cancel a long crawl

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3, cts.Token))
{
    Console.WriteLine(link);
}

Filter links before fetching

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3))
{
    if (!link.Contains("/docs/"))
    {
        continue;
    }

    var page = await scraper.ScrapeContentAsync(link);
    Console.WriteLine(page.Content);
}

Link discovery details

  • The root URL is always yielded first.
  • The crawl is breadth-first; the root is depth 1.
  • Links discovered on a page are yielded immediately.
  • Pages are only fetched if their depth is strictly less than the depth parameter.
    • Example: depth: 2 fetches the root page and yields its links, but does not fetch those links.
    • Example: depth: 3 fetches the root page and each linked page once, but does not go deeper.
  • URLs are normalized by trimming, lowercasing, and removing query strings and fragments (illustrated in the sketch after this list).
  • Only URLs that start with the root URL prefix are considered in-scope.
  • Root-relative links (starting with /) are resolved against the root host.
  • Relative links without a leading slash are ignored.
  • The crawler ignores mailto:, tel:, and anchor-only (#...) links.
  • Responses are only parsed when the Content-Type is text/html.
  • Non-HTML assets are filtered by extension (see WebsiteScraper for the list).
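
These rules can be made concrete with a short stand-alone sketch. Normalize and IsInScope are hypothetical helpers transcribed from the list above, not the library's internals:

// Hypothetical helpers mirroring the documented rules; not the library's own code.
static string Normalize(string url)
{
    var trimmed = url.Trim().ToLowerInvariant();  // trim and lowercase
    var cut = trimmed.IndexOfAny(['?', '#']);     // strip query string and fragment
    return cut >= 0 ? trimmed[..cut] : trimmed;
}

static bool IsInScope(string root, string url) => Normalize(url).StartsWith(Normalize(root));

Console.WriteLine(Normalize("  https://Example.com/Docs?page=2#intro  ")); // https://example.com/docs
Console.WriteLine(IsInScope("https://example.com/docs", "https://example.com/docs-old")); // True (prefix match)

The last line also shows why /docs-old counts as in-scope, as noted under Limitations.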

Content extraction details

  • Non-HTML responses return a PageDetails instance with null metadata and content (see the sketch after this list).
  • Standard metadata is extracted when available:
    • <title>
    • <meta name="description">
    • <meta name="keywords">
  • Content is selected from article, main, or div.content, falling back to the entire document.
  • Output is a Markdown-like text representation:
    • Headings h1-h6 map to #-style headings
    • Paragraphs become plain text with inline links and images preserved
    • Lists become - or numbered list items
  • Whitespace is collapsed to keep the output readable.
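
Because every metadata field is nullable, guard before use; a minimal sketch:

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
var page = await scraper.ScrapeContentAsync("https://example.com");

// Any field may be null: a missing tag, or a non-HTML response.
Console.WriteLine(page.Title ?? "(no title)");
Console.WriteLine(page.Description ?? "(no description)");
Console.WriteLine(page.Keywords ?? "(no keywords)");
Console.WriteLine(page.Content ?? "(no content)");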

Output format

The output is Markdown-like and optimized for readability, not strict Markdown compliance.

# Welcome

This is a [link](https://example.com/about).

- First item
- Second item

Error handling and resiliency

  • ScrapeLinksAsync performs best-effort crawling and skips pages that fail to load or parse.
  • ScrapeContentAsync throws HttpRequestException for non-success status codes; a defensive wrapper is sketched after this list.
  • Cancellation is honored during link crawling and during content fetches.
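
A crawl that survives individual page failures can catch HttpRequestException per page and keep going; a sketch with an arbitrary two-minute budget:

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3, cts.Token))
{
    try
    {
        var page = await scraper.ScrapeContentAsync(link, cts.Token);
        Console.WriteLine($"{page.Url} -> {page.Title}");
    }
    catch (HttpRequestException ex)
    {
        // Non-success status codes surface here; log and move on.
        Console.Error.WriteLine($"Skipped {link}: {ex.Message}");
    }
}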

Limitations and notes

  • No JavaScript rendering; content must be present in the HTML response.
  • No robots.txt handling or rate limiting is built in. Be mindful when crawling; a politeness sketch follows this list.
  • Lowercasing and query/fragment removal may collapse distinct URLs on case-sensitive servers.
  • In-scope checks use a simple string prefix; paths like /docs and /docs-old are both treated as in-scope.
  • Root-relative URLs are resolved with scheme and host only, which drops non-default ports.
  • Only anchor tags (<a href=...>) are used for link discovery.
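
Two of these gaps are easy to cover at the call site. A sketch that tightens the prefix with a trailing slash and adds a crude delay between fetches (the half-second value is arbitrary):

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
const string root = "https://example.com/docs";

await foreach (var link in scraper.ScrapeLinksAsync(root, depth: 3))
{
    // A trailing slash excludes siblings like /docs-old that the prefix check lets through.
    if (link != root && !link.StartsWith(root + "/"))
    {
        continue;
    }

    var page = await scraper.ScrapeContentAsync(link);
    Console.WriteLine(page.Title);

    await Task.Delay(TimeSpan.FromMilliseconds(500)); // crude politeness delay
}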

API reference

WebsiteScraper

public async IAsyncEnumerable<string> ScrapeLinksAsync(
    string url,
    int depth = 2,
    CancellationToken cancellationToken = default)

public async ValueTask<PageDetails> ScrapeContentAsync(
    string url,
    CancellationToken cancellationToken = default)

PageDetails

public sealed record PageDetails(
    string Url,
    string? Title,
    string? Description,
    string? Keywords,
    string? Content);
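
PageDetails is a positional record, so it can be deconstructed directly:

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
var page = await scraper.ScrapeContentAsync("https://example.com");

// Positional records get a compiler-generated Deconstruct method.
var (url, title, description, keywords, content) = page;
Console.WriteLine($"{url}: {title ?? "(untitled)"}");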

Development

Build the library:

dotnet build

Run the preview app:

dotnet run --project ChatAIze.RabbitHole.Preview

License

GPL-3.0-or-later. See LICENSE.txt.
