ChatAIze.RabbitHole 0.3.0

.NET CLI:
dotnet add package ChatAIze.RabbitHole --version 0.3.0

Package Manager Console in Visual Studio (uses the NuGet module's version of Install-Package):
NuGet\Install-Package ChatAIze.RabbitHole -Version 0.3.0

PackageReference (for projects that support it, copy this XML node into the project file):
<PackageReference Include="ChatAIze.RabbitHole" Version="0.3.0" />

Central Package Management (copy the PackageVersion node into the solution's Directory.Packages.props file and the PackageReference node into the project file):
<PackageVersion Include="ChatAIze.RabbitHole" Version="0.3.0" />
<PackageReference Include="ChatAIze.RabbitHole" />

Paket CLI:
paket add ChatAIze.RabbitHole --version 0.3.0

F# Interactive and Polyglot Notebooks (copy into the interactive tool or script source):
#r "nuget: ChatAIze.RabbitHole, 0.3.0"

C# file-based apps (.NET 10 preview 4 and later; place before any lines of code):
#:package ChatAIze.RabbitHole@0.3.0

Cake:
#addin nuget:?package=ChatAIze.RabbitHole&version=0.3.0
#tool nuget:?package=ChatAIze.RabbitHole&version=0.3.0

The NuGet Team does not provide support for the Paket or Cake clients; contact their maintainers for support.
Rabbit Hole
Rabbit Hole is a small, deterministic web text scraper for .NET. It discovers links within a root URL and extracts readable text from HTML pages. The output is a Markdown-like string suited for indexing, summarization, or offline processing.
Use cases
- Build a lightweight search index for a site (see the sketch after this list)
- Feed content into an LLM or summarization pipeline
- Snapshot documentation pages for offline use
- Validate a sitemap against actual in-page links
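As a minimal sketch of the search-index use case, pages can be collected into an in-memory list and queried with a naive substring search (the index shape and the search itself are illustrative, not part of the library; see Quick start below for the API):

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
var index = new List<(string Url, string? Title, string Content)>();

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 2))
{
    var page = await scraper.ScrapeContentAsync(link);

    // Skip non-HTML responses, which come back with null content.
    if (page.Content is not null)
    {
        index.Add((page.Url, page.Title, page.Content));
    }
}

// Naive substring search over the collected pages.
foreach (var (url, title, content) in index)
{
    if (content.Contains("install", StringComparison.OrdinalIgnoreCase))
    {
        Console.WriteLine($"{url} -> {title}");
    }
}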
Features
- Async breadth-first link discovery with de-duplication
- Scope control to the root URL prefix
- Skips common non-HTML assets by extension
- HTML-only parsing based on Content-Type
- Metadata extraction: title, meta description, meta keywords
- Markdown-like content output for headings, paragraphs, and lists
- Inline links and images preserved in the output
- Cancellation support for long-running crawls
Requirements
- .NET 10 (net10.0)
Install
dotnet add package ChatAIze.RabbitHole
Quick start
using ChatAIze.RabbitHole;
var scraper = new WebsiteScraper();
await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 2))
{
Console.WriteLine(link);
}
var page = await scraper.ScrapeContentAsync("https://example.com");
Console.WriteLine(page.Title);
Console.WriteLine(page.Content);
Usage patterns
Crawl links, then fetch content
using ChatAIze.RabbitHole;
var scraper = new WebsiteScraper();
await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3))
{
var page = await scraper.ScrapeContentAsync(link);
Console.WriteLine($"{page.Url} -> {page.Title}");
}
Cancel a long crawl
using ChatAIze.RabbitHole;
var scraper = new WebsiteScraper();
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3, cts.Token))
{
Console.WriteLine(link);
}
Filter links before scraping content
using ChatAIze.RabbitHole;
var scraper = new WebsiteScraper();
await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3))
{
if (!link.Contains("/docs/"))
{
continue;
}
var page = await scraper.ScrapeContentAsync(link);
Console.WriteLine(page.Content);
}
Link discovery details
- The root URL is always yielded first.
- The crawl is breadth-first; the root is depth 1.
- Links discovered on a page are yielded immediately.
- Pages are only fetched if their depth is strictly less than the depth parameter.
  - Example: depth: 2 fetches the root page and yields its links, but does not fetch those links.
  - Example: depth: 3 fetches the root page and each linked page once, but does not go deeper.
- URLs are normalized by trimming, lowercasing, and removing query strings and fragments (see the sketch after this list).
- Only URLs that start with the root URL prefix are considered in-scope.
- Root-relative links (starting with /) are resolved against the root host.
- Relative links without a leading slash are ignored.
- The crawler ignores mailto:, tel:, and anchor-only (#...) links.
- Responses are only parsed when the Content-Type is text/html.
- Non-HTML assets are filtered by extension (see WebsiteScraper for the list).
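For illustration, the normalization rules above can be approximated with a helper like this (a sketch of the documented behavior, not the library's internal code; NormalizeUrl is a hypothetical name):

// Trim, lowercase, and strip query strings and fragments,
// mirroring the normalization described above.
static string NormalizeUrl(string url)
{
    url = url.Trim().ToLowerInvariant();

    var queryIndex = url.IndexOf('?');
    if (queryIndex >= 0)
    {
        url = url[..queryIndex];
    }

    var fragmentIndex = url.IndexOf('#');
    if (fragmentIndex >= 0)
    {
        url = url[..fragmentIndex];
    }

    return url;
}

// "https://example.com/Docs?page=2#intro" -> "https://example.com/docs"
Console.WriteLine(NormalizeUrl("https://example.com/Docs?page=2#intro"));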
Content extraction details
- Non-HTML responses return a PageDetails instance with null metadata and content (see the sketch after this list).
- Standard metadata is extracted when available: <title>, <meta name="description">, and <meta name="keywords">.
- Content is selected from article, main, or div.content, falling back to the entire document.
- Output is a Markdown-like text representation:
  - Headings h1-h6 map to #-style headings.
  - Paragraphs become plain text with inline links and images preserved.
  - Lists become - or numbered list items.
- Whitespace is collapsed to keep the output readable.
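Because non-HTML responses come back with null metadata and content, callers can guard before using the result. A minimal sketch (the PDF URL is just a stand-in for any non-HTML resource):

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
var page = await scraper.ScrapeContentAsync("https://example.com/report.pdf");

if (page.Content is null)
{
    // Non-HTML response: metadata and content are null.
    Console.WriteLine($"Skipping {page.Url}: no HTML content.");
}
else
{
    Console.WriteLine(page.Content);
}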
Output format
The output is Markdown-like and optimized for readability, not strict Markdown compliance.
# Welcome
This is a [link](https://example.com/about).
- First item
- Second item
Error handling and resiliency
- ScrapeLinksAsync performs best-effort crawling and skips pages that fail to load or parse.
- ScrapeContentAsync throws HttpRequestException for non-success status codes.
- Cancellation is honored during link crawling and during content fetches.
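A minimal sketch of handling a failed content fetch, based on the behavior above:

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

try
{
    var page = await scraper.ScrapeContentAsync("https://example.com/missing-page");
    Console.WriteLine(page.Title);
}
catch (HttpRequestException ex)
{
    // ScrapeContentAsync throws for non-success status codes (e.g. 404).
    Console.WriteLine($"Fetch failed: {ex.Message}");
}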
Limitations and notes
- No JavaScript rendering; content must be present in the HTML response.
- No robots.txt handling or rate limiting is built in; be mindful when crawling (see the throttling sketch after this list).
- Lowercasing and query/fragment removal may collapse distinct URLs on case-sensitive servers.
- In-scope checks use a simple string prefix; paths like /docs and /docs-old are both treated as in-scope.
- Root-relative URLs are resolved with scheme and host only, which drops non-default ports.
- Only anchor tags (<a href=...>) are used for link discovery.
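Because rate limiting is not built in, callers crawling third-party sites may want to add their own delay between requests. One possible sketch (the 500 ms delay is an arbitrary example; tune it to the target site):

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3))
{
    var page = await scraper.ScrapeContentAsync(link);
    Console.WriteLine(page.Title);

    // Simple politeness delay between content fetches.
    await Task.Delay(TimeSpan.FromMilliseconds(500));
}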
API reference
WebsiteScraper
public async IAsyncEnumerable<string> ScrapeLinksAsync(
string url,
int depth = 2,
CancellationToken cancellationToken = default)
public async ValueTask<PageDetails> ScrapeContentAsync(
string url,
CancellationToken cancellationToken = default)
PageDetails
public sealed record PageDetails(
string Url,
string? Title,
string? Description,
string? Keywords,
string? Content);
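As a positional record, PageDetails supports deconstruction, so callers can unpack the fields directly:

using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

// Records generate a Deconstruct method for their positional parameters.
var (url, title, description, keywords, content) = await scraper.ScrapeContentAsync("https://example.com");

Console.WriteLine($"{url}: {title}");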
Development
Build the library:
dotnet build
Run the preview app:
dotnet run --project ChatAIze.RabbitHole.Preview
License
GPL-3.0-or-later. See LICENSE.txt.
Compatibility

| Product | Compatible and computed target frameworks |
|---|---|
| .NET | net10.0 is compatible. net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, and net10.0-windows were computed. |
Dependencies

net10.0:
- HtmlAgilityPack (>= 1.12.4)
Version history

| Version | Downloads | Last Updated |
|---|---|---|
| 0.3.0 | 55 | 12/20/2025 |
| 0.2.11 | 231 | 11/14/2025 |
| 0.2.10 | 318 | 9/18/2025 |
| 0.2.9 | 247 | 4/24/2025 |
| 0.2.8 | 245 | 3/19/2025 |
| 0.2.7 | 215 | 11/17/2024 |
| 0.2.6 | 190 | 11/13/2024 |
| 0.2.5 | 219 | 10/19/2024 |
| 0.2.4 | 170 | 10/8/2024 |
| 0.2.3 | 177 | 9/28/2024 |
| 0.2.2 | 149 | 9/27/2024 |
| 0.2.1 | 160 | 9/27/2024 |
| 0.2.0 | 181 | 9/27/2024 |
| 0.1.1 | 180 | 9/26/2024 |
| 0.1.0 | 167 | 9/26/2024 |