Robots.Txt.Parser 1.0.0-rc2

This is a prerelease version of Robots.Txt.Parser. A newer version of this package is available; see the version list below for details.
.NET CLI

dotnet add package Robots.Txt.Parser --version 1.0.0-rc2

Package Manager

NuGet\Install-Package Robots.Txt.Parser -Version 1.0.0-rc2

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

PackageReference

<PackageReference Include="Robots.Txt.Parser" Version="1.0.0-rc2" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

Central Package Management (CPM)

Directory.Packages.props
<PackageVersion Include="Robots.Txt.Parser" Version="1.0.0-rc2" />

Project file
<PackageReference Include="Robots.Txt.Parser" />

For projects that support Central Package Management (CPM), copy the PackageVersion node into the solution Directory.Packages.props file and the versionless PackageReference node into the project file.

Paket CLI

paket add Robots.Txt.Parser --version 1.0.0-rc2

Script & Interactive

#r "nuget: Robots.Txt.Parser, 1.0.0-rc2"

The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package Robots.Txt.Parser@1.0.0-rc2

The #:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Cake

#addin nuget:?package=Robots.Txt.Parser&version=1.0.0-rc2&prerelease
Install as a Cake Addin.

#tool nuget:?package=Robots.Txt.Parser&version=1.0.0-rc2&prerelease
Install as a Cake Tool.

Overview


Parse robots.txt and sitemaps using .NET. Supports the proposed RFC 9309 standard, as well as the following common, non-standard directives:

  • Sitemap
  • Host
  • Crawl-delay
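
For reference, here is a hypothetical robots.txt (not taken from any real site) that combines standard allow/disallow rules with all three non-standard directives:

User-agent: *
Disallow: /admin/
Allow: /admin/help$
Crawl-delay: 5
Host: www.example.com
Sitemap: https://www.example.com/sitemap.xml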

Why Build Yet Another Parser?

Several robots.txt and sitemap parsers already exist; however, they all suffer from a lack of flexibility.

This library is based upon HttpClient, making it familiar, easy to use and adaptable to your needs. Since you have full control over the HttpClient, you can configure custom message handlers to intercept outgoing requests and responses. For example, you may want to add custom headers to a request, configure additional logging or set up a retry policy.
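
For instance, a custom User-Agent header can be set on the HttpClient before it is passed to the client described in the Usage section below (a minimal sketch; the product name and version are purely illustrative):

using var httpClient = new HttpClient();
// identify your crawler politely; "MyCrawler/1.0" is an illustrative value
httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("MyCrawler/1.0");
// this HttpClient can then be passed to the RobotWebClient<TWebsite> shown under Usage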

There is also the possibility to extend this library to support protocols other than HTTP, such as FTP.

Features

Name                                      Supported   Priority
HTTP/HTTPS                                ✔️
FTP/FTPS                                              0.1
Wildcard (*) User-agent                   ✔️
Allow & disallow rules                    ✔️
End-of-match ($) and wildcard (*) paths   ✔️
Sitemap entries                           ✔️
Host directive                            ✔️
Crawl-delay directive                     ✔️
Sitemaps XML format                       ✔️
RSS 2.0 feeds                                         0.8
Atom 0.3/1.0 feeds                                    0.8
Simple text sitemaps                                  0.5
Caching support                                       0.3

Usage

Install the package via NuGet.

dotnet add package Robots.Txt.Parser

Minimal Example

First, create an implementation of IWebsiteMetadata for the host address that you wish to use.

public class GitHubWebsite : IWebsiteMetadata
{
    public static Uri BaseAddress => new("https://www.github.com");
}

Next, create an instance of RobotWebClient<TWebsite>.

With Dependency Injection

public void ConfigureServices(IServiceCollection services)
{
    services.AddHttpClient<IRobotWebClient<GitHubWebsite>, RobotWebClient<GitHubWebsite>>();
}

Without Dependency Injection

using var httpClient = new HttpClient();
var robotWebClient = new RobotWebClient<GitHubWebsite>(httpClient);

Web Crawler Example

Optionally, specify message handlers to modify the HTTP pipeline. For example, if you are crawling a website, you will want to reduce the rate of your requests so that you crawl responsibly. You can achieve this by adding a custom DelegatingHandler to the pipeline.

public class ResponsibleCrawlerHttpClientHandler : DelegatingHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = await base.SendAsync(request, cancellationToken);
        // wait before returning so that consecutive requests are at least one second apart
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        return response;
    }
}

With Dependency Injection

public void ConfigureServices(IServiceCollection services)
{
    services.TryAddTransient<ResponsibleCrawlerHttpClientHandler>();
    services.AddHttpClient<IRobotWebClient<GitHubWebsite>, RobotWebClient<GitHubWebsite>>()
            .AddHttpMessageHandler<ResponsibleCrawlerHttpClientHandler>();
}

Without Dependency Injection

var responsibleHandler = new ResponsibleCrawlerHttpClientHandler
{
    // a DelegatingHandler needs an inner handler that actually sends the request
    InnerHandler = new HttpClientHandler()
};
using var httpClient = new HttpClient(responsibleHandler);
var robotWebClient = new RobotWebClient<GitHubWebsite>(httpClient);

Retrieving the Sitemap

var robotsTxt = await robotWebClient.LoadRobotsTxtAsync();
// providing a DateTime retrieves only sitemap items modified since that date
var modifiedSince = new DateTime(2023, 01, 01);
// sitemaps are scanned recursively and combined into a single Sitemap object;
// even if robots.txt contains no Sitemap directive, a sitemap is looked for at {TWebsite.BaseAddress}/sitemap.xml
var sitemap = await robotsTxt.LoadSitemapAsync(modifiedSince);

Checking a Rule

var robotsTxt = await robotWebClient.LoadRobotsTxtAsync();
// if no rules for the specified User-Agent are present, the wildcard (*) rules are used
var anyRulesDefined = robotsTxt.TryGetRules("SomeBotUserAgent", out var rules);
// even if no wildcard rules exist, an empty rule-checker is returned
var isAllowed = rules.IsAllowed("/some/path");

Getting Preferred Host

var robotsTxt = await robotWebClient.LoadRobotsTxtAsync();
// the host value falls back to the TWebsite.BaseAddress host if no Host directive exists
var hasHostDirective = robotsTxt.TryGetHost(out var host);

Getting Crawl Delay

var robotsTxt = await robotWebClient.LoadRobotsTxtAsync();
// if no Crawl-delay directive exists, crawl delay will be 0
var hasCrawlDelayDirective = robotsTxt.TryGetCrawlDelay(out var crawlDelay);
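
Putting these calls together, a polite fetch loop might look like the sketch below. It uses only the members shown above; the urlsToCrawl collection is hypothetical, and the crawl delay is assumed to be expressed in seconds.

var robotsTxt = await robotWebClient.LoadRobotsTxtAsync();
robotsTxt.TryGetRules("SomeBotUserAgent", out var rules);
robotsTxt.TryGetCrawlDelay(out var crawlDelay);

foreach (var url in urlsToCrawl) // hypothetical IEnumerable<Uri> of candidate pages
{
    // skip anything the site disallows for this user agent
    if (!rules.IsAllowed(url.AbsolutePath))
        continue;

    var response = await httpClient.GetAsync(url);
    // ...process the response...

    // space out consecutive requests; assumes the crawl delay value is in seconds
    await Task.Delay(TimeSpan.FromSeconds(crawlDelay));
}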

Contributing

Issues and pull requests are encouraged. For large or breaking changes, please open an issue first to discuss the proposal before proceeding.

If you find this project useful, please give it a star.

Product compatible and additional computed target framework versions

.NET: net7.0 is compatible. Computed: net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows.

Learn more about Target Frameworks and .NET Standard.

Included target framework(s) in package:

  • net7.0

    • No dependencies.
NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version      Downloads   Last Updated
1.0.0        126         1/28/2026
1.0.0-rc12   94          1/27/2026
1.0.0-rc11   92          1/18/2026
1.0.0-rc10   96          1/15/2026
1.0.0-rc9    107         1/10/2026
1.0.0-rc8    263         9/2/2023
1.0.0-rc7    214         8/28/2023
1.0.0-rc6    200         8/28/2023
1.0.0-rc5    202         8/27/2023
1.0.0-rc4    188         8/27/2023
1.0.0-rc3    199         8/27/2023
1.0.0-rc2    190         8/26/2023
1.0.0-rc1    204         8/26/2023