RobotsTxtParser 2025.9.4
RobotsTxtParser
A simple .NET 9 library for parsing robots.txt files.

Supports:
- Reading and parsing User-agent, Allow, Disallow, Crawl-delay, and Sitemap directives
- Checking whether a given URL path is allowed for a particular user-agent, with three different “Allow vs Disallow” strategies
- Retrieving crawl-delays (as TimeSpan) for specified user-agents, with optional fallback overloads
- Collecting all Sitemap URLs declared in a robots.txt
The parser is fully immutable after construction and exposes a clean, intuitive API.
Table of Contents
- Installation
- Quick Start
- API Reference
- Usage Examples
- Notes & Caveats
- License
Installation
Via .NET CLI
dotnet add package RobotsTxtParser --version 2025.9.4
Via Package Manager (Visual Studio)
Install-Package RobotsTxtParser -Version 2025.9.4
Direct Reference
If you prefer a local project reference, copy the RobotsTxtParser folder into your solution and add it as a project reference. The library targets net9.0 with C# 11 and has <Nullable>enable</Nullable> by default.
Quick Start
using RobotsTxtParser;
// Suppose you have robots.txt content as a string:
string robotsTxtContent = @"
User-agent: Googlebot
Disallow: /private
Allow: /public
User-agent: *
Disallow: /tmp
Crawl-delay: 1.5
Sitemap: https://example.com/sitemap.xml
";
// Parse it:
var robots = Robots.Load(robotsTxtContent);
// Check if "/public/page.html" is allowed for "Googlebot":
bool canGooglebotAccess = robots.IsPathAllowed("Googlebot", "/public/page.html");
// Check crawl-delay for a generic crawler:
TimeSpan defaultDelay = robots.CrawlDelay("SomeOtherBot");
// Retrieve all sitemap URLs:
foreach (var site in robots.Sitemaps)
{
    if (site.Url != null)
        Console.WriteLine($"Valid sitemap URL: {site.Url}");
    else
        Console.WriteLine($"Malformed sitemap entry: {site.Value}");
}
For more use cases, review the unit tests in the RobotsTxtParser.Tests project.
API Reference
Robots class
namespace RobotsTxtParser
{
    public class Robots : IRobotsParser
    {
        // Properties
        public string Raw { get; }                          // Original robots.txt content
        public List<Sitemap> Sitemaps { get; private set; }
        public bool Malformed { get; private set; }         // True if any line was malformed
        public bool HasRules { get; private set; }          // True if ≥1 Access or Crawl-delay rule parsed
        public bool IsAnyPathDisallowed { get; private set; }
        public AllowRuleImplementation AllowRuleImplementation { get; set; }

        // Static factory
        public static Robots Load(string content);

        // IRobotsParser implementation
        public bool IsPathAllowed(string userAgent, string path);
        public TimeSpan CrawlDelay(string userAgent);
        public TimeSpan CrawlDelay(string userAgent, int fallbackAmount);
        public TimeSpan CrawlDelay(string userAgent, TimeSpan fallbackAmount);
    }
}
Load(string content)
Parses the entire robots.txt content and returns a Robots instance. If content is null or whitespace, no rules are parsed and HasRules == false.

IsPathAllowed(string userAgent, string path) : bool
Returns true if the given path is allowed for the specified userAgent, after normalizing path. Throws ArgumentException if userAgent is null/empty/whitespace. If there are no rules, or if no Disallow rules exist, always returns true. The logic respects the chosen AllowRuleImplementation.

CrawlDelay(string userAgent) : TimeSpan
Returns the crawl-delay (in milliseconds, as a TimeSpan) for the userAgent. Throws ArgumentException if userAgent is null/empty/whitespace. If no crawl-delay rule matches, returns TimeSpan.Zero. Specific rules are checked first; if none match, the global (*) rule is used.

CrawlDelay(string userAgent, int fallbackAmount) : TimeSpan
Same as CrawlDelay(string), but if no matching rule (specific or global) is found, returns TimeSpan.FromMilliseconds(fallbackAmount) instead of zero.

CrawlDelay(string userAgent, TimeSpan fallbackAmount) : TimeSpan
Same as above, but the fallback is a TimeSpan directly. If no rule is found, returns fallbackAmount.

Raw
The unmodified string passed into Load(...).

Sitemaps
A list of Sitemap objects representing each Sitemap: directive.

Malformed
true if at least one line was out of expected context (e.g. Disallow before any User-agent, or an unrecognized directive). Parsed valid rules still apply.

HasRules
true if at least one Allow/Disallow or Crawl-delay directive was successfully recorded under some User-agent.

IsAnyPathDisallowed
true if there is at least one Disallow with a non-empty path (i.e. not “Disallow: ”).

AllowRuleImplementation
Determines how to resolve conflicts when multiple Allow/Disallow rules match a path. Default is MoreSpecific.
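As a quick illustration of the status properties above, here is a minimal sketch; the robots.txt content is illustrative, and the expected values follow from the property descriptions rather than from a particular run:

using System;
using RobotsTxtParser;

var robots = Robots.Load("User-agent: *\nDisallow: /private\nCrawl-delay: 2");

Console.WriteLine(robots.Raw);                      // the exact input string, unmodified
Console.WriteLine(robots.HasRules);                 // True  - Disallow and Crawl-delay were recorded under a User-agent
Console.WriteLine(robots.IsAnyPathDisallowed);      // True  - "/private" is a non-empty Disallow path
Console.WriteLine(robots.AllowRuleImplementation);  // MoreSpecific (the default)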
IRobotsParser interface
namespace RobotsTxtParser
{
    public interface IRobotsParser
    {
        bool IsPathAllowed(string userAgent, string path);
        TimeSpan CrawlDelay(string userAgent);
        TimeSpan CrawlDelay(string userAgent, int fallbackAmount);
        TimeSpan CrawlDelay(string userAgent, TimeSpan fallbackAmount);
    }
}
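Because Robots implements IRobotsParser, callers can depend on the interface alone, which keeps crawler code easy to fake in tests. A minimal sketch; the CanFetch helper is illustrative and not part of the library:

using System;
using RobotsTxtParser;

// Illustrative helper that only depends on the interface.
static bool CanFetch(IRobotsParser parser, string userAgent, string path)
    => parser.IsPathAllowed(userAgent, path);

IRobotsParser parser = Robots.Load("User-agent: *\nDisallow: /private");
Console.WriteLine(CanFetch(parser, "MyBot", "/private/data")); // False
Console.WriteLine(CanFetch(parser, "MyBot", "/public/data"));  // True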
AllowRuleImplementation enum
namespace RobotsTxtParser
{
    public enum AllowRuleImplementation
    {
        Standard,        // Pick the matched rule with the lowest "order" (first seen)
        AllowOverrides,  // If any matching rule is Allow, the path is allowed
        MoreSpecific     // Pick the rule with the longest Path, then by order
    }
}
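For example, under the default MoreSpecific strategy a longer Allow path wins over a shorter Disallow path. A minimal sketch; the rule file is illustrative, and the expected results follow from the longest-path tie-break described above:

using System;
using RobotsTxtParser;

string robotsTxt = @"
User-agent: *
Disallow: /downloads
Allow: /downloads/free
";

var robots = Robots.Load(robotsTxt);
robots.AllowRuleImplementation = AllowRuleImplementation.MoreSpecific; // the default

// "/downloads/free" is the longer matching rule, so it overrides "Disallow: /downloads".
Console.WriteLine(robots.IsPathAllowed("AnyBot", "/downloads/free/tool.zip")); // True
// Only "Disallow: /downloads" matches here, so the path stays blocked.
Console.WriteLine(robots.IsPathAllowed("AnyBot", "/downloads/paid/tool.zip")); // False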
Sitemap class
namespace RobotsTxtParser
{
    public class Sitemap
    {
        public string Value { get; }  // Raw text after "Sitemap:" (never null)
        public Uri? Url { get; }      // Parsed absolute Uri, or null if invalid

        internal static Sitemap FromLine(Line line);
    }
}
Use robots.Sitemaps after calling Robots.Load(...); each item has:
- Value – the exact substring from robots.txt after “Sitemap:”
- Url – a Uri if Value is a well-formed absolute URL; otherwise null.
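If only the well-formed entries are needed, Sitemaps can be filtered with LINQ; a minimal sketch (variable names are illustrative):

using System;
using System.Linq;
using RobotsTxtParser;

var robots = Robots.Load("User-agent: *\nSitemap: https://example.com/sitemap.xml\nSitemap: not_a_real_url");

// Keep only entries whose Value parsed into an absolute Uri.
var validSitemapUris = robots.Sitemaps
    .Where(s => s.Url != null)
    .Select(s => s.Url)
    .ToList();

foreach (var uri in validSitemapUris)
    Console.WriteLine(uri); // prints only https://example.com/sitemap.xml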
Usage Examples
Basic “Allow/Disallow” check
string robotsTxt = @"
User-agent: *
Disallow: /private
Allow: /public
";
var robots = Robots.Load(robotsTxt);
// Default is MoreSpecific
Console.WriteLine(robots.IsPathAllowed("anybot", "/public/index.html")); // True
Console.WriteLine(robots.IsPathAllowed("anybot", "/private/data.txt")); // False
Switching “Allow” rule strategy
string robotsTxt = @"
User-agent: *
Disallow: /foo
Allow: /foo
";
var r = Robots.Load(robotsTxt);
// Standard: pick first-seen → Disallow
r.AllowRuleImplementation = AllowRuleImplementation.Standard;
Console.WriteLine(r.IsPathAllowed("Bot", "/foo")); // False
// AllowOverrides: any Allow wins → allowed
r.AllowRuleImplementation = AllowRuleImplementation.AllowOverrides;
Console.WriteLine(r.IsPathAllowed("Bot", "/foo")); // True
// MoreSpecific: tie-break by order (since both are "/foo")
r.AllowRuleImplementation = AllowRuleImplementation.MoreSpecific;
Console.WriteLine(r.IsPathAllowed("Bot", "/foo")); // False (Disallow first)
Crawl-delay retrieval
string robotsTxt = @"
User-agent: MyBot
Crawl-delay: 4.25
User-agent: *
Crawl-delay: 2
";
var robots = Robots.Load(robotsTxt);
// “MyBot” → 4250 ms
TimeSpan myDelay = robots.CrawlDelay("MyBot");
Console.WriteLine(myDelay.TotalMilliseconds); // 4250
// Other bots → 2000 ms
TimeSpan otherDelay = robots.CrawlDelay("OtherBot");
Console.WriteLine(otherDelay.TotalMilliseconds); // 2000
// If no matching rule (and no global "*"), returns TimeSpan.Zero
var empty = Robots.Load(@"User-agent: BotOnly");
Console.WriteLine(empty.CrawlDelay("BotOnly") == TimeSpan.Zero); // True
Crawl-delay with fallback overloads
string robotsTxt = @"
User-agent: BotA
Crawl-delay: 3
User-agent: *
Crawl-delay: 1
";
var robots = Robots.Load(robotsTxt);
// Specific rule exists (3s):
TimeSpan result1 = robots.CrawlDelay("BotA", 10000);
Console.WriteLine(result1.TotalMilliseconds); // 3000
// No specific for "OtherBot" → global (1s):
TimeSpan result2 = robots.CrawlDelay("OtherBot", 5000);
Console.WriteLine(result2.TotalMilliseconds); // 1000
// If no global either, returns fallback:
var limited = Robots.Load(@"User-agent: BotX
Crawl-delay: 2.5");
TimeSpan fallbackTs = TimeSpan.FromMilliseconds(750);
TimeSpan result3 = limited.CrawlDelay("NoMatch", fallbackTs);
Console.WriteLine(result3.TotalMilliseconds); // 750
Extracting all Sitemap URLs
string robotsTxt = @"
User-agent: *
Sitemap: https://example.com/sitemap1.xml
Sitemap: not_a_real_url
Sitemap: https://cdn.example.com/other-sitemap.xml
";
var robots = Robots.Load(robotsTxt);
foreach (var site in robots.Sitemaps)
{
    Console.WriteLine($"Raw value: '{site.Value}'");
    if (site.Url != null)
        Console.WriteLine($" Parsed URI: {site.Url}");
    else
        Console.WriteLine(" (Invalid URI)");
}
// Output:
// Raw value: 'https://example.com/sitemap1.xml'
// Parsed URI: https://example.com/sitemap1.xml
// Raw value: 'not_a_real_url'
// (Invalid URI)
// Raw value: 'https://cdn.example.com/other-sitemap.xml'
// Parsed URI: https://cdn.example.com/other-sitemap.xml
Handling malformed lines
string content = @"
Disallow: /private # no preceding User-agent → malformed
User-agent: *
Allow: /public
FooBar: /ignored # unknown field → malformed
";
var robots = Robots.Load(content);
Console.WriteLine($"Malformed? {robots.Malformed}"); // True
Console.WriteLine($"HasRules? {robots.HasRules}"); // True (because “Allow” under valid UA)
Console.WriteLine(robots.IsPathAllowed("any", "/private")); // True (early Disallow ignored)
Console.WriteLine(robots.IsPathAllowed("any", "/public")); // True
Notes & Caveats
Normalization of path
IsPathAllowed(...) calls NormalizePath(path):
- Converts null or whitespace to "/".
- Ensures a leading /.
- Collapses repeated // into a single /.
Matching logic strips the leading / before comparing to rule paths.

Wildcard & $ support
- * in a rule path matches any sequence of characters.
- A trailing $ means “end-of-string” match.
- Internally, IsPathMatch(pathWithoutSlash, rulePathWithoutSlash) implements these recursively.
A sketch of the normalization and wildcard behaviour follows this list.

Case-insensitive matching
- Directive names (User-agent, Allow, Disallow, etc.) are matched case-insensitively.
- User-agent value matching (in AccessRule and CrawlDelayRule) is also a case-insensitive substring match.

Malformed lines
- A line is marked malformed if it appears out of context (e.g. Disallow before any User-agent) or if the field name is unrecognized.
- Malformed lines set robots.Malformed = true, but valid rules under a valid User-agent still apply.

Global (*) rules
- A rule with a UserAgent of "*" is stored in the global lists.
- If no specific rule matches a given user-agent, the parser falls back to the global rule.
- If multiple global rules exist, the first one (lowest Order) is used unless MoreSpecific is in effect.
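A minimal sketch of the normalization and wildcard behaviour described above; the rule file is illustrative, and the expected results follow from the bullets rather than from a particular run:

using System;
using RobotsTxtParser;

string robotsTxt = @"
User-agent: *
Disallow: /*.pdf$
Disallow: /tmp
";

var robots = Robots.Load(robotsTxt);

// Missing leading slashes and doubled slashes are normalized before matching.
Console.WriteLine(robots.IsPathAllowed("AnyBot", "tmp//cache"));   // False - treated as "/tmp/cache"
Console.WriteLine(robots.IsPathAllowed("AnyBot", "   "));          // True  - whitespace is treated as "/"

// "*" matches any sequence of characters; a trailing "$" anchors the match to the end of the path.
Console.WriteLine(robots.IsPathAllowed("AnyBot", "/docs/file.pdf"));     // False - matches "/*.pdf$"
Console.WriteLine(robots.IsPathAllowed("AnyBot", "/docs/file.pdf?x=1")); // True  - "$" requires end-of-string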
License
RobotsTxtParser is licensed under the GNU Affero General Public License, version 3 (AGPL-3.0-or-later). See LICENSE for full text.
Commercial Licensing
While RobotsTxtParser is available under the AGPL-3.0 for all free/open-source usage, a separate commercial license is required to incorporate this code into proprietary or closed-source products without adhering to AGPL’s copyleft obligations.
To purchase a commercial license, please contact:
Hossein Esmati
Email: desmati@gmail.com
The commercial license will be provided under mutually agreed terms, which supersede AGPL-3.0 for your proprietary usage.
Product | Compatible and additional computed target framework versions
---|---
.NET | net9.0 is compatible. net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, and net10.0-windows were computed.
Dependencies
- net9.0: No dependencies.
Release Notes
Initial release of RobotsTxtParser v2025.9.4 under AGPL-3.0. Commercial licensing available via email.