Crumpled.RobotsTxt
3.1.0-beta.1
dotnet add package Crumpled.RobotsTxt --version 3.1.0-beta.1
NuGet\Install-Package Crumpled.RobotsTxt -Version 3.1.0-beta.1
<PackageReference Include="Crumpled.RobotsTxt" Version="3.1.0-beta.1" />
<PackageVersion Include="Crumpled.RobotsTxt" Version="3.1.0-beta.1" />
<PackageReference Include="Crumpled.RobotsTxt" />
paket add Crumpled.RobotsTxt --version 3.1.0-beta.1
#r "nuget: Crumpled.RobotsTxt, 3.1.0-beta.1"
#:package Crumpled.RobotsTxt@3.1.0-beta.1
#addin nuget:?package=Crumpled.RobotsTxt&version=3.1.0-beta.1&prerelease
#tool nuget:?package=Crumpled.RobotsTxt&version=3.1.0-beta.1&prerelease
Crumpled.RobotsTxt
A flexible, configuration-driven robots.txt solution for Umbraco v13, v14, v15, v16 & v17 that protects your non-production environments from search engine indexing by default, while giving you granular control over crawling rules across multiple sites and environments.
Key Features
- Safe by Default - Blocks all bots by default to prevent accidental indexing of development, staging, or preview environments
- Multi-Site & Environment-Aware - Configure different robots.txt rules for different domains/hostnames and environments (Production, Development, Staging, etc.)
- Flexible Rule Configuration - Define reusable rulesets with Allow/Disallow patterns for different user agents
- Content Signals Support - Control AI training and content usage with Content Signals directives
- Hot Reload - Configuration changes are automatically picked up without requiring an application restart
- Sitemap Integration - Include sitemap URLs per site
- Umbraco Cloud Ready - Default behaviour designed for Umbraco Cloud - perfect for hiding those often overlooked *.umbraco.io environment domains
- Zero Code Setup - Works out of the box with auto-registration
Install NuGet package
dotnet add package Crumpled.RobotsTxt
Setup
The package automatically registers itself via an Umbraco Composer. No code changes required!
Manual Registration (Advanced)
If you prefer to register the package manually in Program.cs, disable the composer:
"Crumpled": {
"RobotsTxt": {
"DisableComposer": true
}
}
Then add the following to your Program.cs:
.AddCrumpledRobotsTxt()
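For orientation, here is a minimal Program.cs sketch showing where the call could sit; treat it as an assumption rather than the package's documented wiring - it simply chains AddCrumpledRobotsTxt() onto the standard Umbraco v13+ builder and pairs it with "DisableComposer": true above.

// Minimal sketch, assuming AddCrumpledRobotsTxt() is chained onto the Umbraco builder
var builder = WebApplication.CreateBuilder(args);

builder.CreateUmbracoBuilder()
    .AddBackOffice()
    .AddWebsite()
    .AddComposers()
    .AddCrumpledRobotsTxt()   // manual registration (composer disabled in appsettings.json)
    .Build();

var app = builder.Build();
await app.BootUmbracoAsync();

// ... standard Umbraco middleware and endpoint configuration ...

await app.RunAsync();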
Default Behavior - Protection First
The package prioritizes protecting your content from unintended indexing. When no Sites are configured, smart defaults kick in:
- Custom Default: If you specify a DefaultRuleset, that ruleset will be used as the fallback
- Umbraco Cloud Live Environment: If the environment variable UMBRACO__CLOUD__DEPLOY__ENVIRONMENTNAME equals "live", all bots are allowed by default:
  User-agent: *
  Allow: /
- All Other Environments: All bots are blocked by default for safety - protecting staging, development, and preview environments:
  User-agent: *
  Disallow: /
⚠️ Note: Once you configure Sites, these defaults are ignored and your custom RuleSets take full control.
Unmatched Domains - Additional Protection
When Sites are configured, any domain that doesn't match the configured HostNames (e.g., temporary preview URLs, forgotten subdomains) will get a protective fallback:
- Custom Default: If you specify a DefaultRuleset, that ruleset will be used
- Otherwise: Blocks all bots for safety:
  User-agent: *
  Disallow: /
This prevents unintended crawling of staging, preview, or other unlisted domains - ensuring only your explicitly configured production domains are indexed.
Configuration Example - Multi-Site Setup
Configure different robots.txt rules for different environments and domains using reusable rulesets:
"Crumpled": {
"RobotsTxt": {
"DefaultRuleset": "NonProduction",
"RuleSets": { // There can be multiple rulesets for complex scenarios!
"Production": {
"Allow": {
"*" : ["/"]
},
"Disallow": {
"*": [ "/cdn-cgi/challenge-platform/", "/cdn-cgi/email-platform/" ]
}
},
"NonProduction": {
"Allow": {
"SemrushBot": [ "/" ],
"SemrushBot-SA": [ "/" ],
"SemrushBot-Desktop": [ "/" ],
"SemrushBot-Mobile": [ "/" ],
"SiteAuditBot": [ "/" ]
},
"Disallow": {
"*": [ "/" ]
}
}
},
"Sites": {
"Prod": {
"HostNames": "www.mysite.com",
"SiteMapDomain": "www.mysite.com",
"RuleSet": "Production"
},
"AnotherProd": {
"HostNames": "www.anothermysite.com",
"SiteMapDomain": "www.anothermysite.com",
"RuleSet": "Production" // or can define alternate production ruleset for this site
},
"Stage": {
"HostNames": "mysite-staging-uksouth01.umbraco.io,staging.mysite.com",
"SiteMapDomain": "staging.mysite.com",
"RuleSet": "NonProduction"
},
"Dev": {
"HostNames": "mysite-dev-uksouth01.umbraco.io,dev.mysite.com",
"SiteMapDomain": "dev.mysite.com",
"RuleSet": "NonProduction"
}
}
}
}
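With this configuration, a request to https://www.mysite.com/robots.txt would be answered from the Production ruleset, roughly as sketched below. The sitemap line is an assumption - the exact path appended to SiteMapDomain (shown here as /sitemap.xml) and its position in the file are not spelled out above.

User-agent: *
Allow: /

User-agent: *
Disallow: /cdn-cgi/challenge-platform/
Disallow: /cdn-cgi/email-platform/

Sitemap: https://www.mysite.com/sitemap.xml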
Content Signals Support
Content Signals (contentsignals.org) are a Cloudflare-led standard for controlling how automated systems (AI crawlers, search engines) use your content. Content-Signal directives attach to Allow rules only and declare permissions for:
- ai-train: Training or fine-tuning AI models
- search: Building search indexes and providing search results
- ai-input: Inputting content into AI models (RAG, grounding, generative AI search)
How Content Signals Work
Content Signals can be configured at two levels:
- User-agent level (default): A single Content-Signal applies to all Allow paths for that user-agent
- Path-specific level (advanced): Different Content-Signals for different paths under the same user-agent
User-Agent Level Content Signal
When you configure a single ContentSignal for a user-agent, it applies to all Allow paths:
"Allow": {
"googlebot": {
"Paths": ["/blog", "/news"],
"ContentSignal": {
"AiTrain": false,
"Search": true,
"AiInput": false
}
}
}
This generates:
User-agent: googlebot
Content-Signal: /blog ai-train=no, search=yes, ai-input=no
Allow: /blog
Content-Signal: /news ai-train=no, search=yes, ai-input=no
Allow: /news
Each path gets its own Content-Signal directive with the same settings.
Path-Specific Content Signals
For advanced scenarios, you can configure different Content-Signals for different paths under the same user-agent using an array of rules:
"Allow": {
"bingbot": [
{
"Paths": ["/blog", "/news"],
"ContentSignal": {
"AiTrain": true,
"Search": true,
"AiInput": false
}
},
{
"Paths": ["/"],
"ContentSignal": {
"AiTrain": false,
"Search": true,
"AiInput": false
}
}
]
}
This generates:
User-agent: bingbot
Content-Signal: /blog ai-train=yes, search=yes, ai-input=no
Allow: /blog
Content-Signal: /news ai-train=yes, search=yes, ai-input=no
Allow: /news
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /
This allows you to permit AI training on your blog content while restricting it for other areas of your site.
Content Signal Instructions Header
You can optionally include a legal header at the top of your robots.txt that explains the Content Signal terms and conditions. Enable this per ruleset:
"RuleSets": {
"Production": {
"IncludeContentSignalInstructions": true,
"ContentSignal": {
"AiTrain": false,
"Search": true,
"AiInput": false
},
"Allow": {
"*": ["/"]
}
}
}
This adds a comprehensive header explaining the Content Signal license terms, including references to EU Directive 2019/790 on copyright. The instructions clarify:
- What constitutes agreement (yes) and restriction (no)
- Definitions of search, ai-input, and ai-train
- Legal basis under EU copyright law
Note: Only enable this if you're using Content Signals in that ruleset, as it adds ~35 lines to the top of your robots.txt.
Configuration within RuleSets
Content Signals are configured within RuleSets and only apply to Allow directives. Disallow rules never include Content-Signal directives.
Default ContentSignal for All Allow Rules
Configure a default ContentSignal at the RuleSet level that applies to all Allow rules:
"RuleSets": {
"Production": {
"ContentSignal": {
"AiTrain": true,
"Search": true,
"AiInput": true
},
"Allow": {
"*": ["/"],
"Googlebot": ["/"]
},
"Disallow": {
"*": ["/admin/"]
}
}
}
Both * and Googlebot Allow rules will get the same Content-Signal.
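Following the output patterns shown elsewhere in this document, that configuration would generate something along these lines (specific agents before the wildcard, Disallow in its own block):

User-agent: Googlebot
Content-Signal: ai-train=yes, search=yes, ai-input=yes
Allow: /

User-agent: *
Content-Signal: ai-train=yes, search=yes, ai-input=yes
Allow: /

User-agent: *
Disallow: /admin/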
Agent-Specific ContentSignal
Override the default ContentSignal for specific user agents:
"RuleSets": {
"Production": {
"IncludeContentSignalInstructions": true,
"ContentSignal": {
"AiTrain": true,
"Search": true,
"AiInput": true
},
"Allow": {
"OAI-SearchBot": {
"Paths": ["/"],
"ContentSignal": {
"AiTrain": false,
"Search": true,
"AiInput": false
}
},
"googlebot": {
"Paths": ["/blog", "/news"],
"ContentSignal": {
"AiTrain": false,
"Search": true,
"AiInput": false
}
},
"*": ["/"]
},
"Disallow": {
"*": ["/cdn-cgi/"]
}
}
}
This generates:
# As a condition of accessing this website, you agree to abide by
# the following content signals:
# ... (legal header text) ...
User-agent: googlebot
Content-Signal: /blog ai-train=no, search=yes, ai-input=no
Allow: /blog
Content-Signal: /news ai-train=no, search=yes, ai-input=no
Allow: /news
User-agent: OAI-SearchBot
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /
User-agent: *
Content-Signal: ai-train=yes, search=yes, ai-input=yes
Allow: /
User-agent: *
Disallow: /cdn-cgi/
Notice:
- Specific user agents (googlebot, OAI-SearchBot) appear before the wildcard *
- Each user-agent gets its own ContentSignal - googlebot and OAI-SearchBot have restricted permissions, while * allows everything
- The legal header is included because IncludeContentSignalInstructions: true
- Each path gets its own Content-Signal directive paired with its Allow directive
Simple and Complex Allow Rules
You can mix simple array format and complex object format in the same RuleSet:
- Simple format: "UserAgent": ["/path1", "/path2"] - uses the default ContentSignal from the RuleSet
- Complex format: "UserAgent": { "Paths": [...], "ContentSignal": {...} } - uses an agent-specific ContentSignal
"RuleSets": {
"Production": {
"ContentSignal": {
"AiTrain": true,
"Search": true,
"AiInput": true
},
"Allow": {
"Googlebot": ["/"], // Simple - uses default ContentSignal (ai-train=yes)
"OAI-SearchBot": { // Complex - overrides with ai-train=no
"Paths": ["/"],
"ContentSignal": {
"AiTrain": false,
"Search": true,
"AiInput": false
}
}
}
}
}
This lets you set a permissive default for most bots while restricting specific ones like AI search crawlers.
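Applying the same generation rules, the mixed configuration above would produce output roughly like:

User-agent: Googlebot
Content-Signal: ai-train=yes, search=yes, ai-input=yes
Allow: /

User-agent: OAI-SearchBot
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /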
User-Agent Ordering
Specific user agents are always ordered alphabetically and appear before the wildcard *. This follows robots.txt best practices where more specific rules should be evaluated before general rules.
"Allow": {
"OAI-SearchBot": ["/"],
"googlebot": ["/blog"],
"*": ["/"]
}
Will always output in this order:
User-agent: googlebot
...
User-agent: OAI-SearchBot
...
User-agent: *
...
Common Policies
Allow Search Only (no AI):
"ContentSignal": {
"AiTrain": false,
"Search": true,
"AiInput": false
}
Allow Search & AI Input (no training):
"ContentSignal": {
"AiTrain": false,
"Search": true,
"AiInput": true
}
Allow All:
"ContentSignal": {
"AiTrain": true,
"Search": true,
"AiInput": true
}
Disallow All (most restrictive):
"ContentSignal": {
"AiTrain": false,
"Search": false,
"AiInput": false
}
Crawl-delay
Control how frequently crawlers can request pages from your site on a per-user-agent basis. The Crawl-delay directive requests crawlers to wait a specified number of seconds between successive requests.
Configuration:
Crawl-delay is configured per user-agent in the Allow rules using the complex format:
"RuleSets": {
"Production": {
"Allow": {
"Googlebot": {
"Paths": ["/"],
"CrawlDelay": 10
},
"Bingbot": {
"Paths": ["/"],
"CrawlDelay": 5
}
}
}
}
This generates:
User-agent: Bingbot
Crawl-delay: 5
Allow: /
User-agent: Googlebot
Crawl-delay: 10
Allow: /
Notes:
- Crawl-delay is specified in seconds (integer)
- Only available in the complex Allow rule format (not simple string array)
- If a user-agent appears in both Allow and Disallow with different Crawl-delay values, Allow takes precedence
- Not all crawlers respect Crawl-delay (Google and Bing use their own rate limiting via Search Console/Webmaster Tools)
- Typical values: 1-10 seconds for busy sites, 1-2 seconds for moderate traffic
Use cases:
- Protect server resources during peak traffic
- Slow down aggressive crawlers
- Different rates for different bots (e.g., slower for less important crawlers)
Cache Control
Control how long browsers and crawlers should cache the robots.txt file using the MaxAge property. This sets the Cache-Control: max-age header in the HTTP response.
"Crumpled": {
"RobotsTxt": {
"MaxAge": "1.00:00:00" // 1 day (default)
}
}
TimeSpan Format Examples:
"1.00:00:00"- 1 day (default)"12:00:00"- 12 hours"00:30:00"- 30 minutes"7.00:00:00"- 7 days
Default: 1 day (86,400 seconds)
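As a concrete illustration, the default of one day is 24 x 3,600 = 86,400 seconds, so the robots.txt response would carry a header along the lines of the following (whether additional directives such as public are emitted is not covered here):

Cache-Control: max-age=86400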
Why it matters:
- robots.txt doesn't change frequently, so longer cache times reduce server load
- Search engines and crawlers respect cache headers to minimize repeated requests
- Shorter cache times allow faster propagation of rule changes if needed
Recommended values:
- Production sites: 1.00:00:00 to 7.00:00:00 (1-7 days) - rules rarely change
- Active development: 00:30:00 to 01:00:00 (30 minutes - 1 hour) - faster updates during testing
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies
net8.0
- Umbraco.Cms.Web.Website (>= 13.0.0)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 3.1.0-beta.1 | 36 | 3/18/2026 |
| 3.0.3 | 253 | 2/14/2026 |
| 3.0.2 | 116 | 2/12/2026 |
| 3.0.1 | 90 | 2/12/2026 |
| 3.0.0 | 96 | 2/12/2026 |