Crumpled.RobotsTxt 3.1.0-beta.1

This is a prerelease version of Crumpled.RobotsTxt.

Crumpled.RobotsTxt

A flexible, configuration-driven robots.txt solution for Umbraco v13, v14, v15, v16 & v17 that protects your non-production environments from search engine indexing by default, while giving you granular control over crawling rules across multiple sites and environments.

Key Features

  • πŸ›‘οΈ Safe by Default - Blocks all bots by default to prevent accidental indexing of development, staging, or preview environments
  • 🌍 Multi-Site & Environment-Aware - Configure different robots.txt rules for different domains/hostnames and environments (Production, Development, Staging, etc.)
  • πŸ“ Flexible Rule Configuration - Define reusable rulesets with Allow/Disallow patterns for different user agents
  • πŸ€– Content Signals Support - Control AI training and content usage with Content Signals directives
  • πŸ”„ Hot Reload - Configuration changes are automatically picked up without requiring an application restart
  • πŸ—ΊοΈ Sitemap Integration - Include sitemap URLs per site
  • ☁️ Umbraco Cloud Ready - Default behaviour is designed for Umbraco Cloud, perfect for hiding those often-overlooked *.umbraco.io environment domains
  • βš™οΈ Zero Code Setup - Works out of the box with auto-registration

Install NuGet package

dotnet add package Crumpled.RobotsTxt

Setup

The package automatically registers itself via an Umbraco Composer. No code changes required!

Manual Registration (Advanced)

If you prefer to register the package manually in Program.cs, disable the composer:

"Crumpled": {
  "RobotsTxt": {
    "DisableComposer": true
  }
}

Then add to your Program.cs:

.AddCrumpledRobotsTxt()
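
For context, a minimal Program.cs might look like the sketch below. This is an illustration only: the builder chain is the stock Umbraco v13+ template, and only the AddCrumpledRobotsTxt() call comes from this package (its exact position in the chain is an assumption).

```csharp
// Minimal sketch - the surrounding chain is the standard Umbraco template;
// only AddCrumpledRobotsTxt() is specific to this package.
var builder = WebApplication.CreateBuilder(args);

builder.CreateUmbracoBuilder()
    .AddBackOffice()
    .AddWebsite()
    .AddComposers()
    .AddCrumpledRobotsTxt() // manual registration (requires "DisableComposer": true)
    .Build();

var app = builder.Build();
// ... standard Umbraco pipeline configuration ...
await app.RunAsync();
```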

Default Behavior - Protection First

The package prioritizes protecting your content from unintended indexing. When no Sites are configured, smart defaults kick in:

  • Custom Default: If you specify a DefaultRuleset, that ruleset will be used as the fallback

  • Umbraco Cloud Live Environment: If the environment variable UMBRACO__CLOUD__DEPLOY__ENVIRONMENTNAME equals "live", all bots are allowed by default:

    User-agent: *
    Allow: /
    
  • All Other Environments: All bots are blocked by default for safety - protecting staging, development, and preview environments:

    User-agent: *
    Disallow: /
    

⚠️ Note: Once you configure Sites, these defaults are ignored and your custom RuleSets take full control.

Unmatched Domains - Additional Protection

When Sites are configured, any domain that doesn't match the configured HostNames (e.g., temporary preview URLs, forgotten subdomains) will get a protective fallback:

  • Custom Default: If you specify a DefaultRuleset, that ruleset will be used
  • Otherwise: Blocks all bots for safety:
    User-agent: *
    Disallow: /
    

This prevents unintended crawling of staging, preview, or other unlisted domains - ensuring only your explicitly configured production domains are indexed.

Configuration Example - Multi-Site Setup

Configure different robots.txt rules for different environments and domains using reusable rulesets:

"Crumpled": {
  "RobotsTxt": {
    "DefaultRuleset": "NonProduction",
    "RuleSets": { // There can be multiple rulesets for complex scenarios!
      "Production": {
        "Allow": {
          "*" : ["/"]
        },
        "Disallow": {
          "*": [ "/cdn-cgi/challenge-platform/", "/cdn-cgi/email-platform/" ]
        }
      },
      "NonProduction": { 
        "Allow": {
          "SemrushBot": [ "/" ],
          "SemrushBot-SA": [ "/" ],
          "SemrushBot-Desktop": [ "/" ],
          "SemrushBot-Mobile": [ "/" ],
          "SiteAuditBot": [ "/" ]
        },
        "Disallow": {
          "*": [ "/" ]
        }
      }
    },
    "Sites": {
      "Prod": {
        "HostNames": "www.mysite.com",
        "SiteMapDomain": "www.mysite.com",
        "RuleSet": "Production"
      },
      "AnotherProd": {
        "HostNames": "www.anothermysite.com",
        "SiteMapDomain": "www.anothermysite.com",
        "RuleSet": "Production" // or can define alternate production ruleset for this site
      },
      "Stage": {
        "HostNames": "mysite-staging-uksouth01.umbraco.io,staging.mysite.com", 
        "SiteMapDomain": "staging.mysite.com",
        "RuleSet": "NonProduction"
      },
      "Dev": {
        "HostNames": "mysite-dev-uksouth01.umbraco.io,dev.mysite.com",
        "SiteMapDomain": "dev.mysite.com",
        "RuleSet": "NonProduction"
      }
    }
  }
}
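
With this configuration, a request to https://www.mysite.com/robots.txt would produce output along these lines. This is a sketch: the grouping of Allow/Disallow blocks follows the generated-output examples later in this document, and the /sitemap.xml path is an assumed illustration of how SiteMapDomain is used.

```
User-agent: *
Allow: /

User-agent: *
Disallow: /cdn-cgi/challenge-platform/
Disallow: /cdn-cgi/email-platform/

Sitemap: https://www.mysite.com/sitemap.xml
```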

Content Signals Support

Content Signals (contentsignals.org) are Cloudflare's implementation for controlling how automated systems (AI crawlers, search engines) use your content. Content-Signal directives attach only to Allow rules and declare permissions for:

  • ai-train: Training or fine-tuning AI models
  • search: Building search indexes and providing search results
  • ai-input: Inputting content into AI models (RAG, grounding, generative AI search)

How Content Signals Work

Content Signals can be configured at two levels:

  1. User-agent level (default): A single Content-Signal applies to all Allow paths for that user-agent
  2. Path-specific level (advanced): Different Content-Signals for different paths under the same user-agent

User-Agent Level Content Signal

When you configure a single ContentSignal for a user-agent, it applies to all Allow paths:

"Allow": {
  "googlebot": {
    "Paths": ["/blog", "/news"],
    "ContentSignal": {
      "AiTrain": false,
      "Search": true,
      "AiInput": false
    }
  }
}

This generates:

User-agent: googlebot
Content-Signal: /blog ai-train=no, search=yes, ai-input=no
Allow: /blog
Content-Signal: /news ai-train=no, search=yes, ai-input=no
Allow: /news

Each path gets its own Content-Signal directive with the same settings.

Path-Specific Content Signals

For advanced scenarios, you can configure different Content-Signals for different paths under the same user-agent using an array of rules:

"Allow": {
  "bingbot": [
    {
      "Paths": ["/blog", "/news"],
      "ContentSignal": {
        "AiTrain": true,
        "Search": true,
        "AiInput": false
      }
    },
    {
      "Paths": ["/"],
      "ContentSignal": {
        "AiTrain": false,
        "Search": true,
        "AiInput": false
      }
    }
  ]
}

This generates:

User-agent: bingbot
Content-Signal: /blog ai-train=yes, search=yes, ai-input=no
Allow: /blog
Content-Signal: /news ai-train=yes, search=yes, ai-input=no
Allow: /news
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /

This allows you to permit AI training on your blog content while restricting it for other areas of your site.

Content Signal Instructions Header

You can optionally include a legal header at the top of your robots.txt that explains the Content Signal terms and conditions. Enable this per ruleset:

"RuleSets": {
  "Production": {
    "IncludeContentSignalInstructions": true,
    "ContentSignal": {
      "AiTrain": false,
      "Search": true,
      "AiInput": false
    },
    "Allow": {
      "*": ["/"]
    }
  }
}

This adds a comprehensive header explaining the Content Signal license terms, including references to EU Directive 2019/790 on copyright. The instructions clarify:

  • What constitutes agreement (yes) and restriction (no)
  • Definitions of search, ai-input, and ai-train
  • Legal basis under EU copyright law

Note: Only enable this if you're using Content Signals in that ruleset, as it adds ~35 lines to the top of your robots.txt.

Configuration within RuleSets

Content Signals are configured within RuleSets and only apply to Allow directives. Disallow rules never include Content-Signal directives.

Default ContentSignal for All Allow Rules

Configure a default ContentSignal at the RuleSet level that applies to all Allow rules:

"RuleSets": {
  "Production": {
    "ContentSignal": {
      "AiTrain": true,
      "Search": true,
      "AiInput": true
    },
    "Allow": {
      "*": ["/"],
      "Googlebot": ["/"]
    },
    "Disallow": {
      "*": ["/admin/"]
    }
  }
}

Both * and Googlebot Allow rules will get the same Content-Signal.
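
Following the output patterns shown elsewhere in this document, that configuration should render roughly as:

```
User-agent: Googlebot
Content-Signal: ai-train=yes, search=yes, ai-input=yes
Allow: /

User-agent: *
Content-Signal: ai-train=yes, search=yes, ai-input=yes
Allow: /

User-agent: *
Disallow: /admin/
```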

Agent-Specific ContentSignal

Override the default ContentSignal for specific user agents:

"RuleSets": {
  "Production": {
    "IncludeContentSignalInstructions": true,
    "ContentSignal": {
      "AiTrain": true,
      "Search": true,
      "AiInput": true
    },
    "Allow": {
      "OAI-SearchBot": {
        "Paths": ["/"],
        "ContentSignal": {
          "AiTrain": false,
          "Search": true,
          "AiInput": false
        }
      },
      "googlebot": {
        "Paths": ["/blog", "/news"],
        "ContentSignal": {
          "AiTrain": false,
          "Search": true,
          "AiInput": false
        }
      },
      "*": ["/"]
    },
    "Disallow": {
      "*": ["/cdn-cgi/"]
    }
  }
}

This generates:

# As a condition of accessing this website, you agree to abide by
# the following content signals:
# ... (legal header text) ...

User-agent: googlebot
Content-Signal: /blog ai-train=no, search=yes, ai-input=no
Allow: /blog
Content-Signal: /news ai-train=no, search=yes, ai-input=no
Allow: /news

User-agent: OAI-SearchBot
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /

User-agent: *
Content-Signal: ai-train=yes, search=yes, ai-input=yes
Allow: /

User-agent: *
Disallow: /cdn-cgi/

Notice:

  • Specific user agents (googlebot, OAI-SearchBot) appear before the wildcard *
  • Each user-agent gets its own ContentSignal - googlebot and OAI-SearchBot have restricted permissions, while * allows everything
  • The legal header is included because IncludeContentSignalInstructions: true
  • Each path gets its own Content-Signal directive paired with its Allow directive

Simple and Complex Allow Rules

You can mix simple array format and complex object format in the same RuleSet:

  • Simple format: "UserAgent": ["/path1", "/path2"] - Uses default ContentSignal from RuleSet
  • Complex format: "UserAgent": { "Paths": [...], "ContentSignal": {...} } - Uses agent-specific ContentSignal

"RuleSets": {
  "Production": {
    "ContentSignal": {
      "AiTrain": true,
      "Search": true,
      "AiInput": true
    },
    "Allow": {
      "Googlebot": ["/"],  // Simple - uses default ContentSignal (ai-train=yes)
      "OAI-SearchBot": {  // Complex - overrides with ai-train=no
        "Paths": ["/"],
        "ContentSignal": {
          "AiTrain": false,
          "Search": true,
          "AiInput": false
        }
      }
    }
  }
}

This lets you set a permissive default for most bots while restricting specific ones like AI search crawlers.
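
Based on the generation rules above, the expected output for this mixed configuration is roughly:

```
User-agent: Googlebot
Content-Signal: ai-train=yes, search=yes, ai-input=yes
Allow: /

User-agent: OAI-SearchBot
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /
```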

User-Agent Ordering

Specific user agents are always ordered alphabetically and appear before the wildcard *. This follows robots.txt best practices where more specific rules should be evaluated before general rules.

"Allow": {
  "OAI-SearchBot": ["/"],
  "googlebot": ["/blog"],
  "*": ["/"]
}

Will always output in this order:

User-agent: googlebot
...

User-agent: OAI-SearchBot
...

User-agent: *
...

Common Policies

Allow Search Only (no AI):

"ContentSignal": {
  "AiTrain": false,
  "Search": true,
  "AiInput": false
}

Allow Search & AI Input (no training):

"ContentSignal": {
  "AiTrain": false,
  "Search": true,
  "AiInput": true
}

Allow All:

"ContentSignal": {
  "AiTrain": true,
  "Search": true,
  "AiInput": true
}

Disallow All (most restrictive):

"ContentSignal": {
  "AiTrain": false,
  "Search": false,
  "AiInput": false
}
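
For reference, these four policies render as the following Content-Signal directives (shown here without paths; per the examples above, a path prefix is added when the Allow rule targets a specific path):

```
Content-Signal: ai-train=no, search=yes, ai-input=no    # Search only
Content-Signal: ai-train=no, search=yes, ai-input=yes   # Search & AI input
Content-Signal: ai-train=yes, search=yes, ai-input=yes  # Allow all
Content-Signal: ai-train=no, search=no, ai-input=no     # Disallow all
```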

Crawl-delay

Control how frequently crawlers can request pages from your site on a per-user-agent basis. The Crawl-delay directive requests crawlers to wait a specified number of seconds between successive requests.

Configuration:

Crawl-delay is configured per user-agent in the Allow rules using the complex format:

"RuleSets": {
  "Production": {
    "Allow": {
      "Googlebot": {
        "Paths": ["/"],
        "CrawlDelay": 10
      },
      "Bingbot": {
        "Paths": ["/"],
        "CrawlDelay": 5
      }
    }
  }
}

This generates:

User-agent: Bingbot
Crawl-delay: 5
Allow: /

User-agent: Googlebot
Crawl-delay: 10
Allow: /

Notes:

  • Crawl-delay is specified in seconds (integer)
  • Only available in the complex Allow rule format (not simple string array)
  • If a user-agent appears in both Allow and Disallow with different Crawl-delay values, Allow takes precedence
  • Not all crawlers respect Crawl-delay (Google and Bing use their own rate limiting via Search Console/Webmaster Tools)
  • Typical values: 1-10 seconds for busy sites, 1-2 seconds for moderate traffic (the value must be a whole number of seconds)

Use cases:

  • Protect server resources during peak traffic
  • Slow down aggressive crawlers
  • Different rates for different bots (e.g., slower for less important crawlers)

Cache Control

Control how long browsers and crawlers should cache the robots.txt file using the MaxAge property. This sets the Cache-Control: max-age header in the HTTP response.

"Crumpled": {
  "RobotsTxt": {
    "MaxAge": "1.00:00:00"  // 1 day (default)
  }
}

TimeSpan Format Examples:

  • "1.00:00:00" - 1 day (default)
  • "12:00:00" - 12 hours
  • "00:30:00" - 30 minutes
  • "7.00:00:00" - 7 days

Default: 1 day (86,400 seconds)
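
With the default MaxAge, the robots.txt response therefore includes a header like the following (1 day = 86,400 seconds; the surrounding response lines are shown for context only):

```
HTTP/1.1 200 OK
Content-Type: text/plain
Cache-Control: max-age=86400
```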

Why it matters:

  • robots.txt doesn't change frequently, so longer cache times reduce server load
  • Search engines and crawlers respect cache headers to minimize repeated requests
  • Shorter cache times allow faster propagation of rule changes if needed

Recommended values:

  • Production sites: 1.00:00:00 to 7.00:00:00 (1-7 days) - rules rarely change
  • Active development: 00:30:00 to 01:00:00 (30 minutes - 1 hour) - faster updates during testing

Compatible Target Frameworks

.NET net8.0 is the included target framework. net9.0 and net10.0 are computed as compatible, along with the platform-specific variants of each (android, browser, ios, maccatalyst, macos, tvos and windows).


Version History

Version       Downloads  Last Updated
3.1.0-beta.1  36         3/18/2026
3.0.3         253        2/14/2026
3.0.2         116        2/12/2026
3.0.1         90         2/12/2026
3.0.0         96         2/12/2026