A new class of crawler appeared between 2023 and 2024: AI systems collecting web content for training, retrieval, and answer generation. They’re not Googlebot, they don’t serve traditional search results, and the decision about whether to allow or block them is genuinely strategic. The defaults that worked for search engine crawlers don’t translate cleanly.
GPTBot from OpenAI, ClaudeBot from Anthropic, PerplexityBot from Perplexity, Google-Extended for Gemini, CCBot from Common Crawl, and a growing list of others now request content from most major sites. Each has documented user agent strings, documented purposes, and documented opt-out mechanisms. The choices about how to handle them affect both how the site appears in AI-generated answers and how the site’s content gets used in model training.
The crawlers and what they do:
The major AI crawlers as of mid-2026, with their stated purposes:
| Crawler | Operator | Purpose | User agent token |
|---|---|---|---|
| **GPTBot** | OpenAI | Training data for future models | `GPTBot` |
| **OAI-SearchBot** | OpenAI | Real-time fetching for ChatGPT Search | `OAI-SearchBot` |
| **ChatGPT-User** | OpenAI | On-demand fetches when a user shares a URL | `ChatGPT-User` |
| **ClaudeBot** | Anthropic | Training data for Claude models | `ClaudeBot` |
| **Claude-User** | Anthropic | On-demand fetches in Claude conversations | `Claude-User` |
| **Claude-SearchBot** | Anthropic | Real-time search for Claude | `Claude-SearchBot` |
| **PerplexityBot** | Perplexity | Crawling for Perplexity's search index | `PerplexityBot` |
| **Perplexity-User** | Perplexity | On-demand fetches for user queries | `Perplexity-User` |
| **Google-Extended** | Google | Training data for Gemini and Vertex AI | (controlled in robots.txt with the token `Google-Extended`) |
| **CCBot** | Common Crawl | Open dataset used by many AI projects | `CCBot` |
| **Bytespider** | ByteDance | Training data for ByteDance models | `Bytespider` |
| **Meta-ExternalAgent** | Meta | Training data for Meta AI | `Meta-ExternalAgent` |
The distinction between training crawlers and real-time fetch agents matters. Training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) collect content into datasets that train future models. Real-time fetch agents retrieve content to answer specific user queries: OAI-SearchBot, Claude-SearchBot, and PerplexityBot for search, and ChatGPT-User, Claude-User, and Perplexity-User for on-demand fetches.
The opt-out decision can be different for each category. A site can allow real-time fetching (so its content can be cited in AI answers) while blocking training crawlers (so its content doesn’t end up in next-generation models).
How the opt-out works:
The operators of the major AI crawlers state that their bots respect robots.txt. The standard pattern:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
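Each crawler is named explicitly. Under the Robots Exclusion Protocol, a crawler follows the group that names it and falls back to the `User-agent: *` group only when no group does, so a wildcard `Disallow: /` would also shut out Googlebot and every other search crawler. Sites that want to block AI crawling while staying visible in search need to enumerate each AI crawler.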
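To see which crawlers a given robots.txt blocks, the rules can be evaluated with Python's standard-library `urllib.robotparser`. This is a minimal sketch: the `ROBOTS_TXT` string and the `example.com` URL are placeholders, and the standard-library parser only approximates how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt using the opt-out pattern shown above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the subset of `agents` that the rules disallow for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Googlebot has no group of its own here, so it stays allowed.
    print(blocked_agents(ROBOTS_TXT, ["GPTBot", "ClaudeBot", "Googlebot"]))
```

Running a check like this against the live file after each robots.txt change is a cheap way to catch typos in user agent names.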
Some crawlers support additional opt-out signals:
- OpenAI publishes its IP ranges so blocks can be enforced at the firewall level beyond robots.txt
- Anthropic documents that ClaudeBot respects robots.txt and uses identifiable user agent strings
- Google-Extended is a robots.txt token, not a crawler with its own IP range; it’s used to opt out of Gemini training while keeping Googlebot crawling for Search
- Common Crawl respects robots.txt and provides historical opt-out (removes the site from previously collected datasets on request, though enforcement varies)
The opt-out timing varies. Adding a Disallow rule takes effect within days for most crawlers. Removing already-collected training data is harder; some operators honor removal requests, others don’t.
Emerging standards beyond robots.txt:
Robots.txt was written for search engine indexing in 1994. It carries assumptions that don’t fit AI: a single binary signal (allow or disallow) with no granularity over use case, no expiration, no machine-readable license terms, and no enforcement mechanism beyond crawler compliance. Several proposals have emerged to address the gap, with adoption uneven.
- TDM (Text and Data Mining) Reservation Protocol. A signal embedded in HTTP response headers (`tdm-reservation: 1`) or HTML metadata that asserts a site’s content is reserved from text and data mining for AI training. The protocol comes from the European Union’s Copyright Directive Article 4, which gives rightsholders the right to opt out of TDM exemptions. Crawler compliance with TDM signals is inconsistent; the legal weight in EU jurisdictions is stronger than the technical enforcement.
- llms.txt. A proposed convention (announced by Answer.AI’s Jeremy Howard in 2024) for sites to publish a markdown-formatted summary specifically for LLM consumption. The format is the inverse of robots.txt; instead of blocking, it provides a curated entry point with the site’s most LLM-relevant content. Adoption is still early; some major documentation sites (Anthropic, Vercel, Cloudflare) have published llms.txt files. The protocol assumes that AI companies want to be told where the high-value content is, which may or may not match how their crawlers actually behave.
- ai.txt. A Spawning Project initiative proposing a separate file from robots.txt with finer-grained AI-specific directives. Less adopted than the alternatives; the practical effect is similar to using robots.txt with AI crawler user agents.
- Meta tag directives (noai, noimageai). Page-level signals embedded in HTML (`<meta name="robots" content="noai">`) intended to prevent AI training on a specific page. Adoption by AI crawlers is limited; major training crawlers do not consistently honor these tags.
- C2PA (Coalition for Content Provenance and Authenticity). Not an opt-out standard but a content authentication framework. Embeds cryptographic signatures in media files indicating origin and edit history. Adobe, Microsoft, BBC, and others participate. The standard addresses the inverse problem: rather than blocking AI from using content, it lets content prove its provenance after AI models generate or alter it.
None of these standards has achieved universal compliance. For now, robots.txt remains the most enforceable signal, supplemented by IP-based blocking at the CDN layer when crawler operators don’t honor declared opt-outs. Sites with serious licensing concerns combine robots.txt with TDM Reservation headers and CDN-layer enforcement; the layered approach increases the probability that at least one signal is honored.
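As one illustration of the header-based layer, the sketch below wraps a WSGI application so that every response carries `tdm-reservation: 1`. It assumes a Python WSGI stack; `add_tdm_reservation`, `demo_app`, and `collect_response` are hypothetical names introduced for this example, and most sites would set the header in their web server or CDN configuration instead.

```python
def add_tdm_reservation(app):
    """Wrap a WSGI app so every response carries `tdm-reservation: 1`."""
    def wrapped(environ, start_response):
        def start_with_reservation(status, headers, exc_info=None):
            # Drop any existing value so the header appears exactly once.
            kept = [(k, v) for k, v in headers if k.lower() != "tdm-reservation"]
            kept.append(("tdm-reservation", "1"))
            return start_response(status, kept, exc_info)
        return app(environ, start_with_reservation)
    return wrapped


def demo_app(environ, start_response):
    # Hypothetical stand-in for the real site.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def collect_response(app):
    """Call `app` once with a minimal environ; return (status, headers, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/"}
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


if __name__ == "__main__":
    print(collect_response(add_tdm_reservation(demo_app)))
```

The header only declares the reservation; whether a crawler honors it is still up to the operator, which is why it is paired with robots.txt and edge enforcement.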
The decision framework:
The question of whether to allow or block AI crawlers is genuinely strategic. Three considerations dominate:
Content licensing and rights. Sites with paywalled content, premium databases, or licensed content often block AI training crawlers to prevent their content from training models that could compete with their offerings. News publishers have been particularly active here, with major outlets either blocking AI crawlers entirely or negotiating licensing deals with AI companies.
AI answer visibility. Sites that want to appear in AI-generated answers benefit from allowing the real-time fetch agents (OAI-SearchBot, Claude-SearchBot, PerplexityBot for search indexing). Blocking these means the site won’t appear as a source when users ask AI systems questions in the site’s content domain.
Training data philosophy. Sites that want their content to be used in training future AI models (often for ideological reasons, or because they believe widespread use increases their brand visibility) allow training crawlers. Sites that view training crawlers as freeloading on content production (using the content to build competing products without compensation) block them.
The decisions don’t have to be uniform. A common pattern is allowing real-time fetch agents (visibility upside) while blocking training crawlers (no compensation, competing risk).
The legal landscape and what robots.txt actually means:
Robots.txt was never a legal mechanism. It’s a voluntary protocol that requests crawler compliance; enforcement depends on each operator’s policy and good faith. The legal question of whether AI crawlers can use web content for training without consent or compensation is being litigated in real time as of 2026, with no settled answer.
The major active cases set the contours of the dispute:
- New York Times v. OpenAI and Microsoft (filed December 2023). The Times alleges OpenAI and Microsoft used Times content without authorization to train ChatGPT and Bing’s AI features, and that the resulting models can reproduce Times content verbatim in responses. The case is ongoing; key fair use arguments and damages calculations remain unresolved.
- Authors Guild v. OpenAI and similar author class actions. Multiple author groups have filed against OpenAI, Anthropic, Meta, and others alleging unauthorized training on copyrighted books. Sarah Silverman, George R.R. Martin, and Jonathan Franzen are among the named plaintiffs across various cases.
- Stack Overflow’s terms of service updates (2024). After ChatGPT trained on Stack Overflow content, the platform updated its terms to restrict AI training and pursued partnership-based licensing instead of unilateral access.
- Getty Images v. Stability AI. Image-specific but precedent-relevant; Getty alleges Stable Diffusion was trained on Getty images including watermarked content.
For most sites, robots.txt remains the practical compliance signal even when its legal weight is uncertain. A site that has clearly disallowed an AI crawler in robots.txt has a stronger position in any subsequent dispute than a site that has signaled nothing. Some legal scholars have argued that robots.txt disallowance creates an implied license restriction that strengthens copyright claims, though the argument has not been definitively tested.
The practical implications for site owners:
- Robots.txt is the floor, not the ceiling. Strong opt-out documentation (robots.txt, TDM headers, terms of service updates) builds a defensible position even if individual crawlers ignore the signals.
- Logging is evidence. Server logs showing AI crawler activity after explicit blocks become potential evidence in disputes. Retention policies should preserve at least 90 days of crawler activity.
- Jurisdiction matters. EU sites have stronger legal footing under the Digital Single Market Directive than US sites under fair use doctrine. The same crawler activity can have different legal status depending on where the content was published.
- Licensing deals are emerging. OpenAI has signed content licensing agreements with Axel Springer, News Corp, the Financial Times, Vox Media, the Associated Press, Time, the Atlantic, and Reuters, among others. Google has similar arrangements with publishers like the New York Times for AI features. The terms vary significantly; some include training rights, others restrict to real-time retrieval only.
The legal landscape will continue to evolve through 2026 and beyond. Site owners with significant copyright concerns should treat AI crawler management as an ongoing policy question with input from legal counsel, not a one-time robots.txt configuration.
What each decision looks like in robots.txt:
The configurations that produce different strategic outcomes:
Maximum permissiveness (allow all AI):
```
# No specific AI crawler blocks; default User-agent: * rules apply
User-agent: *
Allow: /
```
Visibility-focused (allow real-time fetch, block training):
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```
Maximum protection (block all AI):
```
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
```
The visibility-focused configuration has become the most common for sites whose primary concern is being cited in AI answers without contributing to training datasets that may compete with the site’s content business.
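Because the list of crawlers keeps growing, some teams may prefer to generate the file from a short policy definition rather than edit it by hand. The sketch below is one way to do that; `build_robots_txt`, `TRAINING`, and `RETRIEVAL` are hypothetical names, and the crawler lists simply mirror the visibility-focused configuration above.

```python
# Crawler groupings taken from the visibility-focused configuration above.
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
RETRIEVAL = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]


def build_robots_txt(block, allow=()):
    """Render one robots.txt group per crawler: blocked first, then allowed."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in block]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allow]
    return "\n\n".join(groups) + "\n"


# Block training crawlers, explicitly allow real-time fetch agents.
visibility_focused = build_robots_txt(block=TRAINING, allow=RETRIEVAL)

if __name__ == "__main__":
    print(visibility_focused)
```

Keeping the two lists in version control makes the quarterly "which crawlers are new" review a one-line change.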
How AI Overviews and retrieval systems intersect with crawler permissions:
Crawler permissions don’t operate in isolation; they interact with how AI systems retrieve and display content. The mechanics differ across systems, and the implications for site visibility differ accordingly.
Google AI Overviews. Powered by Gemini, the feature draws from Google’s standard Search index (not from Google-Extended). A site that allows Googlebot but blocks Google-Extended still appears in AI Overviews because the index is shared. The Google-Extended token controls whether content can be used to train Gemini and Vertex AI models, not whether it can appear in AI-generated answer summaries. This separation matters because it means visibility in AI Overviews and training opt-out are independent decisions.
Bing Copilot and AI search. Powered by GPT models with retrieval from Bingbot’s index. Sites allowed by Bingbot can appear in Copilot’s cited sources. The OAI-SearchBot user agent is a separate fetch mechanism OpenAI uses for ChatGPT Search; allowing or blocking it controls visibility in that interface specifically.
ChatGPT Search and Perplexity. Both retrieve pages to answer user questions, with the answer-generation model citing the fetched sources. OAI-SearchBot and PerplexityBot gather content for search, while ChatGPT-User and Perplexity-User fetch pages on demand at query time; blocking these agents excludes the site from those answer interfaces but doesn’t affect training datasets.
Anthropic Claude with web search. Claude-User fetches content when users ask questions that require current information. Like ChatGPT-User, the fetch is on-demand. Claude-SearchBot is the indexing crawler for Claude’s broader retrieval capabilities.
The pattern across systems: training crawlers (which build datasets) and retrieval crawlers (which fetch on demand for answers) are increasingly separate. Sites that want maximum AI answer visibility while protecting training rights need to identify which user agent controls which behavior for each AI system. The configuration table at the start of this article catalogs the major user agents and their purposes; the practical implication is that blanket “block all AI” reduces visibility in AI answer interfaces, not just training datasets.
RAG (retrieval-augmented generation) is the underlying architecture for most current AI answer systems. The AI model retrieves relevant content at query time, then generates an answer that synthesizes the retrieved sources. The implication for SEO: content that’s well-structured, semantically clear, and indexable by the retrieval crawler is more likely to be cited in AI answers. The same content quality signals that produce good traditional SEO outcomes also produce good RAG visibility outcomes, with crawler permissions as the gating factor.
Enforcement and edge cases:
Robots.txt is a request, not a technical enforcement mechanism. Most major AI crawlers honor it, but some don’t, and some grey-zone fetches happen through methods that aren’t easily controlled:
- Common Crawl’s CCBot respects robots.txt when it collects pages, but the resulting dataset is widely redistributed and used by AI projects that may not check robots.txt independently
- Some smaller AI projects use commercial scraping services that don’t always respect robots.txt
- User-initiated fetches (someone pasting a URL into ChatGPT or Claude) trigger different agents that have different opt-out semantics
For sites with strong protection requirements, robots.txt alone isn’t enough. Additional layers:
- Cloudflare’s AI bot blocking (offered on free and paid plans) enforces blocks at the CDN edge for known AI crawlers
- Firewall rules based on IP ranges (where published) block at the infrastructure level
- Paywalls that require authentication block all unauthenticated crawling, including AI
- HTTP and HTML signals (X-Robots-Tag response headers or robots meta tags carrying noai or noimageai) are emerging as additional opt-out mechanisms, though support varies
User-agent verification: when declared identity isn’t enough:
Most enforcement discussion assumes crawlers identify themselves honestly through user agent strings. In practice, sophisticated scrapers spoof identifiable user agents to bypass blocks, and some less reputable AI services have been documented operating without declared user agents at all. Sites with serious content protection concerns verify identity rather than trusting declarations.
Reverse DNS verification. Major crawlers operate from known IP ranges with specific reverse DNS patterns. Googlebot resolves to *.googlebot.com or *.google.com. Bingbot resolves to *.search.msn.com. Verification involves a reverse DNS lookup on the requesting IP, then a forward DNS lookup on the returned hostname to confirm it matches the IP. A request claiming to be Googlebot from an IP that doesn’t pass this round-trip check is spoofed.
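The round-trip check can be sketched with Python's `socket` module. This is an illustration under assumptions: `verify_reverse_dns` is a hypothetical helper, the lookup functions are injectable so the logic can be exercised without network access, and `socket.gethostbyname_ex` covers IPv4 only (an IPv6-aware version would use `socket.getaddrinfo`).

```python
import socket

# Hostname suffixes the text above gives for Googlebot.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def verify_reverse_dns(ip, allowed_suffixes,
                       reverse_lookup=socket.gethostbyaddr,
                       forward_lookup=socket.gethostbyname_ex):
    """Return True only if `ip` passes the reverse-then-forward DNS check."""
    try:
        # Step 1: reverse lookup, IP -> hostname.
        hostname = reverse_lookup(ip)[0].rstrip(".").lower()
    except OSError:
        return False
    # Step 2: the hostname must belong to the crawler operator's domain.
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        # Step 3: forward lookup, hostname -> list of IPs.
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False
    # Step 4: the original IP must appear among the forward results.
    return ip in forward_ips
```

DNS lookups are slow relative to request handling, so in practice the result would usually be cached per IP rather than computed on every request.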
Published IP range checks. Google publishes its Googlebot IP ranges in a JSON file at developers.google.com/search/apis/ipranges/googlebot.json. Bing publishes similar data. For Googlebot, the JSON file is the authoritative source; comparing the requesting IP against the published ranges confirms authenticity. Some AI crawlers (OpenAI for GPTBot, Anthropic for ClaudeBot) have begun publishing IP ranges similarly, though the practice is less mature than for traditional search engines.
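A range check against a published file can be sketched with the standard-library `ipaddress` module. The JSON shape below (a `prefixes` list whose entries carry either `ipv4Prefix` or `ipv6Prefix`) is an assumption based on the format of Google's file; other operators may publish a different layout. `SAMPLE_RANGES` uses documentation-only address blocks, not real crawler ranges.

```python
import ipaddress
import json

# Illustrative data only: these are reserved documentation prefixes.
SAMPLE_RANGES = json.loads("""
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
              {"ipv6Prefix": "2001:db8::/32"}]}
""")


def ip_in_published_ranges(ip, ranges_doc):
    """Return True if `ip` falls inside any prefix listed in `ranges_doc`."""
    address = ipaddress.ip_address(ip)
    for entry in ranges_doc.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        # Membership is False when address and network versions differ.
        if prefix and address in ipaddress.ip_network(prefix):
            return True
    return False


if __name__ == "__main__":
    print(ip_in_published_ranges("192.0.2.44", SAMPLE_RANGES))
```

Published ranges change over time, so the file should be re-fetched on a schedule rather than bundled once.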
Behavioral fingerprinting. Real crawlers show patterns: consistent request rates, predictable user agent versioning, distinct header signatures. Spoofed traffic often deviates: bursty patterns, inconsistent versions across requests, missing or unusual headers. Bot management platforms (Cloudflare Bot Management, Akamai Bot Manager, DataDome) use behavioral fingerprinting to identify spoofed traffic that passes user-agent and IP checks.
Rate limiting as defense. Even verified crawlers can be limited to acceptable request rates. A real GPTBot operating within published norms is fine; a verified crawler suddenly making thousands of requests per minute is anomalous regardless of identity. Rate limits at the CDN or origin level catch abusive behavior from both legitimate and spoofed sources.
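A per-client limit of this kind can be sketched as a sliding-window counter. This is a single-process illustration, not a production design: `SlidingWindowLimiter` is a hypothetical class, the thresholds are arbitrary, and real deployments usually enforce limits at the CDN or with shared state across servers.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client key."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the logic can be tested
        self._hits = defaultdict(deque)

    def allow(self, client_key):
        now = self.clock()
        hits = self._hits[client_key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
    print([limiter.allow("GPTBot") for _ in range(3)])
```

The client key can be the verified crawler name, the source IP, or both, depending on whether the goal is to cap a specific operator or any single source.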
The discipline that works for sites with serious protection requirements: combine user-agent declarations (the first signal) with IP verification (the second signal), reverse DNS (the third), and behavioral analysis (the fourth). No single signal is sufficient; the layered approach catches sophisticated spoofing that individual checks would miss.
Operational monitoring for AI crawler activity:
Choosing to allow or block AI crawlers is the policy decision; knowing whether the configuration actually works requires monitoring. Most sites discover compliance failures (or unexpected crawler behavior) months later when reviewing logs. The pattern that catches issues sooner involves explicit monitoring infrastructure.
Log aggregation workflows. Server logs are the primary data source for AI crawler activity. Sites at scale aggregate logs across origins, CDN edges, and load balancers into a central analysis system. Datadog, Splunk, and Elastic are common platforms; smaller sites can use Cloudflare’s Logpush, AWS CloudWatch Logs, or self-hosted ELK stacks. The goal is a queryable view of all crawler activity across the site, not isolated logs per server.
Bot management tooling. Beyond raw logs, bot management platforms (Cloudflare Bot Management, DataDome, PerimeterX, Akamai Bot Manager) provide pre-categorized views of crawler activity with verified vs. spoofed flags. The same platforms enforce blocks at the edge, so they serve as both monitoring and enforcement infrastructure. The tradeoff is cost; enterprise bot management is rarely free.
CDN analytics. Cloudflare, Fastly, and Akamai all expose crawler-level analytics in their dashboards. The granularity is useful for high-level monitoring (which AI crawlers are most active, which paths they target, what response codes they receive) without requiring custom log analysis. The limitation is that CDN analytics only show traffic that hits the CDN edge; direct-to-origin traffic isn’t visible.
SIEM integrations. For sites with security operations centers, AI crawler activity feeds into SIEM platforms (Splunk Enterprise Security, IBM QRadar, Microsoft Sentinel) alongside other security signals. The integration matters when crawler behavior overlaps with security concerns: aggressive scraping, content exfiltration patterns, or suspected industrial espionage. Most SIEMs accept Cloudflare and Fastly log feeds natively.
Crawl anomaly detection. Beyond compliance monitoring, sites benefit from anomaly detection on crawler patterns. Sudden spikes in a previously low-volume crawler, new user agents appearing in logs, unusual path access patterns (deep crawling of premium content, repeated fetches of high-value pages) all warrant investigation. Most modern log analysis platforms include baseline anomaly detection; for sites without dedicated tooling, weekly review of the top 20 crawlers by request volume catches most issues.
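For sites without dedicated tooling, the two simplest checks (new user agents and sudden volume spikes) can be approximated in a few lines. This sketch assumes daily request counts per crawler have already been extracted from logs; `flag_anomalies` and the `spike_factor` threshold of 3 are illustrative choices, not recommended values.

```python
from statistics import median


def flag_anomalies(history, today, spike_factor=3.0):
    """Compare today's per-crawler counts against each crawler's past median.

    history: {agent: [daily request counts]}  today: {agent: count}
    """
    flagged = {}
    for agent, count in today.items():
        past = history.get(agent)
        if not past:
            # Never seen before in the baseline period.
            flagged[agent] = "new user agent"
        elif count > spike_factor * max(median(past), 1):
            flagged[agent] = "volume spike"
    return flagged
```

A median baseline is used here so that one earlier spike does not raise the threshold for the next one.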
The minimum monitoring discipline for a site that takes AI crawler policy seriously: weekly review of crawler activity, monthly review of robots.txt compliance against actual log data, quarterly review of which AI crawlers are emerging and whether configuration needs updates. The cadence catches drift before it compounds.
The measurement problem:
Knowing whether AI crawlers are honoring opt-outs requires log analysis. Server logs show which user agents requested content; comparing to robots.txt rules reveals compliance.
The patterns to look for:
- Disallowed crawlers still requesting content. A compliant crawler stops requesting disallowed paths once it has re-read robots.txt; some crawlers don’t. Verify the user agent isn’t spoofed by checking IP ranges where published.
- Crawl rate from AI crawlers. Sites that allow AI crawling can see significant volume. GPTBot, ClaudeBot, and others combined can exceed Googlebot’s crawl volume on some sites.
- Cited but not fetched. Some AI systems cite sites without obvious crawl traffic, which can mean the citation comes from training data (already collected) rather than real-time fetching.
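The first check in the list above can be sketched as a small log scan. The example assumes access logs in the common "combined" format and matches crawlers by substring on the user agent; `disallowed_hits` and `LOG_LINE` are hypothetical names, and requests for `/robots.txt` itself are skipped because a compliant crawler has to fetch that file.

```python
import re
from collections import Counter

# Combined Log Format: ip ident user [time] "request" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def disallowed_hits(log_lines, disallowed_tokens):
    """Count requests per (token, ip) from crawlers that robots.txt disallows."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in an unexpected format
        parts = match.group("request").split()
        path = parts[1] if len(parts) > 1 else ""
        if path == "/robots.txt":
            continue  # fetching robots.txt is expected behavior
        agent = match.group("agent").lower()
        for token in disallowed_tokens:
            if token.lower() in agent:
                hits[(token, match.group("ip"))] += 1
    return hits
```

Each source IP that shows up in the result is a candidate for the IP range or reverse DNS checks before concluding that the named crawler ignored the rule.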
The measurement is imperfect. AI crawler behavior is less standardized than search engine crawler behavior, and the standards are evolving.
Strategic positioning, not technical detail:
AI crawler management is a real decision, not a technical detail. The decision affects how the site shows up in AI-generated answers, how the site’s content is used in training, and how the site is positioned for the next generation of search and discovery.
The default of allowing everything was the right answer when search engines were the only major crawlers and visibility in search was the primary goal. The default of blocking everything is overly defensive for most sites. The middle path (visibility-focused, with selective blocking) reflects the current state of the trade-off.
The decision should be revisited periodically. The AI crawler landscape changes (new crawlers appear, existing ones change behavior, AI companies announce new policies), and the decisions that made sense in 2024 may need updating in 2026 and beyond.
The fundamentals remain: robots.txt is the primary control, log analysis is the primary measurement, and the decision is about strategic positioning rather than just technical configuration.