Implementation · GEO · 10 min read · last updated 2026-06-08
Should you block AI crawlers? A robots.txt decision guide for GPTBot, ClaudeBot, and PerplexityBot
The training-versus-grounding distinction that should drive the decision, the bots that actually matter, and the pragmatic stance most publishers land on
"Should we block the AI crawlers?" is the most-asked and most-emotional infrastructure question in AI search, and most answers skip the distinction that actually decides it. There is a difference between the bots that crawl to build training corpora and the bots that crawl in real time to ground answers, and blocking the wrong one quietly removes you from the engines your buyers use.
The distinction that decides it
AI crawlers fetch content for two different purposes:
- Training crawl. Offline, batch collection to build the corpus a model learns from. Blocking this opts you out of training-corpus inclusion, a benefit that is mostly aspirational for publishers anyway: the major labs publish little about corpus selection, and the practical visibility payoff is unclear.
- Grounding crawl. Live, per-query retrieval that powers answer engines. Blocking this removes you from real-time grounding, which is the durable AI-visibility surface. Block the grounding bot and you cannot be cited by that engine, full stop.
Almost every bad robots.txt decision comes from treating these as one thing. You can opt out of training while staying open to grounding. Conflating them is how publishers accidentally make themselves invisible in ChatGPT or Perplexity while trying to protect their content from training.
The bots, and what blocking each one costs
| User agent | Operator | Purpose | Cost of blocking |
|---|---|---|---|
| GPTBot | OpenAI | Training | Out of OpenAI training; little live-visibility cost |
| OAI-SearchBot | OpenAI | Grounding | Out of ChatGPT Search citations |
| ChatGPT-User | OpenAI | User-initiated fetch | Breaks user browse/plugin fetches |
| ClaudeBot | Anthropic | Training | Out of Anthropic training |
| Claude-Web | Anthropic | Grounding | Out of Claude web-search citations |
| PerplexityBot | Perplexity | Grounding | Out of Perplexity citations |
| Google-Extended | Google | Gemini training opt-out | Out of Gemini training; does not affect Search |
| Googlebot | Google | Search + AI Overviews | Out of Google entirely (do not block) |
| Bingbot | Microsoft | Bing + Copilot | Out of Bing and Copilot |
| CCBot | Common Crawl | Third-party training | Out of many downstream training sets |
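The table can also be held as data, which makes the training-versus-grounding split explicit in any tooling built around it. The sketch below is illustrative only: the `Crawler` type, the `agents_for` helper, and the three purpose labels are assumptions introduced here, and Googlebot and Bingbot are filed under grounding to match how this guide treats them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crawler:
    user_agent: str
    operator: str
    purpose: str  # simplified label (assumption): "training", "grounding", or "user_fetch"


# Rows mirror the table above, in the same order.
CRAWLERS = [
    Crawler("GPTBot", "OpenAI", "training"),
    Crawler("OAI-SearchBot", "OpenAI", "grounding"),
    Crawler("ChatGPT-User", "OpenAI", "user_fetch"),
    Crawler("ClaudeBot", "Anthropic", "training"),
    Crawler("Claude-Web", "Anthropic", "grounding"),
    Crawler("PerplexityBot", "Perplexity", "grounding"),
    Crawler("Google-Extended", "Google", "training"),
    Crawler("Googlebot", "Google", "grounding"),
    Crawler("Bingbot", "Microsoft", "grounding"),
    Crawler("CCBot", "Common Crawl", "training"),
]


def agents_for(purpose: str) -> list[str]:
    """Return the user-agent tokens whose simplified purpose matches."""
    return [c.user_agent for c in CRAWLERS if c.purpose == purpose]
```

Calling `agents_for("training")` yields the opt-out candidates, and `agents_for("grounding")` yields the bots whose blocking costs citations.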
The case for blocking
It is a real position, not a fringe one. The publisher-economics argument is that AI engines consume your content to produce answers that reduce your traffic, so blocking is leverage: deny the corpus, or hold out for a licensing deal. Publishers with strong content-licensing prospects, and sites in sensitive or legally exposed categories, have defensible reasons to block training crawlers specifically. The AI content crisis piece on guptadeepak.com covers the economic settlement playing out here.
The case for staying open
For most publishers, real-time grounding is the durable AI-visibility surface, and the grounding bots are the ones to keep open. Blocking PerplexityBot or OAI-SearchBot does not protect anything valuable; it just makes you uncitable in those engines while your competitors get the citation. Training-corpus inclusion, the only thing a training-crawler block actually prevents, is mostly aspirational anyway. For a site whose goal is to be visible in AI search, blocking grounding bots is self-defeating.
The pragmatic middle
The stance most AEO and GEO programs converge on:
- Open the grounding bots. OAI-SearchBot, Claude-Web, PerplexityBot, Bingbot, Googlebot. These are your AI-search visibility.
- Opt out of training where it is honored and where you care. Google-Extended, GPTBot, ClaudeBot, CCBot, if training inclusion is a concern. This costs you little visibility.
- Manage abuse at the edge, not in robots.txt. robots.txt is honor-system; well-behaved bots respect it and misbehaving ones ignore it. Rate-limiting and bot management at the CDN (Cloudflare's AI bot controls, for one) is where you actually enforce policy.
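To make the edge-enforcement point concrete, here is a minimal sketch of the decision an edge worker or middleware might make from the `User-Agent` header. The `edge_action` function and its return values are hypothetical, not any CDN's API; a real deployment would use the vendor's bot-management rules and verify the request source as well, because the header is trivially spoofed. Google-Extended is left out because it is a robots.txt control token rather than a user agent that appears in requests.

```python
# Tokens taken from the stance above; matching is a case-insensitive substring check.
GROUNDING_TOKENS = ("OAI-SearchBot", "Claude-Web", "PerplexityBot", "Bingbot", "Googlebot")
TRAINING_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")


def edge_action(user_agent: str) -> str:
    """Classify a request by its User-Agent header (hypothetical policy hook).

    Returns "allow" for grounding bots, "block" for training bots, and
    "default" for everything else (ordinary browsers, unknown bots).
    """
    ua = user_agent.lower()
    if any(token.lower() in ua for token in GROUNDING_TOKENS):
        return "allow"
    if any(token.lower() in ua for token in TRAINING_TOKENS):
        return "block"
    return "default"
```

Unlike robots.txt, a rule like this is enforced whether or not the crawler chooses to cooperate, which is why policy that matters belongs at this layer.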
How to implement it
A robots.txt that stays open to grounding while opting out of training looks roughly like this:
# Allow grounding / real-time retrieval (AI search visibility)
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
# Opt out of training corpora
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
Pair it with an llms.txt so the grounding bots you welcome get a clean map of your best content, and remember that robots.txt is a request, not a wall. Enforcement lives at the edge.
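Before deploying, the directives can be sanity-checked with Python's standard-library `urllib.robotparser`. The sketch below parses a trimmed copy of the rules and reports which agents may fetch a page; the `check` helper and the `example.com` URL are placeholders introduced for illustration, and the parser's matching may differ in detail from how each crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the rules above, for illustration.
ROBOTS_TXT = """\
# Allow grounding / real-time retrieval
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Opt out of training corpora
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def check(robots_txt: str, agents: list[str],
          url: str = "https://example.com/guide") -> dict[str, bool]:
    """Map each user-agent token to whether the rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["OAI-SearchBot", "PerplexityBot", "GPTBot", "CCBot", "Bingbot"]
    for agent, allowed in check(ROBOTS_TXT, agents).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

In this trimmed copy Bingbot has no group of its own and there is no `User-agent: *` group, so it comes back as allowed: an agent that is not listed is open by default, which is worth remembering when deciding what to leave out of the file.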
The economics are unsettled
The publisher-versus-LLM economic relationship is still being negotiated in public: licensing deals, opt-out standards, and litigation are all in motion, and the right answer may shift as that settles. The pragmatic stance above is robust to most outcomes: stay visible where visibility is the durable asset, opt out of the corpus where you have reason to, and revisit as the settlement evolves. If your traffic is already down, do not reach for the block as a fix; the AI Overviews recovery guide covers why that usually makes things worse.