Deepak Gupta

Glossary · last updated 2026-05-27

AI crawler

Also known as: AI bot, LLM crawler, GPTBot, ClaudeBot, PerplexityBot

A web crawler operated by an AI engine or LLM provider, used to build training corpora and retrieval indexes. Access controls on these bots determine whether your content is reachable by AI engines at all.

AI crawlers fetch web content for two distinct purposes: building training corpora for LLMs (offline, batch) and powering real-time grounding for answer engines (live, per-query). Some user agents do both; some do one or the other. The major user agents publishers track in 2026:

GPTBot (OpenAI): training corpus.
OAI-SearchBot (OpenAI): real-time grounding for ChatGPT Search.
ChatGPT-User (OpenAI): user-initiated fetches via ChatGPT plugins or browsing.
ClaudeBot (Anthropic): training corpus.
Claude-Web (Anthropic): real-time grounding for Claude with web search.
PerplexityBot (Perplexity): real-time grounding.
Google-Extended (Google): opt-out signal for Gemini (formerly Bard) training, separate from Googlebot.
Bingbot (Microsoft): both classical Bing search and AI search via Copilot.
CCBot (Common Crawl): third-party crawl used by many LLM training corpora.
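
As a rough illustration of how the list above can be used in log analysis, the sketch below maps each user-agent token to its operator and purpose and looks for the token inside a request's User-Agent header. The names `CrawlerInfo`, `KNOWN_CRAWLERS` and `classify_user_agent` are hypothetical, the purpose labels simply restate the list, and the sample header strings are illustrative rather than the exact strings these bots send.

```python
from typing import NamedTuple, Optional


class CrawlerInfo(NamedTuple):
    operator: str
    purpose: str  # label restating the glossary list, not an official taxonomy


# Token -> (operator, purpose), copied from the list above.
# Google-Extended is left out on purpose: the list describes it as an
# opt-out signal rather than a crawler that fetches pages itself.
KNOWN_CRAWLERS = {
    "GPTBot": CrawlerInfo("OpenAI", "training"),
    "OAI-SearchBot": CrawlerInfo("OpenAI", "grounding"),
    "ChatGPT-User": CrawlerInfo("OpenAI", "user-initiated"),
    "ClaudeBot": CrawlerInfo("Anthropic", "training"),
    "Claude-Web": CrawlerInfo("Anthropic", "grounding"),
    "PerplexityBot": CrawlerInfo("Perplexity", "grounding"),
    "Bingbot": CrawlerInfo("Microsoft", "classical and AI search"),
    "CCBot": CrawlerInfo("Common Crawl", "training"),
}


def classify_user_agent(user_agent: str) -> Optional[CrawlerInfo]:
    """Return the known crawler whose token appears in the header, if any."""
    lowered = user_agent.lower()
    for token, info in KNOWN_CRAWLERS.items():
        if token.lower() in lowered:
            return info
    return None


# Illustrative header strings (assumed, not copied from provider docs).
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"))
```

User-Agent headers are trivially spoofed, so a match here is a hint for reporting, not proof of who is fetching.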

Robots.txt controls access per bot, though compliance is voluntary on the crawler's side. Blocking a bot is meant to keep content out of that provider's training corpus (a mostly aspirational goal anyway), but it typically also prevents real-time grounding by that engine, which is a meaningful AI-visibility cost. Most AEO/GEO programmes in 2026 allow all major crawlers and use rate limiting at the edge, via Cloudflare or a similar service, to manage abuse without losing visibility.
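
To make the per-bot behaviour concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URL and the helper name `can_fetch` are assumptions for illustration. The check shows only what a compliant crawler would conclude from the file; it does not enforce anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot is blocked, everything else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Placeholder URL, used only for the check below.
URL = "https://example.com/glossary/ai-crawler"


def can_fetch(robots_txt: str, bot: str, url: str) -> bool:
    """Report whether a compliant crawler named `bot` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


for bot in ("GPTBot", "OAI-SearchBot", "PerplexityBot"):
    print(bot, can_fetch(ROBOTS_TXT, bot, URL))
```

Each rule group applies only to the user agent it names, so blocking one bot leaves the others on the fallback `*` group.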

The economics of opening vs blocking AI crawlers are unsettled and contested. The pragmatic stance: open to the grounding bots (real-time retrieval is the durable AI-visibility surface), opt out of training where the provider honours the signal, and watch the publisher-LLM economic settlement evolve.
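
One way to express that stance as configuration is sketched below: a small generator that renders a robots.txt from a bot-to-purpose table. The table `BOT_PURPOSE` and the function `render_robots_txt` are hypothetical, the purpose labels restate the list earlier in this entry, and the sketch assumes every training token shown honours a robots.txt opt-out, which should be verified per provider.

```python
from typing import Dict

# Purpose per robots.txt token, restating the glossary list.
# Assumption: each "training" token honours a robots.txt opt-out.
BOT_PURPOSE: Dict[str, str] = {
    "OAI-SearchBot": "grounding",
    "Claude-Web": "grounding",
    "PerplexityBot": "grounding",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "CCBot": "training",
}


def render_robots_txt(bot_purpose: Dict[str, str]) -> str:
    """Allow grounding bots, disallow training bots, allow everyone else."""
    groups = []
    for bot, purpose in bot_purpose.items():
        rule = "Disallow: /" if purpose == "training" else "Allow: /"
        groups.append(f"User-agent: {bot}\n{rule}")
    # Fallback group for crawlers not named above.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


print(render_robots_txt(BOT_PURPOSE))
```

Keeping the policy in a table makes it easy to flip a single bot as the publisher-LLM settlement changes, without hand-editing the file.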