
Crawl Budget Is Now an AI Visibility Problem

Crawl budget is no longer a technical SEO footnote. Google's AI surfaces draw from the same Search index as the blue links, so whether you appear in an AI answer reduces to whether Google crawled, indexed, and kept your pages fresh.

By Deepak Gupta, guptadeepak.com

Most people treat crawl budget as a technical SEO footnote. That framing is out of date. Google's AI surfaces (AI Overviews, AI Mode, and Gemini grounding) all retrieve from the same Search index that powers the blue links. So the question of whether your content shows up in an AI answer collapses into a much older question: did Google crawl it, index it, and keep it fresh? Crawl budget is what decides all three.

If Google cannot crawl a page, it cannot index it. If it cannot index it, no AI surface Google operates can ever cite it. That is the whole argument. Everything below is the detail.

This post makes the case. If you want the documentation, I wrote the full source-by-source technical analysis, grounded entirely in Google's official docs, as a research report: Crawl Budget and AI Search Visibility. Read this for the thesis, that for the receipts.

There is no separate AI index

This is the fact that resets the conversation. Google's generative features do not run on a private AI track. AI Overviews use retrieval-augmented generation, what Google calls grounding, to pull content from the Search index into the model at answer time. AI Mode does the same through a process Google calls query fan-out. Both reach into the index you already compete in.

Google's documentation is direct about the eligibility test. To be shown as a supporting link in AI Overviews or AI Mode, a page must be indexed and eligible to appear in Google Search with a snippet. There is no extra requirement. There is no llms.txt shortcut. There is no special schema that buys entry. The bar for AI visibility is the bar for Search, and crawl budget decides which of your pages clear it.

What crawl budget actually is

Crawl budget is two forces, not a fixed number. Google does not hand you a crawl allowance, and you cannot request more directly.

Crawl capacity is what Google can fetch without straining your server: the number of parallel connections Googlebot opens and the delay between fetches. It rises when your site responds quickly and cleanly. It falls when you serve slow pages or errors. A spike in 5xx errors is the fastest way to lose it, and recovery takes days or weeks, not minutes.

Crawl demand is what Google wants to fetch, driven by three things Google names: perceived inventory (every URL it knows about, which is why junk URLs hurt), popularity, and staleness. Google calls perceived inventory the factor you control most.

Your effective budget is whichever of the two runs out first. If your server is fast but your content is thin and never updated, Google stays away. If demand is high but your server buckles, Google throttles. Per Google's crawl budget guidance, there are exactly two ways to increase the budget: add server resources if you are hitting the "Hostload exceeded" error, or improve content quality, which Google scores on popularity, user value, content uniqueness, and serving capacity.
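To make "whichever runs out first" concrete, here is a toy model. Google does not publish capacity or demand as numbers, so the class, its fields, and every figure below are hypothetical. The only point is that the smaller of the two forces sets the ceiling.

```python
from dataclasses import dataclass


@dataclass
class CrawlForces:
    # Both figures are made up for illustration; Google exposes neither.
    capacity_per_day: int  # fetches the server can absorb without strain
    demand_per_day: int    # fetches Google wants to make

    def effective_budget(self) -> int:
        # Whichever force runs out first caps the crawl.
        return min(self.capacity_per_day, self.demand_per_day)

    def limiting_factor(self) -> str:
        if self.capacity_per_day < self.demand_per_day:
            return "capacity"
        if self.demand_per_day < self.capacity_per_day:
            return "demand"
        return "balanced"


# Fast server, thin content: demand is the ceiling.
fast_but_thin = CrawlForces(capacity_per_day=50_000, demand_per_day=2_000)

# Popular content, struggling server: capacity is the ceiling.
popular_but_slow = CrawlForces(capacity_per_day=3_000, demand_per_day=40_000)

print(fast_but_thin.effective_budget(), fast_but_thin.limiting_factor())
print(popular_but_slow.effective_budget(), popular_but_slow.limiting_factor())
```

In this framing, adding servers only helps the second site, and better content only helps the first.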

One caveat worth saying out loud: if your site has under a few thousand URLs, or new pages get crawled the day you publish them, crawl budget is not your problem. This is a large-site and fast-publishing-site issue. For everyone in that group, it is now also an AI visibility issue.

The chain that decides everything

Read this as a chain where each step is a precondition for the next.

  1. Crawl. Googlebot fetches the URL. Crawl budget decides whether and how often.
  2. Index. Google processes and stores the page. A page can be crawled and never indexed. These are two separate steps, and Google says so.
  3. Snippet eligibility. The page qualifies to appear in Search with a snippet.
  4. AI retrieval. AI Overviews and AI Mode pull from the index through grounding. Not in the index means not in the candidate pool.
  5. AI citation. The page gets selected as a source in the synthesized answer. Quality and relevance decide here.

Crawl budget owns steps one and two. Lose the crawl-to-index battle and steps three through five never happen. No content quality, no structured data, no clever tactic rescues a page Google never indexed. The AI visibility problem on large sites is a crawl and indexation problem wearing a new name.
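The chain can be sketched as a series of gates. This is a simplified illustration, not how Google implements anything: `PageState` and its fields are hypothetical names, and step four is folded into the indexing and snippet checks because an indexed, snippet-eligible page is already in the candidate pool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageState:
    # Hypothetical flags describing where a page stands in the chain.
    crawled: bool = False
    indexed: bool = False
    snippet_eligible: bool = False
    selected_as_source: bool = False


def first_blocked_stage(page: PageState) -> Optional[str]:
    """Return the first stage the page fails, or None if it gets cited."""
    if not page.crawled:
        return "crawl"
    if not page.indexed:
        return "index"
    if not page.snippet_eligible:
        return "snippet eligibility"
    # At this point the page is in the AI retrieval candidate pool.
    if not page.selected_as_source:
        return "ai citation"
    return None


# A page that was crawled but never indexed stops at step two,
# no matter how strong its content is.
great_content_never_indexed = PageState(crawled=True, indexed=False)
print(first_blocked_stage(great_content_never_indexed))
```

Later flags are never consulted once an earlier gate fails, which is the whole point of reading the steps as preconditions.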

Why query fan-out raised the stakes

AI Mode does not run one query. It breaks your question into subtopics and fires many queries at once, then assembles an answer from multiple sources. Google says this lets Search dig deeper into the web than a single traditional query.

For a site owner, that changes the math. One user question can now touch dozens of your pages across related sub-queries. Every one of those pages has to be indexed and current to be eligible. A crawl budget problem that used to cost you a ranking or two now costs you presence across an entire answer. Fan-out multiplies the number of your pages in play, which multiplies the cost of every page crawl budget left behind. This is also why GEO has to be earned across a whole vertical of pages, not optimized one URL at a time.
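A rough way to picture the fan-out cost: map each sub-query to the page you would want cited for it, then check how many of those pages are actually indexed. The sub-queries, URLs, and function name below are invented for illustration. Google does not expose the sub-queries AI Mode generates.

```python
from typing import Dict, List, Set, Tuple


def fanout_coverage(
    subquery_pages: Dict[str, str], indexed: Set[str]
) -> Tuple[float, List[str]]:
    """Share of sub-queries whose target page is indexed, plus the gaps."""
    if not subquery_pages:
        return 0.0, []
    missing = sorted(
        query for query, page in subquery_pages.items() if page not in indexed
    )
    covered = len(subquery_pages) - len(missing)
    return covered / len(subquery_pages), missing


# Hypothetical fan-out for one user question about running shoes.
subquery_pages = {
    "best running shoes for flat feet": "/guides/flat-feet",
    "running shoe sizing": "/guides/sizing",
    "how often to replace running shoes": "/guides/replacement",
    "trail vs road running shoes": "/guides/trail-vs-road",
}
indexed = {"/guides/flat-feet", "/guides/sizing", "/guides/replacement"}

coverage, missing = fanout_coverage(subquery_pages, indexed)
print(coverage, missing)
```

One unindexed page is one sub-query where you cannot appear, however well the rest of the vertical performs.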

Where the budget leaks

Every wasted crawl is a crawl your strategic pages did not get. The documented leaks:

Faceted navigation and URL parameters. Filters, sorts, and session IDs spawn a near-infinite set of duplicate URLs. Google calls this a classic crawler trap. Googlebot keeps re-comparing similar pages instead of reaching the pages you care about. This is the biggest single leak on ecommerce and catalog sites.
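One way to see the size of this leak is to group crawled URLs by what they look like once the facet and session parameters are stripped. The sketch below is a minimal version, assuming you have decided which parameters create near-duplicates on your own site. The `FACET_PARAMS` set and the example URLs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical: parameters this site treats as facets, sorts, or session noise.
FACET_PARAMS = {"sort", "color", "size", "sessionid"}


def canonical_key(url: str) -> str:
    """Strip facet parameters and the fragment, keep everything else."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in FACET_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def group_variants(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Map each canonical key to every crawled URL that collapses into it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        groups[canonical_key(url)].append(url)
    return dict(groups)


crawled = [
    "https://example.com/shoes",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?color=red&sort=price",
    "https://example.com/shoes?sessionid=abc123",
    "https://example.com/boots?sort=price",
]

groups = group_variants(crawled)
for key, variants in groups.items():
    print(key, len(variants))
```

Any key with many variants is a page Google fetched several times to learn one thing.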

Duplicate and thin content. Crawl gets spent on unique URLs instead of unique content. Empty category pages and auto-generated filler consume crawl and give the index nothing an AI system would want.

Redirect chains and error pages. Long redirect chains have a documented negative effect on crawling. Every hop burns capacity. Soft 404s waste fetches on nothing.
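To find chains worth collapsing, you can walk a redirect map and count hops. This sketch assumes you already have the map as a plain dictionary of source path to target path, for example from a crawl export. The paths and function names are illustrative.

```python
from typing import Dict, List


def resolve_chain(
    start: str, redirects: Dict[str, str], max_hops: int = 10
) -> List[str]:
    """Follow redirects from start and return every URL visited, in order."""
    chain = [start]
    while chain[-1] in redirects:
        target = redirects[chain[-1]]
        if target in chain or len(chain) - 1 >= max_hops:
            raise ValueError(f"redirect loop or overlong chain from {start}")
        chain.append(target)
    return chain


def flatten(redirects: Dict[str, str]) -> Dict[str, str]:
    """Point every source straight at its final destination (one hop)."""
    return {src: resolve_chain(src, redirects)[-1] for src in redirects}


# Hypothetical redirect map with a three-hop chain.
redirects = {
    "/old": "/older",
    "/older": "/new-ish",
    "/new-ish": "/final",
}

chain = resolve_chain("/old", redirects)
hops = len(chain) - 1
flat = flatten(redirects)
print(chain, hops)
print(flat)
```

After flattening, every old URL reaches `/final` in a single hop.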

Crawled but not indexed. The silent killer. A page passes crawl, fails to enter the index, and becomes invisible to AI retrieval by definition. The only way to catch it is to read the Crawl Stats and Page Indexing reports together.
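Reading the two reports together amounts to a set difference. The sketch below assumes you have two URL lists in hand, one of pages Googlebot fetched and one of pages in the index. How you export them is up to you, and the URLs here are made up.

```python
from typing import Iterable, List


def crawled_but_not_indexed(
    crawled: Iterable[str], indexed: Iterable[str]
) -> List[str]:
    """URLs Googlebot fetched that never made it into the index."""
    return sorted(set(crawled) - set(indexed))


# Hypothetical exports.
crawled_urls = [
    "https://example.com/pricing",
    "https://example.com/guides/crawl-budget",
    "https://example.com/category/empty",
]
indexed_urls = [
    "https://example.com/pricing",
    "https://example.com/guides/crawl-budget",
]

gap = crawled_but_not_indexed(crawled_urls, indexed_urls)
gap_rate = len(gap) / len(set(crawled_urls))
print(gap, round(gap_rate, 2))
```

A gap rate that grows over time is the early warning: crawl is being spent and the index is getting nothing for it.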

What to actually do

Google's position is blunt: optimizing for AI features is optimizing for Search. There is no separate AI checklist. The work splits in two. Technical efficiency protects your budget. Content quality earns selection once you are eligible. Do them in that order.

Protect the budget.

  • Cut the junk inventory first. It is the factor you control most.
  • Consolidate duplicates with canonical tags so crawl focuses on unique content, not unique URLs.
  • Block faceted and infinite-parameter URLs in robots.txt. Disallowed URLs do not count against your budget.
  • Keep sitemaps clean: only indexable, canonical, 200-status URLs. A sitemap is a roadmap, not a junk drawer.
  • Fix server health. Kill 5xx errors, reduce response times, use a CDN.
  • Flatten architecture and strengthen internal links. A page linked from the homepage gets crawl priority over one buried five levels deep.
  • Collapse redirect chains to a single hop.
  • Support Last-Modified and If-Modified-Since, answering with 304 Not Modified when nothing changed, so Google recrawls where content actually changed.
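The last item on that list is the one most often left unimplemented, so here is a minimal sketch of the server-side decision. It uses only the standard library and is framework-agnostic. The function name and return shape are my own, and a real server would also need to handle ETag and If-None-Match.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def conditional_response(
    last_modified: datetime, if_modified_since: Optional[str]
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers): 304 if unchanged since the client's copy."""
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable header: fall through to a full response.
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                return 304, headers  # Nothing changed, no body needed.

    return 200, headers


page_changed_at = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

unchanged = conditional_response(page_changed_at, "Wed, 01 Jan 2025 12:00:00 GMT")
stale_copy = conditional_response(page_changed_at, "Tue, 31 Dec 2024 12:00:00 GMT")
first_visit = conditional_response(page_changed_at, None)
print(unchanged[0], stale_copy[0], first_visit[0])
```

A 304 costs the crawler a round trip but no download, which leaves more fetches for pages that did change.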

Earn the citation.

  • Lead with a genuine point of view. AI systems scan many sources, so first-hand perspective stands out where a summary does not. Google says this likely matters more than anything else on the list, and it is why I argue GEO is a product discipline, not a content tactic.
  • Write helpful, people-first content. E-E-A-T applies directly to what the AI system prefers. The practical version of that is in the AEO strategy playbook.
  • Use clear structure with real headings. No need to fragment your content. Google understands multiple topics on one page.
  • Add relevant images and video. AI search surfaces media, which is extra room to appear.
  • Use structured data to reinforce what is visible on the page, not a machine-only version.

What to ignore

Google is just as clear about wasted effort, and most of the popular AI optimization advice lands here.

You do not need an llms.txt file. Google Search does not use it. You do not need to chunk your content into tiny pieces. Google understands a full page. You do not need special AI schema, an AI writing style, or robotic Q-and-A blocks stuffed onto every page. There is no public evidence any of it creates durable visibility.

Here is the trap. Most of those discarded tactics manufacture new low-value URLs or fragments. That means the trendy AI optimization advice can actively damage your AI visibility by creating exactly the crawl waste that buries your real pages. The defensive move and the offensive move are the same move: keep the index clean.

One more lever: Google-Extended

Worth knowing because people break it by accident. Google-Extended is a separate robots.txt token that controls whether your content trains future Gemini models and grounds Gemini answers in Gemini Apps and Vertex AI. It does not affect Google Search inclusion or ranking at all.

Two surfaces, two controls. AI Overviews and AI Mode follow Googlebot's robots.txt rules, because AI is built into Search. Gemini's grounded answers follow Google-Extended. Block Googlebot and you vanish from Search and its AI features. Block Google-Extended and you stay in Search but drop out of Gemini grounding. Confusing the two, or disallowing Googlebot when you meant Google-Extended, is a common and expensive mistake. If presence in Gemini answers matters to you, keep Google-Extended allowed.
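You can sanity-check which token a robots.txt file actually blocks before you ship it. The sketch below uses Python's standard `urllib.robotparser`, which only approximates Google's matching (it does simple path-prefix matching and does not implement Google's wildcard rules), so treat it as a smoke test. The robots.txt content and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Gemini training and grounding,
# stay in Search, and keep internal search results out of the crawl.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /search
"""


def build_parser(text: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "https://example.com/guides/crawl-budget"

search_allowed = parser.can_fetch("Googlebot", page)
gemini_allowed = parser.can_fetch("Google-Extended", page)
internal_search_allowed = parser.can_fetch(
    "Googlebot", "https://example.com/search?q=shoes"
)
print(search_allowed, gemini_allowed, internal_search_allowed)
```

If the first value ever comes back False when you meant to block only Google-Extended, you have made the expensive mistake described above.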

The bottom line

Google did not build a new door for AI. It widened the one you already use. AI Overviews, AI Mode, and Gemini grounding all reach into the same index, so whether you appear in an AI answer reduces to whether your content is crawled, indexed, and fresh. Crawl budget is the mechanism behind all three.

For a small, clean site, this is not a worry. For a large or fast-moving one, crawl budget is the line between your best pages sitting in the AI candidate pool and being invisible to it. The winners are not running a secret AI playbook. They are running clean technical SEO with the discipline that query fan-out now demands. If you want the full mechanism, with every claim traced to Google's own documentation, read the companion research report, and for the broader picture of winning AI visibility, start with GEO Compass.

If Google cannot crawl it, no AI surface Google operates can ever cite it. Start there.
