Deepak Gupta

Implementing llms.txt and AI Crawler Signals

AI crawlers are different from traditional search engine crawlers. Googlebot, Bingbot, and similar crawlers index pages for search results. AI crawlers such as OpenAI's GPTBot and Anthropic's ClaudeBot ingest content to train models or power real-time retrieval systems, and control tokens such as Google-Extended determine whether content that has already been crawled may be used for AI. Controlling how these crawlers interact with your site is a critical Generative Engine Optimization (GEO) skill.

This chapter covers the technical implementation of llms.txt, robots.txt directives for AI crawlers, and the structured signals that help AI engines understand your content.

Understanding AI Crawlers

Before configuring anything, you need to know which crawlers exist and what they do.

Crawler	Operator	Purpose	User-Agent Token
GPTBot	OpenAI	Training data collection	GPTBot
ChatGPT-User	OpenAI	Real-time browsing in ChatGPT	ChatGPT-User
ClaudeBot	Anthropic	Training data collection	ClaudeBot
Google-Extended	Google	robots.txt control token for Gemini training and grounding (not a separate crawler)	Google-Extended
PerplexityBot	Perplexity	Real-time search and citation	PerplexityBot
Applebot-Extended	Apple	robots.txt control token for Apple's AI model training (not a separate crawler)	Applebot-Extended
Bytespider	ByteDance	Training data for various AI models	Bytespider
Meta-ExternalAgent	Meta	AI training across Meta products	Meta-ExternalAgent
cohere-ai	Cohere	Enterprise AI training data	cohere-ai
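
To see which of these crawlers actually visit your site, you can match the tokens against the User-Agent header in your access logs. The following is a minimal sketch, assuming a case-insensitive substring match is good enough; the function name and the sample header are illustrative, not taken from any vendor's documentation. Google-Extended and Applebot-Extended are left out because they are robots.txt control tokens and do not appear as request user agents.

```python
from typing import Optional

# Tokens from the crawler table. Substring matching is an assumption made
# for this sketch; vendors publish their own full User-Agent strings.
AI_CRAWLER_TOKENS = {
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "ClaudeBot": "Anthropic",
    "PerplexityBot": "Perplexity",
    "Bytespider": "ByteDance",
    "Meta-ExternalAgent": "Meta",
    "cohere-ai": "Cohere",
}


def identify_ai_crawler(user_agent: str) -> Optional[str]:
    """Return 'Token (Operator)' if the header contains a known token."""
    lowered = user_agent.lower()
    for token, operator in AI_CRAWLER_TOKENS.items():
        if token.lower() in lowered:
            return f"{token} ({operator})"
    return None


if __name__ == "__main__":
    # Illustrative header only; check your own logs for the real strings.
    sample = "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"
    print(identify_ai_crawler(sample))
```

User-Agent headers can be spoofed, so treat a match as a hint rather than proof of the crawler's identity.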

The Crawler Permission Decision

You have a strategic choice to make. Blocking AI crawlers keeps your content out of training data but also reduces your chances of being cited. Allowing them means your content may be used for training, but it also increases citation opportunities.

For most B2B SaaS companies, the recommended approach is:

Allow crawlers that power citation-capable products (GPTBot, ChatGPT-User, PerplexityBot, Google-Extended)
Consider blocking crawlers that only collect training data without citation capability
Always allow real-time browsing agents (ChatGPT-User, PerplexityBot) since these directly generate citations

Warning

Blocking GPTBot and PerplexityBot is likely to reduce your AI citation visibility significantly. Only block these crawlers if you have specific legal or competitive reasons. For most B2B content, the citation upside outweighs the training data concern.

Configuring robots.txt for AI Crawlers

Your robots.txt file should include explicit directives for each AI crawler you want to manage. Place this file at the root of your domain.

Recommended robots.txt Configuration

# Standard search engine crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI crawlers - Citation-capable (ALLOW)
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Allow: /research/
Allow: /docs/
Allow: /comparisons/
Disallow: /account/
Disallow: /admin/
Disallow: /api/
Disallow: /internal/

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /blog/
Allow: /guides/
Allow: /research/
Allow: /docs/
Allow: /comparisons/
Disallow: /account/
Disallow: /admin/
Disallow: /api/

User-agent: Google-Extended
Allow: /blog/
Allow: /guides/
Allow: /research/
Allow: /docs/
Disallow: /account/
Disallow: /admin/

User-agent: Applebot-Extended
Allow: /blog/
Allow: /guides/
Allow: /research/
Disallow: /account/
Disallow: /admin/

# AI crawlers - Training only (BLOCK or selective)
User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# llms.txt reference
# See /llms.txt for structured content directory

Key Principles

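Be explicit: List each AI crawler individually rather than relying on wildcard rules. A crawler that matches a named group ignores the User-agent: * group entirely, and robots.txt is advisory, so not every AI crawler honors it.
Allow your best content: Direct AI crawlers to your highest-quality, most authoritative pages. Paths that match no rule are allowed by default, so the Allow lines above document intent; to restrict a crawler to those paths only, add Disallow: / to its group.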
Block sensitive paths: Keep account pages, admin areas, API endpoints, and internal tools blocked.
Update regularly: New AI crawlers launch frequently. Review and update your robots.txt quarterly.
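
Before deploying a robots.txt change, it helps to confirm that each crawler gets the access you intended. The sketch below uses Python's standard urllib.robotparser on an abbreviated copy of the configuration above; the helper names and the test paths are assumptions for illustration. Note that this parser applies the first matching rule in file order, while RFC 9309 specifies the longest match, so results can differ for configurations with overlapping Allow and Disallow paths.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated subset of the recommended configuration (for the demo only).
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /account/
Disallow: /admin/

User-agent: ChatGPT-User
Allow: /

User-agent: Bytespider
Disallow: /
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check_access(parser: RobotFileParser, cases):
    """Map each (user_agent, path) pair to True (allowed) or False (blocked)."""
    return {(agent, path): parser.can_fetch(agent, path) for agent, path in cases}


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    cases = [
        ("GPTBot", "/blog/post"),
        ("GPTBot", "/account/settings"),
        ("GPTBot", "/pricing"),  # matches no rule, so allowed by default
        ("Bytespider", "/blog/post"),
    ]
    for (agent, path), allowed in check_access(rules, cases).items():
        print(f"{agent:12} {path:20} {'allow' if allowed else 'block'}")
```

The /pricing case is the one to watch: a path that is not listed is still crawlable unless the group also contains a broader Disallow rule.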

The llms.txt Specification

The llms.txt file is a convention proposed in September 2024 that gives AI systems a structured directory of your most important content. Think of it as a curated sitemap specifically for AI engines.

File Location and Format

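Place the file at https://yourdomain.com/llms.txt. The file is plain Markdown: an H1 with the site or company name, a blockquote summary, and H2 sections that list URLs with short descriptions. The original proposal writes each entry as a Markdown link, [Page title](URL), optionally followed by a colon and notes; the templates in this chapter list the bare URL followed by a colon and a description.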

Complete llms.txt Template

# [Your Company Name]

> [One-sentence description of your company and what you are authoritative on.]

## About

- [Company overview page URL]: [Brief description of company, founding, team size]
- [Team/authors page URL]: [Key people with credentials and expertise areas]

## Core Documentation

### [Topic Cluster 1 Name]
- [URL]: [50-100 character description of what this page covers and why it is authoritative]
- [URL]: [Description]
- [URL]: [Description]

### [Topic Cluster 2 Name]
- [URL]: [Description]
- [URL]: [Description]

## Research and Data

- [URL]: [Description of original research, sample size, methodology note]
- [URL]: [Description]

## Guides and Tutorials

- [URL]: [Description with target audience noted]
- [URL]: [Description]

## Comparisons

- [URL]: [Description noting what is compared and the basis of comparison]
- [URL]: [Description]
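
If your content inventory already lives in a CMS or spreadsheet, you can generate the file rather than maintain it by hand. The following is a minimal sketch that renders the template's structure from plain data; the class and function names are hypothetical, and it flattens the ### topic clusters into ## sections to stay short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    url: str
    description: str


@dataclass
class Section:
    title: str
    entries: List[Entry] = field(default_factory=list)


def render_llms_txt(name: str, summary: str, sections: List[Section]) -> str:
    """Render an H1, a blockquote summary, and one H2 block per section."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for section in sections:
        lines.append(f"## {section.title}")
        lines.append("")
        for entry in section.entries:
            lines.append(f"- {entry.url}: {entry.description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    about = Section("About", [Entry("https://yourdomain.com/about", "Company overview")])
    print(render_llms_txt("Your Company", "One-sentence description.", [about]))
```

Generating the file from the same source that drives your sitemap makes it harder for the two to drift apart.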

Real-World llms.txt Example

# Acme Identity

> Enterprise Customer Identity and Access Management platform serving 400+ B2B SaaS companies since 2019.

## About

- https://acmeidentity.com/about: Company overview, SOC 2 Type II certified, founded 2019, 120 employees
- https://acmeidentity.com/team: Leadership team with combined 80+ years in identity and security

## Core Documentation

### Enterprise Authentication
- https://acmeidentity.com/guides/saml-vs-oauth-vs-oidc: Comprehensive protocol comparison with implementation guidance, updated quarterly, based on analysis of 500+ enterprise deployments
- https://acmeidentity.com/guides/passkeys-enterprise: Implementation guide for FIDO2 passkeys in enterprise SSO environments, includes code examples for 5 major frameworks
- https://acmeidentity.com/guides/mfa-best-practices: Multi-factor authentication deployment guide based on NIST SP 800-63B guidelines

### Customer Identity (CIAM)
- https://acmeidentity.com/guides/ciam-complete-guide: 8,000-word definitive guide to CIAM for SaaS, covers architecture, vendor selection, and implementation
- https://acmeidentity.com/comparisons/ciam-platforms-2026: Feature-by-feature comparison of 12 CIAM platforms with pricing data from vendor interviews

## Research and Data

- https://acmeidentity.com/research/authentication-benchmark-2026: Original benchmark study of authentication patterns across 500 enterprise SaaS companies, n=500, conducted Q1 2026
- https://acmeidentity.com/research/ciam-adoption-report: Annual CIAM market adoption report with data from 1,200 SaaS companies

## Comparisons

- https://acmeidentity.com/comparisons/auth0-vs-acme: Detailed comparison based on 50+ evaluation criteria including pricing, features, and compliance certifications
- https://acmeidentity.com/comparisons/okta-vs-acme: Head-to-head comparison with migration guide for Okta customers

llms.txt Best Practices

Practice	Why It Matters
Limit to 20-30 pages	AI engines treat this as a curated selection, not a comprehensive index. Quality over quantity.
Update monthly	Stale llms.txt files signal neglect. Add new content and remove outdated pages regularly.
Include descriptions	URLs alone are not useful. Descriptions help AI engines understand relevance without crawling.
Organize by topic	Topic clusters reinforce your domain authority in specific areas.
Note data sources	For research content, mention sample sizes and methodology. This builds trust.
Use absolute URLs	Always use full URLs, not relative paths. AI systems may process this file outside of browser context.
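
Several of these practices can be checked mechanically. The sketch below lints an llms.txt file written in this chapter's "- URL: description" style for three of them: absolute URLs, present descriptions, and the page limit. The function name, the message wording, and the default limit of 30 are assumptions; it does not understand Markdown-link entries.

```python
from typing import List
from urllib.parse import urlparse

MAX_ENTRIES = 30  # upper bound of the 20-30 page guidance


def lint_llms_txt(text: str, max_entries: int = MAX_ENTRIES) -> List[str]:
    """Return a list of human-readable problems found in llms.txt text."""
    problems = []
    entry_count = 0
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.startswith("- "):
            continue  # headings, summary, and blank lines are not entries
        entry_count += 1
        # Split on the first colon-space; the "://" in a URL has no space.
        url, _, description = line[2:].partition(": ")
        url = url.strip()
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"line {number}: URL is not absolute: {url}")
        if not description.strip():
            problems.append(f"line {number}: missing description")
    if entry_count > max_entries:
        problems.append(f"{entry_count} entries listed; limit is {max_entries}")
    return problems


if __name__ == "__main__":
    sample = "# Acme\n\n## Guides\n\n- /guides/mfa: MFA guide\n"
    for problem in lint_llms_txt(sample):
        print(problem)
```

Running a check like this in CI alongside a link checker covers most of the monthly maintenance work.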

The llms-full.txt Companion File

Some organizations also publish an llms-full.txt file that provides the full text content of their most important pages in a single, easily parseable document. This is useful for AI systems that prefer to ingest content in bulk.

llms-full.txt Structure

# [Your Company Name] - Full Content Index

## [Page Title 1]
URL: [Full URL]
Last Updated: [Date]
Author: [Name, Title]

[Full text content of the page, stripped of HTML/navigation/ads]

---

## [Page Title 2]
URL: [Full URL]
Last Updated: [Date]
Author: [Name, Title]

[Full text content of the page]

Tip

The llms-full.txt file can be large. Only include your 5-10 most authoritative pages. This file is particularly valuable for AI engines that do real-time retrieval (like Perplexity) because it provides clean, structured content without requiring the engine to parse your website's HTML.
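
Producing the "stripped of HTML/navigation/ads" text is the laborious part of building this file. The following is a rough sketch using Python's standard html.parser: it drops the contents of a few structural tags and emits one entry in the structure shown above. The skipped-tag list, the function names, and the sample author are assumptions, and each text node lands on its own line, so real pages usually need a more careful extractor.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as non-content (an assumption; adjust per site).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def render_entry(title: str, url: str, last_updated: str, author: str, html: str) -> str:
    """Render one llms-full.txt entry from a page's metadata and raw HTML."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    header = [f"## {title}", f"URL: {url}", f"Last Updated: {last_updated}", f"Author: {author}"]
    return "\n".join(header + ["", extractor.text()])


if __name__ == "__main__":
    page = "<nav>Home</nav><article><h1>SAML vs OAuth</h1><p>SAML is XML-based.</p></article>"
    print(render_entry("SAML vs OAuth", "https://yourdomain.com/guides/saml-vs-oauth",
                       "2026-01-15", "Jane Doe, Staff Engineer", page))
```

Join the rendered entries with a line containing only --- to match the structure above.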

Structured Data Beyond Schema.org

While Schema.org is the primary structured data vocabulary, additional signals help AI engines understand your content.

Open Graph and Meta Tags for AI

Ensure every content page includes these meta tags:

<!-- Standard meta -->
<meta name="description" content="[150-160 char description with key entities]">
<meta name="author" content="[Author full name]">
<meta name="date" content="[ISO 8601 date]">

<!-- Open Graph -->
<meta property="og:title" content="[Page title]">
<meta property="og:description" content="[Description matching meta description]">
<meta property="og:type" content="article">
<meta property="og:url" content="[Canonical URL]">
<meta property="og:site_name" content="[Brand name]">
<meta property="article:author" content="[Author profile URL]">
<meta property="article:published_time" content="[ISO 8601 datetime]">
<meta property="article:modified_time" content="[ISO 8601 datetime]">
<meta property="article:section" content="[Content category]">
<meta property="article:tag" content="[Topic tag 1]">
<meta property="article:tag" content="[Topic tag 2]">
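
A quick way to audit a page against this list is to parse its HTML and report which tags are absent. The sketch below uses the standard html.parser; the set of required keys is a subset chosen for illustration, and the function name and messages are hypothetical. It also flags an og:description that differs from the meta description, since the template above expects them to match.

```python
from html.parser import HTMLParser
from typing import Dict, List

# Subset of the tags listed above that this sketch treats as required.
REQUIRED_META = [
    "description", "author",
    "og:title", "og:description", "og:type", "og:url",
    "article:published_time", "article:modified_time",
]


class MetaCollector(HTMLParser):
    """Record the first content value for each meta name or property."""

    def __init__(self):
        super().__init__()
        self.meta: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        if key and key not in self.meta:
            self.meta[key] = attributes.get("content") or ""


def audit_meta_tags(html: str) -> List[str]:
    """Return problems: missing or empty required tags, mismatched descriptions."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    problems = [f"missing meta tag: {key}" for key in REQUIRED_META
                if not collector.meta.get(key)]
    description = collector.meta.get("description")
    og_description = collector.meta.get("og:description")
    if description and og_description and description != og_description:
        problems.append("og:description does not match meta description")
    return problems


if __name__ == "__main__":
    for problem in audit_meta_tags('<meta name="author" content="Jane Doe">'):
        print(problem)
```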

Canonical URLs

AI engines must be able to resolve your content to a single canonical URL. Duplicate content across multiple URLs confuses attribution.

<link rel="canonical" href="https://yourdomain.com/guides/saml-vs-oauth">

Ensure that:

Every page has a canonical tag
The canonical URL matches the URL in your llms.txt
Redirects resolve to the canonical URL (no redirect chains)
HTTP and HTTPS versions of each URL resolve to the same canonical URL
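
The first two checks can be automated once you have each page's HTML in hand. The following is a minimal sketch, assuming you have already fetched the pages listed in llms.txt into a dictionary keyed by URL; the names are hypothetical and the comparison is an exact string match, so a trailing slash or scheme difference counts as a mismatch.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class CanonicalFinder(HTMLParser):
    """Capture the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = (attributes.get("href") or "").strip()


def find_canonical(html: str) -> Optional[str]:
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonical


def check_canonicals(pages: Dict[str, str]) -> Dict[str, str]:
    """Map each llms.txt URL with a problem to a short description of it."""
    problems = {}
    for listed_url, html in pages.items():
        canonical = find_canonical(html)
        if canonical is None:
            problems[listed_url] = "no canonical tag"
        elif canonical != listed_url:
            problems[listed_url] = f"canonical is {canonical}"
    return problems


if __name__ == "__main__":
    pages = {"https://yourdomain.com/about": "<head><title>About</title></head>"}
    print(check_canonicals(pages))
```

Redirect behavior is not covered here; checking it requires real HTTP requests against your site.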

Implementation Checklist

Use this checklist to verify your AI crawler signals are properly configured.

Signal	Status Check	Tool to Verify
robots.txt AI directives	Each AI crawler has explicit rules	Fetch yourdomain.com/robots.txt directly
llms.txt published	File accessible at domain root	Fetch yourdomain.com/llms.txt
llms.txt content current	All listed URLs are live and recently updated	Manual review or link checker
Schema.org on all content pages	Article, Person, Organization schema present	Schema Markup Validator (validator.schema.org) or Google Rich Results Test
FAQPage schema on FAQ sections	Structured data wraps visible FAQ content	Schema Markup Validator (validator.schema.org)
Canonical URLs set	Every content page has a canonical tag	Screaming Frog or site crawl
Open Graph tags complete	og:title, og:description, og:type on all pages	Facebook Sharing Debugger or manual review
Meta dates accurate	published_time and modified_time match actual dates	Manual review of page source

Warning

Implementing these signals is not a one-time task. New crawlers appear, specifications change, and your content inventory shifts, so schedule a quarterly review of all AI crawler signals.