Implementing llms.txt and AI Crawler Signals
AI crawlers are different from traditional search engine crawlers. Googlebot, Bingbot, and similar crawlers index pages for search results. AI crawlers from OpenAI (GPTBot), Anthropic (ClaudeBot), Google (Google-Extended), and others ingest content to train models or power real-time retrieval systems. Controlling how these crawlers interact with your site is a critical GEO skill.
This chapter covers the technical implementation of llms.txt, robots.txt directives for AI crawlers, and the structured signals that help AI engines understand your content.
Understanding AI Crawlers
Before configuring anything, you need to know which crawlers exist and what they do.
| Crawler | Operator | Purpose | User-Agent String |
|---|---|---|---|
| GPTBot | OpenAI | Training data and ChatGPT Browse | GPTBot |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT | ChatGPT-User |
| ClaudeBot | Anthropic | Training data collection | ClaudeBot |
| Google-Extended | Google | Gemini and AI Overview training | Google-Extended |
| PerplexityBot | Perplexity | Real-time search and citation | PerplexityBot |
| Applebot-Extended | Apple | Apple Intelligence features | Applebot-Extended |
| Bytespider | ByteDance | Training data for various AI models | Bytespider |
| Meta-ExternalAgent | Meta | AI training across Meta products | Meta-ExternalAgent |
| Cohere-ai | Cohere | Enterprise AI training data | cohere-ai |
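To see which of these crawlers actually visit your site, you can match the tokens from the table against the User-Agent header in your access logs. The following is a minimal sketch: the function name and the sample header are illustrative assumptions, and some tokens (Google-Extended in particular) may only ever appear in robots.txt rather than in a request header.

```python
# Tokens copied from the crawler table; matching is a case-insensitive substring test.
AI_CRAWLER_TOKENS = {
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "ClaudeBot": "Anthropic",
    "Google-Extended": "Google",
    "PerplexityBot": "Perplexity",
    "Applebot-Extended": "Apple",
    "Bytespider": "ByteDance",
    "Meta-ExternalAgent": "Meta",
    "cohere-ai": "Cohere",
}


def identify_ai_crawler(user_agent: str):
    """Return (token, operator) for the first known token in the header, else None."""
    lowered = user_agent.lower()
    for token, operator in AI_CRAWLER_TOKENS.items():
        if token.lower() in lowered:
            return token, operator
    return None


# Hypothetical header value, used only to exercise the function.
sample = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
print(identify_ai_crawler(sample))
```

A substring match is easy to spoof, so treat the result as a hint for log analysis rather than proof of the requester's identity.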
The Crawler Permission Decision
You have a strategic choice to make. Blocking AI crawlers keeps your content out of training data but also reduces your chances of being cited. Allowing them means your content may be used for training, but it also increases citation opportunities.
For most B2B SaaS companies, the recommended approach is:
- Allow crawlers that power citation-capable products (GPTBot, ChatGPT-User, PerplexityBot, Google-Extended)
- Consider blocking crawlers that only collect training data without citation capability
- Always allow real-time browsing agents (ChatGPT-User, PerplexityBot) since these directly generate citations
Blocking GPTBot and PerplexityBot will significantly reduce your AI citation visibility. Only block these crawlers if you have specific legal or competitive reasons. The citation upside far outweighs the training data concern for most B2B content.
Configuring robots.txt for AI Crawlers
Your robots.txt file should include explicit directives for each AI crawler you want to manage. Place this file at the root of your domain.
Recommended robots.txt Configuration
# Standard search engine crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# AI crawlers - Citation-capable (ALLOW)
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Allow: /research/
Allow: /docs/
Allow: /comparisons/
Disallow: /account/
Disallow: /admin/
Disallow: /api/
Disallow: /internal/
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /blog/
Allow: /guides/
Allow: /research/
Allow: /docs/
Allow: /comparisons/
Disallow: /account/
Disallow: /admin/
Disallow: /api/
User-agent: Google-Extended
Allow: /blog/
Allow: /guides/
Allow: /research/
Allow: /docs/
Disallow: /account/
Disallow: /admin/
User-agent: Applebot-Extended
Allow: /blog/
Allow: /guides/
Allow: /research/
Disallow: /account/
Disallow: /admin/
# AI crawlers - Training only (BLOCK or selective)
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# llms.txt reference
# See /llms.txt for structured content directory
Key Principles
- Be explicit: List each AI crawler individually rather than using wildcard rules. AI crawlers do not always respect User-agent: * directives.
- Allow your best content: Direct AI crawlers to your highest-quality, most authoritative pages.
- Block sensitive paths: Keep account pages, admin areas, API endpoints, and internal tools blocked.
- Update regularly: New AI crawlers launch frequently. Review and update your robots.txt quarterly.
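Before deploying a robots.txt change, it can help to check how a parser interprets the rules for each crawler. The sketch below uses Python's standard urllib.robotparser on a shortened version of the configuration above; the helper name check_access and the sample paths are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the recommended configuration, inlined so the check runs offline.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /account/

User-agent: ChatGPT-User
Allow: /

User-agent: Bytespider
Disallow: /
"""


def check_access(robots_txt, agents, paths):
    """Map each (agent, path) pair to True (allowed) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(agent, path): parser.can_fetch(agent, path)
            for agent in agents for path in paths}


results = check_access(
    ROBOTS_TXT,
    ["GPTBot", "ChatGPT-User", "Bytespider"],
    ["/blog/post", "/account/settings"],
)
for (agent, path), allowed in results.items():
    print(f"{agent:14} {path:20} {'allow' if allowed else 'block'}")
```

Python's parser applies the first matching rule in file order, whereas some crawlers choose the most specific (longest) matching rule, so treat this as a sanity check rather than a guarantee of how any given crawler behaves.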
The llms.txt Specification
The llms.txt file is a convention proposed in late 2024 that gives AI systems a structured directory of your most important content. Think of it as a curated sitemap written specifically for AI engines.
File Location and Format
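Place the file at https://yourdomain.com/llms.txt. The file is written in Markdown: a title header, a one-line summary in a blockquote, and sections that list URLs with descriptions. Note that the proposal published at llmstxt.org writes each entry as a Markdown link followed by an optional description, in the form - [Title](URL): description. The templates in this chapter use a simplified URL: description form.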
Complete llms.txt Template
# [Your Company Name]
> [One-sentence description of your company and what you are authoritative on.]
## About
- [Company overview page URL]: [Brief description of company, founding, team size]
- [Team/authors page URL]: [Key people with credentials and expertise areas]
## Core Documentation
### [Topic Cluster 1 Name]
- [URL]: [50-100 character description of what this page covers and why it is authoritative]
- [URL]: [Description]
- [URL]: [Description]
### [Topic Cluster 2 Name]
- [URL]: [Description]
- [URL]: [Description]
## Research and Data
- [URL]: [Description of original research, sample size, methodology note]
- [URL]: [Description]
## Guides and Tutorials
- [URL]: [Description with target audience noted]
- [URL]: [Description]
## Comparisons
- [URL]: [Description noting what is compared and the basis of comparison]
- [URL]: [Description]
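If your content inventory already lives in a CMS or spreadsheet, the template above can be generated rather than maintained by hand. The following sketch renders the same simplified layout from plain data; the Entry and Section classes, the render_llms_txt function, and the example.com URLs are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    url: str
    description: str


@dataclass
class Section:
    title: str
    entries: list = field(default_factory=list)
    level: int = 2  # 2 renders "##", 3 renders "###" for topic clusters


def render_llms_txt(name, summary, sections):
    """Render the title, blockquote summary, and sections in the template's layout."""
    lines = [f"# {name}", f"> {summary}"]
    for section in sections:
        lines.append(f"{'#' * section.level} {section.title}")
        for entry in section.entries:
            lines.append(f"- {entry.url}: {entry.description}")
    return "\n".join(lines) + "\n"


doc = render_llms_txt(
    "Example Co",
    "Hypothetical identity platform used for illustration.",
    [
        Section("About", [Entry("https://example.com/about", "Company overview")]),
        Section("Core Documentation"),
        Section(
            "Enterprise Authentication",
            [Entry("https://example.com/guides/mfa", "MFA deployment guide")],
            level=3,
        ),
    ],
)
print(doc)
```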
Real-World llms.txt Example
# Acme Identity
> Enterprise Customer Identity and Access Management platform serving 400+ B2B SaaS companies since 2019.
## About
- https://acmeidentity.com/about: Company overview, SOC 2 Type II certified, founded 2019, 120 employees
- https://acmeidentity.com/team: Leadership team with combined 80+ years in identity and security
## Core Documentation
### Enterprise Authentication
- https://acmeidentity.com/guides/saml-vs-oauth-vs-oidc: Comprehensive protocol comparison with implementation guidance, updated quarterly, based on analysis of 500+ enterprise deployments
- https://acmeidentity.com/guides/passkeys-enterprise: Implementation guide for FIDO2 passkeys in enterprise SSO environments, includes code examples for 5 major frameworks
- https://acmeidentity.com/guides/mfa-best-practices: Multi-factor authentication deployment guide based on NIST SP 800-63B guidelines
### Customer Identity (CIAM)
- https://acmeidentity.com/guides/ciam-complete-guide: 8,000-word definitive guide to CIAM for SaaS, covers architecture, vendor selection, and implementation
- https://acmeidentity.com/comparisons/ciam-platforms-2026: Feature-by-feature comparison of 12 CIAM platforms with pricing data from vendor interviews
## Research and Data
- https://acmeidentity.com/research/authentication-benchmark-2026: Original benchmark study of authentication patterns across 500 enterprise SaaS companies, n=500, conducted Q1 2026
- https://acmeidentity.com/research/ciam-adoption-report: Annual CIAM market adoption report with data from 1,200 SaaS companies
## Comparisons
- https://acmeidentity.com/comparisons/auth0-vs-acme: Detailed comparison based on 50+ evaluation criteria including pricing, features, and compliance certifications
- https://acmeidentity.com/comparisons/okta-vs-acme: Head-to-head comparison with migration guide for Okta customers
llms.txt Best Practices
| Practice | Why It Matters |
|---|---|
| Limit to 20-30 pages | AI engines treat this as a curated selection, not a comprehensive index. Quality over quantity. |
| Update monthly | Stale llms.txt files signal neglect. Add new content and remove outdated pages regularly. |
| Include descriptions | URLs alone are not useful. Descriptions help AI engines understand relevance without crawling. |
| Organize by topic | Topic clusters reinforce your domain authority in specific areas. |
| Note data sources | For research content, mention sample sizes and methodology. This builds trust. |
| Use absolute URLs | Always use full URLs, not relative paths. AI systems may process this file outside of browser context. |
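Several of these practices can be checked mechanically. The sketch below lints a file written in this chapter's URL: description format for three of them: absolute URLs, non-empty descriptions, and the 30-page upper bound. The function name lint_llms_txt and the sample input are assumptions, and the linter does not attempt to judge freshness or topic organization.

```python
from urllib.parse import urlparse

MAX_ENTRIES = 30  # upper bound taken from the "Limit to 20-30 pages" practice


def lint_llms_txt(text):
    """Return a list of human-readable problems found in llms.txt entries."""
    problems = []
    entries = 0
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.startswith("- "):
            continue  # headers, blockquotes, and blank lines are not entries
        entries += 1
        # Split on colon-space so the colon inside "https://" is left alone.
        url, _, description = line[2:].partition(": ")
        url = url.strip()
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"line {number}: URL is not absolute: {url}")
        if not description.strip():
            problems.append(f"line {number}: missing description")
    if entries > MAX_ENTRIES:
        problems.append(f"{entries} entries listed; guidance is at most {MAX_ENTRIES}")
    return problems


# Hypothetical file contents with two deliberate mistakes.
sample = """# Acme Identity
- https://acmeidentity.com/about: Company overview
- /guides/mfa-best-practices: MFA deployment guide
- https://acmeidentity.com/team
"""
for problem in lint_llms_txt(sample):
    print(problem)
```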
The llms-full.txt Companion File
Some organizations also publish an llms-full.txt file that provides the full text content of their most important pages in a single, easily parseable document. This is useful for AI systems that prefer to ingest content in bulk.
llms-full.txt Structure
# [Your Company Name] - Full Content Index
## [Page Title 1]
URL: [Full URL]
Last Updated: [Date]
Author: [Name, Title]
[Full text content of the page, stripped of HTML/navigation/ads]
---
## [Page Title 2]
URL: [Full URL]
Last Updated: [Date]
Author: [Name, Title]
[Full text content of the page]
The llms-full.txt file can be large. Only include your 5-10 most authoritative pages. This file is particularly valuable for AI engines that do real-time retrieval (like Perplexity) because it provides clean, structured content without requiring the engine to parse your website's HTML.
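Producing the stripped page text is the only non-trivial step in building this file. One possible approach, sketched below with Python's standard html.parser, is to drop everything inside tags that usually hold navigation or scripts and keep the remaining text. The SKIP_TAGS set, the class name, and the page_to_entry helper are assumptions; real pages may need site-specific rules.

```python
from html.parser import HTMLParser

# Assumption: content inside these tags is navigation, chrome, or code, not article text.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def page_to_entry(title, url, last_updated, author, html):
    """Format one page in the llms-full.txt structure shown above."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    body = "\n".join(extractor.chunks)
    return (f"## {title}\nURL: {url}\nLast Updated: {last_updated}\n"
            f"Author: {author}\n{body}\n")


# Hypothetical page markup, used only to exercise the extractor.
html = ("<html><head><style>p{}</style></head><body><nav>Home</nav>"
        "<h1>SAML vs OAuth</h1><p>Body text.</p><footer>Copyright</footer></body></html>")
entry = page_to_entry("SAML vs OAuth", "https://example.com/guides/saml-vs-oauth",
                      "2026-01-15", "Jane Doe, Staff Engineer", html)
print(entry)
```

Entries produced this way can then be joined with the --- separator under the file's title line.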
Structured Data Beyond Schema.org
While Schema.org is the primary structured data vocabulary, additional signals help AI engines understand your content.
Open Graph and Meta Tags for AI
Ensure every content page includes these meta tags:
<!-- Standard meta -->
<meta name="description" content="[150-160 char description with key entities]">
<meta name="author" content="[Author full name]">
<meta name="date" content="[ISO 8601 date]">
<!-- Open Graph -->
<meta property="og:title" content="[Page title]">
<meta property="og:description" content="[Description matching meta description]">
<meta property="og:type" content="article">
<meta property="og:url" content="[Canonical URL]">
<meta property="og:site_name" content="[Brand name]">
<meta property="article:author" content="[Author profile URL]">
<meta property="article:published_time" content="[ISO 8601 datetime]">
<meta property="article:modified_time" content="[ISO 8601 datetime]">
<meta property="article:section" content="[Content category]">
<meta property="article:tag" content="[Topic tag 1]">
<meta property="article:tag" content="[Topic tag 2]">
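A quick way to audit a page against this list is to parse its head and report which tags are absent. The sketch below assumes a REQUIRED set chosen from the tags above (adjust it to your own policy) and uses hypothetical names throughout.

```python
from html.parser import HTMLParser

# Assumption: this subset of the tags above is treated as mandatory.
REQUIRED = {
    "description", "author",
    "og:title", "og:description", "og:type", "og:url",
    "article:published_time", "article:modified_time",
}


class MetaCollector(HTMLParser):
    """Record the content of each <meta> tag keyed by its name or property."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        if key and attributes.get("content"):
            self.found.setdefault(key, attributes["content"])


def missing_meta(html):
    """Return the required meta keys that are absent or empty, sorted by name."""
    collector = MetaCollector()
    collector.feed(html)
    return sorted(REQUIRED - set(collector.found))


# Hypothetical page head with only two of the required tags present.
page = """<head>
<meta name="description" content="Protocol comparison for enterprise SSO">
<meta property="og:title" content="SAML vs OAuth vs OIDC">
</head>"""
print(missing_meta(page))
```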
Canonical URLs
AI engines must be able to resolve your content to a single canonical URL. Duplicate content across multiple URLs confuses attribution.
<link rel="canonical" href="https://yourdomain.com/guides/saml-vs-oauth">
Ensure that:
- Every page has a canonical tag
- The canonical URL matches the URL in your llms.txt
- Redirects resolve to the canonical URL (no redirect chains)
- HTTP and HTTPS versions both resolve to the same canonical
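The first two checks in this list can be sketched as a comparison between each page's canonical tag and the URLs listed in llms.txt. The code below assumes you have already fetched the pages into a dictionary of URL to HTML and that llms.txt uses this chapter's URL: description entry format; every function name here is hypothetical. Redirect behavior is not covered, since that requires live HTTP requests.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Capture the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def llms_txt_urls(llms_txt):
    """Extract the URL part of every '- URL: description' entry."""
    return {line[2:].partition(": ")[0].strip()
            for line in llms_txt.splitlines() if line.startswith("- ")}


def canonical_problems(pages, llms_txt):
    """pages maps fetched URL -> HTML; returns a list of problem descriptions."""
    listed = llms_txt_urls(llms_txt)
    problems = []
    for fetched_url, html in pages.items():
        canonical = find_canonical(html)
        if canonical is None:
            problems.append(f"{fetched_url}: no canonical tag")
        elif canonical not in listed:
            problems.append(f"{fetched_url}: canonical {canonical} is not in llms.txt")
    return problems


# Hypothetical inputs used only to exercise the checks.
llms = "- https://yourdomain.com/guides/saml-vs-oauth: Protocol comparison\n"
pages = {
    "https://yourdomain.com/guides/saml-vs-oauth":
        '<link rel="canonical" href="https://yourdomain.com/guides/saml-vs-oauth">',
    "https://yourdomain.com/guides/mfa": "<title>MFA</title>",
}
print(canonical_problems(pages, llms))
```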
Implementation Checklist
Use this checklist to verify your AI crawler signals are properly configured.
| Signal | Status Check | Tool to Verify |
|---|---|---|
| robots.txt AI directives | Each AI crawler has explicit rules | Fetch yourdomain.com/robots.txt directly |
| llms.txt published | File accessible at domain root | Fetch yourdomain.com/llms.txt |
| llms.txt content current | All listed URLs are live and recently updated | Manual review or link checker |
| Schema.org on all content pages | Article, Person, Organization schema present | Google Rich Results Test |
| FAQPage schema on FAQ sections | Structured data wraps visible FAQ content | Schema.org Validator |
| Canonical URLs set | Every content page has a canonical tag | Screaming Frog or site crawl |
| Open Graph tags complete | og:title, og:description, og:type on all pages | Facebook Sharing Debugger or manual review |
| Meta dates accurate | published_time and modified_time match actual dates | Manual review of page source |
Implementing these signals is not a one-time task. New crawlers appear, specifications change, and your content inventory shifts, so schedule a quarterly review of all AI crawler signals.