Deepak Gupta

Implementation · GEO · 11 min read · last updated 2026-05-21

The llms.txt specification, in depth

What llms.txt actually is, what llms-full.txt adds, who's adopting it, and how to ship one that AI engines might actually use

llms.txt began as a community proposal in 2024 (llmstxt.org) and has become the de facto AI-era signal file, published by Stripe, Anthropic, Cloudflare, Vercel, and a growing list of content publishers. This guide covers what the spec actually says, what changes when you ship one, and how to write one that's useful rather than performative.

The minimum viable spec

llms.txt is a Markdown file served at the root of your domain (/llms.txt). The basic structure:

An H1 with the site name (the only strictly required element)
Optionally, a blockquote summarizing what the site is about
H2 sections grouping links to authoritative pages
Each link in the format - [Title](URL): one-line description

That's the whole spec. Everything else (sectioning style, content depth, how aggressive to be about which URLs to include) is convention.

Example minimal file:

# Example Site

> Vendor-neutral resource on topic X. Authored by Y.

## Core pages

- [Home](https://example.com/): What we are and what we cover.
- [Methodology](https://example.com/methodology/): How we evaluate and write.

## Guides

- [Guide 1](https://example.com/guides/one/): One-liner.
- [Guide 2](https://example.com/guides/two/): One-liner.
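
A file shaped like the example above can be sanity-checked with a few lines of Python. This is a minimal sketch, not a reference parser: the function name `parse_llms_txt`, the regular expression, and the returned dictionary layout are all assumptions made for illustration, and the sketch only understands the elements listed in this section.

```python
import re

# Matches "- [Title](URL): description"; the description part is optional.
LINK_RE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)


def parse_llms_txt(text: str) -> dict:
    """Parse an llms.txt body into a title, a summary, and sections of links."""
    title = None
    summary_lines = []
    sections = {}  # H2 heading -> list of (title, url, description)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif line.startswith("> ") and current is None:
            # Blockquote lines before the first H2 form the summary.
            summary_lines.append(line[2:].strip())
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                sections[current].append(
                    (match["title"], match["url"], match["desc"] or "")
                )
    if title is None:
        raise ValueError("llms.txt needs an H1 with the site name")
    return {
        "title": title,
        "summary": " ".join(summary_lines),
        "sections": sections,
    }


SAMPLE = """# Example Site

> Vendor-neutral resource on topic X. Authored by Y.

## Core pages

- [Home](https://example.com/): What we are and what we cover.
- [Methodology](https://example.com/methodology/): How we evaluate and write.
"""

parsed = parse_llms_txt(SAMPLE)
print(parsed["title"], list(parsed["sections"]))
```

Running a check like this in CI catches the most common breakage (a missing H1 or malformed link lines) before the file ships.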

What llms-full.txt adds

llms-full.txt is the companion corpus file: instead of an index of URLs, it concatenates the full Markdown content of priority pages into a single file. The intended use is single-fetch corpus ingestion by AI engines that prefer one large request over N small ones.

There is no formal spec for llms-full.txt, and published files vary in format. Beyond per-page headers identifying the URL, title, and last-updated date of each included page, the content is unstructured. A common convention:

# Site Name: full content

[brief intro and ownership note]

================================================================================
# Guide Title
URL: https://example.com/guides/one/
Last updated: 2026-05-21

[full Markdown of the guide]

================================================================================
# Next page...

Practical considerations: keep llms-full.txt under a few megabytes (some engines may truncate or skip larger files), include only your highest-authority content rather than every page on the site, and regenerate the file on every content change via your build pipeline.
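
The convention and the size advice can be sketched in Python as follows. Everything named here is an assumption for illustration: the `Page` record, the `build_llms_full` function, and especially the 2 MB `MAX_BYTES` budget, which is just one reading of "a few megabytes" and not a limit published by any engine.

```python
from dataclasses import dataclass

SEPARATOR = "=" * 80
MAX_BYTES = 2 * 1024 * 1024  # assumed budget, not a documented engine limit


@dataclass
class Page:
    title: str
    url: str
    last_updated: str  # ISO date, e.g. "2026-05-21"
    body: str          # full Markdown of the page


def build_llms_full(site_name: str, intro: str, pages: list[Page]) -> str:
    """Concatenate pages into one llms-full.txt body using the convention above."""
    parts = [f"# {site_name}: full content", "", intro.strip(), ""]
    for page in pages:
        parts += [
            SEPARATOR,
            f"# {page.title}",
            f"URL: {page.url}",
            f"Last updated: {page.last_updated}",
            "",
            page.body.strip(),
            "",
        ]
    text = "\n".join(parts)
    size = len(text.encode("utf-8"))
    if size > MAX_BYTES:
        # Fail the build instead of silently shipping an oversized file.
        raise ValueError(f"llms-full.txt is {size} bytes, over the {MAX_BYTES} byte budget")
    return text


pages = [Page("Guide Title", "https://example.com/guides/one/", "2026-05-21", "Guide body.")]
full_text = build_llms_full("Example Site", "Maintained by the Example team.", pages)
```

Raising on an oversized file is a design choice: it forces a decision about which pages to drop rather than letting the corpus grow unnoticed.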

Who's actually using it

The honest answer in 2026: adoption among publishers is rising rapidly, but formal endorsement from the major AI engines is still pending. Anthropic publishes one (which is itself a notable adoption signal). OpenAI, Google, Microsoft, and Perplexity have not formally endorsed the spec.

The asymmetric upside makes shipping one a no-regret move: the cost is a few KB of static text and a build step, and the potential benefit is preferential crawling by engines that adopt the pattern. Many publishers ship both files for this reason.

The argument against is small: primarily, the maintenance burden if your URL structure changes frequently. For static-site or content-managed sites with stable URLs, the maintenance cost is essentially zero.

What to actually include

A useful llms.txt is curated, not exhaustive. The mistake most first-time publishers make is treating llms.txt like a sitemap and including every URL on the site. The result is noise that defeats the purpose.

The right approach: think of llms.txt as the answer to "if an AI engine could only crawl 20-50 URLs from my site, which would they be?" Include:

The home page and any major hub pages
Your methodology page (if you have one, and you should, for AI engine trust signals)
Pillar guides and authoritative explainers
Glossary entries if you have them
Major reference content
Active vendor or product profiles if those are your authority surface

Don't include:

Blog post archives in bulk (pick the 5-10 best, skip the rest)
Listicles that aren't differentiated (the ones that are, keep)
Auto-generated pages (tag pages, paginated archives, etc.)
Outdated content you wouldn't want quoted
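
The exclusion rules above lend themselves to a simple pre-filter. The sketch below is a rough first pass under stated assumptions: the path patterns (`/tag/`, `/category/`, `/page/N/`) and the `min_year` cutoff are hypothetical and need adjusting to your own URL scheme. The final picks still need human judgment, since "differentiated" and "worth quoting" are editorial calls.

```python
import re
from urllib.parse import urlparse

# Hypothetical path patterns for auto-generated pages; adjust to your site.
EXCLUDED_PATTERNS = [
    re.compile(r"^/tags?/"),       # tag pages
    re.compile(r"^/category/"),    # category listings
    re.compile(r"/page/\d+/?$"),   # paginated archives
]


def is_candidate(url: str, last_updated_year: int, min_year: int = 2024) -> bool:
    """Return True if a URL passes the mechanical exclusion rules."""
    path = urlparse(url).path
    if any(pattern.search(path) for pattern in EXCLUDED_PATTERNS):
        return False
    # Drop content old enough that you would not want it quoted.
    return last_updated_year >= min_year


urls = [
    ("https://example.com/guides/one/", 2026),
    ("https://example.com/tag/seo/", 2026),
    ("https://example.com/blog/page/3/", 2025),
    ("https://example.com/guides/old/", 2022),
]
shortlist = [url for url, year in urls if is_candidate(url, year)]
```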

Common mistakes

Treating it like a sitemap. Sitemap.xml is for traditional crawlers and indexers; llms.txt is for AI engines making selection decisions. The right level of detail is "this is what I want quoted" not "this is everything I publish."

Skipping the H2 section structure. Some files are just an unstructured list of URLs. The H2 sections give engines semantic grouping (Core, Guides, Reference, Vendor profiles, etc.) that helps with selection.

Stale dating. llms.txt itself carries no dates, but the pages it links to should. AI engines weight freshness heavily; an llms.txt full of links to 2022 content signals a stale corpus.

Missing the blockquote summary. The opening blockquote is what an engine reads to decide "is this site authoritative on what I'm being asked about?" Without it, the engine has to infer your site's domain from the link titles. Write a clear one-paragraph statement of what you cover and who authors it.

No llms-full.txt. Publishing only the index without the corpus misses the asymmetric upside. Both files cost almost nothing to generate; ship both.

How to integrate it into your build

For static sites with structured content (Markdown sources or a headless CMS), the build pipeline can generate both files from the same source data. Most of the leading portals in our reference set have a 50-100 line script that:

Walks the content directory
Filters to "authoritative" pages (per a frontmatter flag or path convention)
Sorts by section and date
Emits llms.txt with the structured H2 sections
Emits llms-full.txt with the concatenated Markdown bodies

Wire it into the same build step that produces your static export. Update both files on every content deploy.
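
A script of the kind described above might look like the following for llms.txt. Treat it as a sketch under assumptions rather than anyone's production script: the `llms: true` frontmatter flag, the `section`, `title`, `description`, and `date` keys, the URL-from-path rule, and the hand-rolled `key: value` frontmatter reader are all hypothetical. A real pipeline would likely use a proper YAML parser.

```python
from collections import defaultdict
from pathlib import Path


def read_frontmatter(text: str) -> tuple[dict, str]:
    """Split a Markdown file into simple `key: value` frontmatter and body."""
    if not text.startswith("---\n"):
        return {}, text
    head, sep, body = text[4:].partition("\n---\n")
    if not sep:
        return {}, text
    meta = {}
    for line in head.splitlines():
        key, colon, value = line.partition(":")
        if colon:
            meta[key.strip()] = value.strip()
    return meta, body


def build_llms_txt(content_dir: Path, site_name: str, summary: str, base_url: str) -> str:
    """Walk content_dir, keep flagged pages, and emit an llms.txt body."""
    sections = defaultdict(list)
    for path in sorted(content_dir.rglob("*.md")):
        meta, _body = read_frontmatter(path.read_text(encoding="utf-8"))
        if meta.get("llms") != "true":  # hypothetical "authoritative" flag
            continue
        slug = path.relative_to(content_dir).with_suffix("").as_posix()
        sections[meta.get("section", "Pages")].append({
            "title": meta.get("title", slug),
            "url": f"{base_url.rstrip('/')}/{slug}/",
            "description": meta.get("description", ""),
            "date": meta.get("date", ""),
        })
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for section in sorted(sections):
        lines += [f"## {section}", ""]
        # Newest first within a section; ISO dates sort correctly as strings.
        for page in sorted(sections[section], key=lambda p: p["date"], reverse=True):
            lines.append(f"- [{page['title']}]({page['url']}): {page['description']}")
        lines.append("")
    return "\n".join(lines)
```

The build step would then write the returned string to the export directory, for example with `Path("public/llms.txt").write_text(...)`, where `public/` stands in for whatever output folder your generator uses. Emitting llms-full.txt can reuse the same filtered page list and the bodies that this sketch discards.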

Practical checklist

[ ] llms.txt at /llms.txt with H1, blockquote, H2 sections, and curated link list
[ ] llms-full.txt at /llms-full.txt with concatenated Markdown of priority pages
[ ] Both files updated on every content build
[ ] Authoritative pages selected (a curated few dozen URLs, not the whole site)
[ ] Section structure makes selection easier for AI engines (Methodology, Guides, Reference, Glossary, etc.)
[ ] Files validate as Markdown
[ ] BingBot, ClaudeBot, OAI-SearchBot, PerplexityBot, GPTBot all allowed in robots.txt
[ ] Updates announced in your sitemap.xml's lastmod (signals freshness to traditional crawlers)
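
The robots.txt item is easy to verify automatically. Here is a minimal sketch using Python's standard-library `urllib.robotparser`; the function name `blocked_crawlers` and the sample robots.txt are illustrative assumptions, and the parser's matching rules may differ in edge cases from how each crawler actually interprets the file.

```python
from urllib.robotparser import RobotFileParser

# User agents named in the checklist above.
AI_CRAWLERS = ["BingBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "GPTBot"]


def blocked_crawlers(robots_txt: str, url: str = "https://example.com/llms.txt") -> list[str]:
    """Return the crawlers from AI_CRAWLERS that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]


# Illustrative robots.txt that blocks one crawler and allows everyone else.
SAMPLE_ROBOTS = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

blocked = blocked_crawlers(SAMPLE_ROBOTS)
print(blocked)
```

A build could fail whenever this list is non-empty, which turns the checklist item into an enforced rule.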

That's the whole discipline. The spec is small enough to ship in an afternoon. The maintenance cost is small enough to forget about. The upside is real, asymmetric, and growing as more engines incorporate the signal.
