
Measurement · AEO + GEO · 14 min read · last updated 2026-06-30

The AEO and GEO experimentation roadmap: a test-and-learn system for AI visibility

A repeatable prompt-bank, KPI, and cadence system for measuring and improving citation share across ChatGPT, Claude, Perplexity, and Gemini

An AEO and GEO experimentation program is a structured, repeatable test-and-learn loop for improving your citation share across AI engines. You fix a bank of representative prompts, sample each engine on a schedule, record who gets cited, change one input at a time, and remeasure. The point is a system, not a one-off audit.

The reason this needs to be a system rather than a spot check is that AI answers are non-deterministic and drift week to week. A single lookup tells you almost nothing. A disciplined loop, run against a stable prompt bank on a fixed cadence, is the only way to separate real movement from noise.

Why experimentation is different in AI search

Traditional SEO rank tracking assumes a stable ground truth: query a keyword, read a ranked list, and the list is close to identical if you re-run it minutes later. AI search breaks that assumption. The same prompt sent to the same model twice can return a different set of recommended brands, in a different order, with different citations.

SparkToro's research, reported second-hand by Search Engine Land, put a number on this: there is "less than a 1 in 100 chance that any AI tool, if asked the same question 100 times, will give the same list of brands in any two responses" (searchengineland.com). That is the central design constraint. If two runs of one prompt rarely agree, then one measurement is not a measurement; it is a sample of size one from a noisy distribution.

Three consequences follow:

  • Sample, do not spot-check. You need many prompts and repeat runs so the signal averages out. Coverage of intents beats volume on any single query.
  • Track distributions, not positions. The useful unit is share across a query set over time, not a single rank on a single query.
  • Expect drift. Month-over-month citation drift of 40-60% is normal (authoritytech.io). A 20% swing on one query is inside the noise band; do not treat it as a result.
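
To make the sampling point concrete, here is a minimal Python sketch, assuming each run of a prompt has already been reduced to a list of brand names. The function names and the sample data are hypothetical. The Wilson score interval is a standard way to put an uncertainty band on a proportion; it is an illustration chosen here, not something the cited sources prescribe.

```python
import math


def presence_rate(runs, brand):
    """Return (hits, n): how many of the n runs mention `brand`."""
    hits = sum(1 for brands in runs if brand in brands)
    return hits, len(runs)


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (about 95% coverage at z=1.96)."""
    if n == 0:
        raise ValueError("need at least one run")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical data: four repeat runs of one prompt on one engine.
runs = [["Acme", "Beta"], ["Beta"], ["Acme"], ["Gamma"]]
hits, n = presence_rate(runs, "Acme")
low, high = wilson_interval(hits, n)
print(f"Acme present in {hits}/{n} runs, plausible rate {low:.2f} to {high:.2f}")
```

With a single run in which the brand appears, the same interval spans roughly 0.21 to 1.0, which is the numerical form of "one measurement is not a measurement."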

The macro backdrop makes this worth the effort. Similarweb data reported by Search Engine Land found that 68% of US Google searches ended without a click in the first four months of 2026 (up from 60.45% in 2024), and that AI Overviews, appearing on 20%+ of searches, cut click-through by roughly 60% (searchengineland.com). Visibility inside the answer, not just the blue link below it, is now the thing to measure.

Building a prompt bank

The prompt bank is the instrument. Everything downstream (KPIs, cadence, A/B tests) reads off it, so it has to be representative and stable. Build 40-120 prompts per language, and prioritize coverage of intent over raw volume (ailabsaudit.com).

Spread the bank across six intent types:

  • Branded: prompts that name your company or product directly.
  • Non-branded: category questions where you want to be surfaced without being named ("best tools for X").
  • Comparative: "X vs Y" and "alternatives to Z" prompts.
  • Problem-led: the underlying pain, phrased as a user would ("how do I stop account takeover").
  • Persona-led: the same problems voiced by distinct buyers (a CISO, a founder, a developer).
  • Geographic: locale-specific variants where market and language matter.

A practical way to generate the bank: write a one-page scoping note (your category, top competitors, core problems, target personas, priority geographies) and have an LLM expand it into candidate prompts across the six intents. Then edit by hand. The LLM gets you breadth fast; your judgment removes the prompts no real buyer would ever type.
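
One way to keep the hand-edited bank honest is to store each prompt with its intent tag and check coverage before every sweep. The sketch below is a minimal illustration, assuming a simple in-memory list; the `Prompt` fields, the intent identifiers, and the `min_per_intent` threshold are all hypothetical choices, not part of any cited method.

```python
from collections import Counter
from dataclasses import dataclass

# The six intent types described above, as hypothetical identifiers.
INTENTS = (
    "branded",
    "non_branded",
    "comparative",
    "problem_led",
    "persona_led",
    "geographic",
)


@dataclass(frozen=True)
class Prompt:
    prompt_id: str
    text: str
    intent: str
    language: str = "en"


def coverage_report(bank, min_per_intent=3):
    """Count prompts per intent and list the intents that fall below the threshold."""
    counts = Counter(p.intent for p in bank)
    unknown = sorted(set(counts) - set(INTENTS))
    if unknown:
        raise ValueError(f"unknown intents: {unknown}")
    per_intent = {intent: counts.get(intent, 0) for intent in INTENTS}
    thin = [intent for intent in INTENTS if per_intent[intent] < min_per_intent]
    return per_intent, thin


bank = [
    Prompt("p1", "best tools for X", "non_branded"),
    Prompt("p2", "X vs Y", "comparative"),
    Prompt("p3", "how do I stop account takeover", "problem_led"),
]
per_intent, thin = coverage_report(bank, min_per_intent=1)
print(per_intent)
print("under-covered intents:", thin)
```

Freezing the dataclass and keying on a stable `prompt_id` makes it harder to edit a prompt silently between sweeps, which would break comparability.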

Run two passes per model, because the two modes answer from different places:

  • Native / parametric pass: the model answers from its trained weights, with browsing off. This tells you how the model "remembers" your category.
  • Web / browsing pass: the model retrieves and grounds against live sources. This tells you which pages get cited right now.

The gap between the two passes is itself a finding. Strong native presence with weak web citation means the model knows you but is not grounding on your current pages; the reverse means your pages are citable but the model's baseline knowledge does not include you.
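
The reading of the gap can be written down as a small rule so it is applied the same way every sweep. This is a sketch under stated assumptions: the two inputs are presence rates between 0 and 1 for one brand on one model, and the 0.5 cut-off for "strong" is an arbitrary, hypothetical threshold you would tune to your own bank.

```python
def classify_gap(native_rate, web_rate, threshold=0.5):
    """Label the native-vs-web gap for one brand on one model.

    Rates are fractions of prompts (0..1) on which the brand appears in each pass.
    `threshold` is a hypothetical cut-off for calling a pass "strong".
    """
    for rate in (native_rate, web_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be between 0 and 1")
    native_strong = native_rate >= threshold
    web_strong = web_rate >= threshold
    if native_strong and not web_strong:
        return "known_but_not_grounded"  # model knows you, current pages not cited
    if web_strong and not native_strong:
        return "citable_but_not_known"  # pages get cited, baseline knowledge lacks you
    if native_strong and web_strong:
        return "strong_in_both"
    return "weak_in_both"


print(classify_gap(native_rate=0.7, web_rate=0.2))
print(classify_gap(native_rate=0.1, web_rate=0.6))
```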

A useful anchor from practice: teams maintain a monthly library of roughly 50-100 queries to compute a citation rate over time (stackmatix.com). Mike King's AI Search Manual frames the broader measurement discipline as "relevance engineering," and ships prompt recipes for retrieval simulation and a citation tracker to operationalize exactly this loop (ipullrank.com).

The model set to test

Test across engines, because presence does not transfer between them. A brand cited heavily by Perplexity can be invisible in Gemini. The core set:

  • ChatGPT
  • Claude
  • Perplexity
  • Gemini

Add Grok and Mistral (and Llama-based assistants) where your geography or audience warrants it. Weight your effort by where your buyers actually are. Conductor's 2026 benchmarks, drawn from 13,770 domains and 3.5M prompts, found ChatGPT drives 87.4% of AI referral traffic (conductor.com), so ChatGPT earns the most measurement attention, but the others still shape perception even when they send less traffic.

What to measure: the KPIs

Pick a small, honest KPI set and hold it stable. The headline metric is AI Share of Voice.

AI Share of Voice % = (sum of weighted scores for your brand) / (sum of all brands' weighted scores), computed over your query set, across engines, over a time window.

The "weighted score" is where you encode what matters to you: a citation, a recommendation, or a top-of-list mention can each carry a different weight. What matters is that the weighting is written down and applied identically every run, so movement reflects the world and not a change in your own accounting.
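
The formula above can be sketched in a few lines of Python. The weights below are hypothetical placeholders, as are the observation fields (`brand`, `kind`); the only thing taken from the formula is the ratio of your weighted score to the sum of all brands' weighted scores.

```python
from collections import defaultdict

# Hypothetical weighting; write yours down once and apply it to every run.
WEIGHTS = {"citation": 1.0, "recommendation": 2.0, "top_of_list": 3.0}


def share_of_voice(observations, brand, weights=WEIGHTS):
    """AI Share of Voice % for `brand` over a list of observations.

    Each observation is a dict with a "brand" and a "kind" key, collected over
    the query set, engines, and time window being reported. An unrecognised
    "kind" raises KeyError rather than being silently scored as zero.
    """
    totals = defaultdict(float)
    for obs in observations:
        totals[obs["brand"]] += weights[obs["kind"]]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return 0.0
    return 100.0 * totals.get(brand, 0.0) / grand_total


observations = [
    {"brand": "Acme", "kind": "citation"},
    {"brand": "Acme", "kind": "top_of_list"},
    {"brand": "Beta", "kind": "recommendation"},
    {"brand": "Beta", "kind": "citation"},
    {"brand": "Gamma", "kind": "citation"},
]
print(share_of_voice(observations, "Acme"))  # 4 of 8 weighted points -> 50.0
```

Keeping the weights in one named constant is a deliberate choice: a change to the accounting then shows up as a visible code change rather than as an unexplained jump in the trend line.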

KPI | What it answers | How to read it
Citation share | Of all sources cited on your query set, what fraction are yours? | Rises as your pages become groundable
AI Share of Voice % | Your weighted presence vs all brands, per formula above | The headline trend line
Presence / mention rate | On what % of prompts do you appear at all? | Coverage, before quality
Average position / recommendation | When present, how prominently? | Depth of presence
Sentiment | Are you described positively, neutrally, negatively? | Quality of presence
Cross-engine consensus | On how many engines do you appear for the same intent? | Durability across the field
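
Two of these KPIs reduce to simple counting once the raw answers are parsed. The sketch below is illustrative only: it assumes you already have the cited URLs for a sweep and, per engine, the set of brands mentioned for one intent. The function names, the `example.com` domain, and the sample data are hypothetical.

```python
from urllib.parse import urlparse


def citation_share(cited_urls, own_domain):
    """Fraction of cited sources whose host is `own_domain` or one of its subdomains."""
    hosts = [(urlparse(url).hostname or "").lower() for url in cited_urls]
    if not hosts:
        return 0.0
    own = sum(1 for h in hosts if h == own_domain or h.endswith("." + own_domain))
    return own / len(hosts)


def cross_engine_consensus(mentions_by_engine, brand):
    """Number of engines on which `brand` appears for the same intent."""
    return sum(1 for brands in mentions_by_engine.values() if brand in brands)


cited = [
    "https://www.example.com/guide",
    "https://other.org/review",
    "https://example.com/pricing",
    "https://notexample.com/blog",
]
print(citation_share(cited, "example.com"))  # 2 of 4 sources -> 0.5

mentions = {
    "ChatGPT": {"Acme", "Beta"},
    "Claude": {"Beta"},
    "Perplexity": {"Acme"},
    "Gemini": set(),
}
print(cross_engine_consensus(mentions, "Acme"))  # appears on 2 engines
```

Matching on the parsed hostname rather than a substring of the URL avoids counting look-alike domains such as `notexample.com` as your own.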

An emerging label for the headline metric is "Share of Model." Aleyda Solis's three-layer framework is a useful way to organize the whole set: Presence (do you show up, accurately), Readiness (are the structures in place to be cited), and Business Impact (is value created) (learningaisearch.com). Rand Fishkin's guidance is consistent: track brand mention volume and share of voice inside LLM responses, not just rankings.

Cadence

Match measurement frequency to how fast each layer actually moves. Over-sampling burns budget on noise; under-sampling misses drift.

Cadence | Scope | Purpose
Weekly | 10-15 priority queries | Early-warning on high-value intents
Monthly | Full prompt bank sweep | The trend line of record
Quarterly | Strategic review | Reset the bank, competitors, and bets

Read the monthly full sweep as your source of truth. Because citation drift runs 40-60% month over month (authoritytech.io), never draw a conclusion from a single week's wobble on a single query. The weekly cut is a smoke alarm, not a scoreboard. The quarterly review is where you retire stale prompts, add new competitors, and re-scope the bank so it keeps tracking the market rather than last quarter's market.
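
If the runs are automated, the three cadences can be expressed as a small scheduling rule. This is a sketch with assumed conventions that the text does not specify: the weekly cut runs on Mondays, the monthly sweep on the first of the month, and the quarterly review on the first day of January, April, July, and October.

```python
from datetime import date


def runs_due(day):
    """Return which cadences are due on `day` under the assumed conventions."""
    due = []
    if day.weekday() == 0:  # Monday
        due.append("weekly")
    if day.day == 1:
        due.append("monthly")
        if day.month in (1, 4, 7, 10):
            due.append("quarterly")
    return due


print(runs_due(date(2026, 6, 1)))  # a Monday and the first of the month
print(runs_due(date(2026, 7, 1)))  # first day of a quarter
```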

Running a change-and-remeasure loop honestly

This is the part that demands intellectual honesty. The loop is simple in shape: baseline, change one thing, wait, remeasure, keep or revert. The hard part is not overclaiming.

Treat content A/B testing for citation as emerging practice, not settled science. There is no rigorous, peer-reviewed protocol proving that a specific formatting change causes a citation lift; current evidence is anecdotal and observational (braintrust.dev, flow-ai.com). Combine that with 40-60% baseline drift and you cannot cleanly attribute a small change to a small edit. So:

  • Change one variable at a time, and make it a meaningful one (restructuring a page for extractability, adding a comparison table, fixing attribution), not a cosmetic tweak.
  • Compare against the drift band. A result has to clear the noise floor to count. If your edit moves a metric less than typical month-over-month drift, you have not shown anything.
  • Give it realistic time. Long-tail queries can respond in 30-60 days; competitive head terms take 3-6 months (authoritytech.io). Judging a head-term test at three weeks guarantees a false read.
  • Write down the hypothesis before you look. Retrofitting a story onto a random swing is the easiest way to fool yourself in a non-deterministic system.
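
The "compare against the drift band" rule can be made mechanical. The sketch below assumes drift is expressed as a relative change in the metric between two sweeps, and that `drift_band` is the typical month-over-month drift you have observed on your own bank (for example 0.4 for 40%). Both the function name and that definition of drift are assumptions for illustration.

```python
def clears_drift_band(baseline, after, drift_band):
    """Return (cleared, relative_change) for a metric measured before and after an edit.

    `drift_band` is the typical relative month-over-month drift, e.g. 0.4 for 40%.
    A change only counts if its magnitude exceeds that band.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a relative change")
    if drift_band < 0:
        raise ValueError("drift_band must be non-negative")
    relative_change = (after - baseline) / baseline
    return abs(relative_change) > drift_band, relative_change


# Hypothetical Share of Voice readings, in percent.
print(clears_drift_band(baseline=10.0, after=12.0, drift_band=0.4))  # inside the band
print(clears_drift_band(baseline=10.0, after=16.0, drift_band=0.4))  # clears the band
```

Clearing the band is a necessary condition, not proof of causation; it only says the movement is larger than the noise you usually see.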

The honest framing to hold onto: you are steadily improving groundable, citation-worthy pages and watching a noisy aggregate trend upward over months. You are not proving single-edit causation on a single query. See citation-worthy content patterns for what to actually change between measurements.

Connecting to conversions

Visibility only matters if it ties to business outcomes, so the last piece of the system connects the prompt bank to revenue. The attribution stack has three layers (stackmatix.com):

  • Automated citation tracking across engines (your prompt-bank runs).
  • GA4 custom channel groups that isolate AI-referred traffic by source (ChatGPT, Perplexity, Gemini, and the rest), so AI referrals stop hiding inside "Direct" or "Referral."
  • A monthly prompt audit feeding both, then fractional attribution of AI-referred sessions to MQLs, SQLs, and revenue.
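
The source-isolation step can be prototyped outside GA4 before you configure the channel groups. The sketch below is not GA4 configuration; it only shows the kind of hostname matching such a rule would encode, applied to raw referrer URLs. The hostname patterns are assumptions to verify against your own referral reports, and `classify_referrer` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse

# Assumed referrer hostnames per AI source; check these against your own data.
AI_SOURCES = {
    "ChatGPT": re.compile(r"(^|\.)(chatgpt\.com|chat\.openai\.com)$"),
    "Perplexity": re.compile(r"(^|\.)perplexity\.ai$"),
    "Gemini": re.compile(r"(^|\.)gemini\.google\.com$"),
    "Claude": re.compile(r"(^|\.)claude\.ai$"),
}


def classify_referrer(referrer_url):
    """Return the AI source name for a referrer URL, or None if it is not AI-referred."""
    host = (urlparse(referrer_url).hostname or "").lower()
    for source, pattern in AI_SOURCES.items():
        if pattern.search(host):
            return source
    return None


for url in (
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=x",
    "https://www.google.com/",
):
    print(url, "->", classify_referrer(url))
```

Sessions with no referrer at all cannot be classified this way, which is one reason some AI-driven visits still end up under "Direct."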

Set expectations honestly on volume. AI referral is still small: Conductor's 2026 benchmarks put it at 1.08% of all traffic, growing roughly 1% per month (conductor.com). It is a fast-growing sliver, not a flood.

The counterweight is that AI-referred traffic is widely reported to convert at higher rates. Vendor figures circulate (AI-referred conversion near 14% versus around 3% for Google search; B2B AEO multiples of 4x to 20x; one platform citing a 4.4x average). Treat all of these as vendor-sourced and unverified. Use them to justify measuring the channel, never to forecast pipeline. The defensible position: the channel is small today, plausibly high-intent, and worth instrumenting precisely so you can measure its real conversion rate on your own data rather than borrowing a vendor's.

A 90-day rollout plan

A concrete, sequenced checklist to stand the system up.

Days 1-30: instrument.

  • Write the one-page scoping note (category, competitors, problems, personas, geographies).
  • Generate 40-120 prompts across the six intents with an LLM; hand-edit down to a stable bank.
  • Fix the model set (ChatGPT, Claude, Perplexity, Gemini; add Grok / Mistral by geography).
  • Define the KPI set and write down the AI Share of Voice weighting.
  • Run the first full baseline: two passes (native and web) per model, across the whole bank.
  • Stand up GA4 custom channel groups isolating AI-referred traffic by source.

Days 31-60: establish rhythm.

  • Start the weekly cut on 10-15 priority queries.
  • Run the second monthly full sweep; compare to baseline and record the drift band you observe on your bank.
  • Identify the highest-value intents where you are absent or under-cited.
  • Ship the first change-and-remeasure test: one variable, one hypothesis written before you look, on tail queries where a 30-60 day read is realistic.

Days 61-90: read and connect.

  • Run the third monthly sweep; you now have three points and can distinguish trend from noise.
  • Judge (or extend) the first test against the drift band, not against zero.
  • Wire fractional attribution from AI-referred sessions to MQLs / SQLs.
  • Hold the first quarterly review: retire stale prompts, add competitors, re-scope the bank, and decide the next quarter's bets.

From there the loop simply repeats: weekly smoke checks, monthly truth, quarterly re-scope, one honest test at a time. For the tooling that automates the tracking, see choosing an AI visibility tool; for the metric definitions in depth, see measuring AI visibility; and for how AEO and GEO differ as disciplines, see AEO vs GEO explained.
