
Synthetic Data Becomes the Primary AI Training Data

The internet is running out of high-quality text for AI training, and synthetic data is filling the gap. By 2028, more AI training tokens will come from AI models than from humans.

// By 2028 · medium confidence · disruption 8/10

Prediction

// 2028

By 2028, more than 60% of AI training data tokens will be synthetic (generated by AI models) rather than collected from real-world human sources.

Confidence: medium
Disruption: 8/10

What dies

  • The DVD rental store

Who wins

  • Anthropic
  • OpenAI
  • Scale AI

filed: 2026-05-24 · guptadeepak.com

The hook

The internet's stock of high-quality human-written text is being exhausted as AI training input. Major model labs have already crawled most accessible high-quality content. Synthetic data is now the path forward, and it is working better than skeptics predicted.

Thesis. Synthetic data is not a stopgap. It is the next training data paradigm. AI models generating training data for the next generation of models creates self-improving feedback loops with profound legal, attribution, and safety implications.

The story

The current state

Microsoft's Phi-3 was trained partly on synthetic data and outperformed expectations. Anthropic has published on synthetic data scaling. Scale AI pivoted toward synthetic-data services. Custom synthetic datasets for specific domains are now in production use.

The inflection point

Around 2023, the curve flipped. Real-world high-quality text became the bottleneck. Synthetic data, treated skeptically until then, proved that it could match or exceed real data for many training objectives. The publication wave from major labs in 2024 confirmed the direction.

The prediction

By 2028, more than 60% of training tokens will come from synthetic sources. The remainder, under 40%, will be real-world data plus reinforcement learning from human feedback. The training corpus mix inverts.
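A quick back-of-the-envelope check on what that mix implies. This is a sketch under loud assumptions: the human-written stock is treated as fixed and used exactly once, the 10 to 100 trillion token range is the article's own estimate, and the function name `synthetic_tokens_needed` is made up for illustration.

```python
def synthetic_tokens_needed(human_tokens: float, synthetic_share: float) -> float:
    """Synthetic tokens required so they make up `synthetic_share` of the corpus."""
    if not 0.0 <= synthetic_share < 1.0:
        raise ValueError("synthetic_share must be in [0, 1)")
    # share = s / (s + h)  =>  s = h * share / (1 - share)
    return human_tokens * synthetic_share / (1.0 - synthetic_share)


if __name__ == "__main__":
    # Assumption: the article's estimate of 10 to 100 trillion high-quality tokens.
    for human in (10e12, 100e12):
        needed = synthetic_tokens_needed(human, 0.60)
        print(f"{human / 1e12:.0f}T human tokens -> {needed / 1e12:.0f}T synthetic tokens")
```

At a 60% share, the ratio works out to 1.5 synthetic tokens for every human-written token, whatever the size of the human stock.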

Who wins, who loses

Winners: foundation model labs that crack synthetic-data quality, infrastructure vendors (Scale AI, Tonic AI, Mostly AI) that productize synthetic generation, and privacy-sensitive industries (healthcare, finance) that get usable training data without real-data exposure. Losers: the data-collection-as-a-service category, and the DVD-rental-era assumption that data is something you store and rent out.

Timeline and risks

Model collapse is the main risk: training on AI-generated data can degrade model quality if the synthesis loop is not carefully constructed. Research on data-curation methods to prevent this is ongoing. The legal question of who owns synthetic data generated by models trained on copyrighted real data is unsettled.
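The collapse mechanism can be felt in a toy setting. The sketch below is not a language model: it assumes a one-dimensional Gaussian "model" that is refit each generation only on a small sample drawn from the previous generation's fit, which is the naive loop the paragraph above warns about. The function name and parameters are illustrative.

```python
import random
import statistics


def simulate_collapse(generations: int = 300, samples_per_gen: int = 10,
                      seed: int = 0) -> list[float]:
    """Return the fitted standard deviation after each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "real" distribution
    history = [sigma]
    for _ in range(generations):
        # Each generation sees only data produced by the previous generation.
        data = [rng.gauss(mu, sigma) for _ in range(samples_per_gen)]
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    print(f"initial spread: {history[0]:.3g}, final spread: {history[-1]:.3g}")
```

With small samples and no fresh real data, estimation error compounds and the fitted spread typically shrinks toward zero: the tails of the original distribution are lost first. Mixing real data back in each generation is one way to change that behavior in this toy.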

First signals (verify today)

Anthropic publishes papers on synthetic data scaling. Microsoft trains its Phi models on synthetic data. Sakana AI works on synthetic training. Scale AI is pivoting toward synthetic data.

Key data points

  • Microsoft Phi-3 synthetic data training: 2023 to 2024
  • Scale AI revenue: $1B+ in 2024
  • Common Crawl coverage: most of the public web by 2023
  • Estimated high-quality text on internet: 10 to 100 trillion tokens
  • Estimated synthetic data market size: $5B+ by 2028

Contrarian angle

The 'AI training data shortage' narrative is half right. Real-world data has limits. Synthetic data, generated by AI from its own knowledge, is unbounded. This creates a feedback loop where AI's understanding of the world increasingly comes from AI's understanding of the world. The model collapse risk and the privacy implications are not well understood.


FAQ

What is synthetic data and how is it different from fake data?

Synthetic data is generated by a model trained on real data and is designed to preserve the real data's statistical properties without reproducing any specific real record. Fake data is fabricated without that statistical fidelity.
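The distinction fits in a few lines of code. This is a minimal sketch, assuming the "model" is nothing more than a fitted mean and standard deviation and that the real records are a handful of made-up numbers; `synthesize` and `fabricate` are hypothetical names.

```python
import random
import statistics


def synthesize(real: list[float], n: int, rng: random.Random) -> list[float]:
    """Sample new values from a Gaussian fitted to the real records."""
    mu = statistics.fmean(real)
    sigma = statistics.stdev(real)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def fabricate(n: int, rng: random.Random, low: float = 0.0,
              high: float = 1000.0) -> list[float]:
    """Invent values with no reference to the real records."""
    return [rng.uniform(low, high) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)
    real = [50.0, 55.0, 45.0, 60.0, 40.0, 52.0, 48.0]  # made-up illustrative values
    for name, data in (("real", real),
                       ("synthetic", synthesize(real, 1000, rng)),
                       ("fake", fabricate(1000, rng))):
        print(f"{name:9s} mean={statistics.fmean(data):8.2f} "
              f"stdev={statistics.stdev(data):8.2f}")
```

The synthetic sample tracks the mean and spread of the real records without copying any of them; the fabricated sample has no reason to.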

Is synthetic data better or worse than real data for training?

Mixed. For some tasks (code, math, reasoning chains) synthetic data is better because it is cleaner and more diverse. For others (rare facts, edge cases) real data is better. The optimal training mix is itself an active research question.

Does training on synthetic data cause model collapse?

It can, if synthesis is naive. Careful curation (deduplication, diversity sampling, quality filtering) prevents collapse. The labs that get this right pull ahead.
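The three curation steps named above can be sketched as a small pipeline. The heuristics here are stand-in assumptions, not any lab's method: exact-match deduplication after normalization, a crude length and repetition check for quality, and a per-bucket cap keyed on the first word as a proxy for diversity. Production pipelines presumably use stronger tools such as near-duplicate detection and learned quality classifiers.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(samples: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized sample."""
    seen, unique = set(), []
    for s in samples:
        key = hashlib.sha256(normalize(s).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique


def passes_quality(text: str, min_words: int = 5,
                   max_repeat_ratio: float = 0.5) -> bool:
    """Reject very short samples and samples dominated by one repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio


def diversity_sample(samples: list[str], per_bucket: int = 2) -> list[str]:
    """Cap how many samples share the same first word (a toy diversity proxy)."""
    buckets = defaultdict(list)
    for s in samples:
        words = normalize(s).split()
        key = words[0] if words else ""
        if len(buckets[key]) < per_bucket:
            buckets[key].append(s)
    return [s for bucket in buckets.values() for s in bucket]


def curate(samples: list[str]) -> list[str]:
    """Deduplicate, filter for quality, then enforce diversity."""
    return diversity_sample([s for s in deduplicate(samples) if passes_quality(s)])
```

The order matters in this sketch: deduplicating and filtering first means the diversity cap is spent on samples worth keeping.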

Are there privacy advantages to synthetic data?

Yes, with caveats. Synthetic data generated by a model trained on real data may still leak information about the training records. Differential privacy applied at synthesis time gives provable guarantees.
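One textbook way to apply differential privacy at synthesis time is to add Laplace noise to a histogram of the real records and sample synthetic records from the noisy histogram. The sketch below assumes categorical records, a category list fixed in advance (not derived from the data), and a sensitivity of 1, since adding or removing one record changes one count by 1. The function names are illustrative, and floating-point noise like this is not hardened for production use.

```python
import random
from collections import Counter


def laplace_noise(scale: float, rng: random.Random) -> float:
    """The difference of two exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(records: list[str], categories: list[str], epsilon: float,
                 rng: random.Random) -> dict[str, float]:
    """Epsilon-DP histogram via the Laplace mechanism (sensitivity 1)."""
    counts = Counter(records)
    scale = 1.0 / epsilon
    noisy = {c: counts.get(c, 0) + laplace_noise(scale, rng) for c in categories}
    # Clamping is post-processing, so it does not weaken the privacy guarantee.
    return {c: max(v, 0.0) for c, v in noisy.items()}


def sample_synthetic(noisy_counts: dict[str, float], n: int,
                     rng: random.Random) -> list[str]:
    """Draw synthetic records in proportion to the noisy counts."""
    cats = list(noisy_counts)
    weights = [noisy_counts[c] for c in cats]
    if sum(weights) <= 0:  # every count was clamped to zero: fall back to uniform
        weights = [1.0] * len(cats)
    return rng.choices(cats, weights=weights, k=n)


if __name__ == "__main__":
    rng = random.Random(0)
    records = ["A"] * 50 + ["B"] * 30 + ["C"] * 20  # made-up illustrative records
    noisy = dp_histogram(records, ["A", "B", "C", "D"], epsilon=1.0, rng=rng)
    print(Counter(sample_synthetic(noisy, 100, rng)))
```

A smaller `epsilon` means more noise and stronger privacy, at the cost of synthetic data that tracks the real distribution less closely.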
