By Deepak Gupta | Published June 8, 2026 | AI Security

Mercor's 4TB Data Heist: When a Poisoned AI Library Exposed OpenAI and Meta's Training Pipeline

A poisoned LiteLLM package led to 4TB stolen from Mercor, the AI training startup serving Meta, OpenAI, and Anthropic. Class action lawsuits filed.

A single poisoned Python package has produced the most consequential AI supply chain breach of 2026.

On March 31, Mercor, a $10 billion AI training startup that recruits, vets, and pays the human experts who train frontier models for OpenAI, Anthropic, Meta, and Google, confirmed that approximately four terabytes of data had been stolen from its systems. The breach traces back to the LiteLLM supply chain attack, in which threat group TeamPCP published malicious versions of the widely used AI gateway library to PyPI on March 27.

The stolen data includes 939GB of platform source code, a 211GB user database, and roughly three terabytes of storage buckets containing video interviews, contractor passport scans, Social Security numbers, and identity verification documents. Lapsus$ claimed responsibility and listed the data for auction on the dark web. Meta indefinitely paused all data work with Mercor. Five contractor lawsuits have been filed as class actions.

This breach matters beyond its immediate victims because Mercor sits at a structural chokepoint of the modern AI economy. The company's contractors generate the training data that shapes the behavior of the world's most powerful language models. When their identities, work product, and verification documents are stolen, the compromise extends into the AI training pipeline itself.

The Attack Chain

The Mercor breach is a downstream consequence of the LiteLLM supply chain compromise that hit thousands of organizations in late March 2026.

TeamPCP compromised Trivy, a widely used open-source vulnerability scanner, on March 19 by rewriting Git tags to point to a malicious release. Through that foothold, they extracted publishing credentials from LiteLLM's CI/CD pipeline, which ran Trivy without pinned versions. On March 27, TeamPCP used those credentials to publish malicious LiteLLM versions 1.82.7 and 1.82.8 directly to PyPI.
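
The tag-rewriting step works because a Git tag is a mutable pointer: a pipeline that references a tool by tag runs whatever that tag points to today. A minimal sketch of a check for this, assuming the pipeline is defined as GitHub Actions workflow YAML (the reporting above does not say which CI system LiteLLM used) and using a hypothetical action name in the example:

```python
import re

# Matches `uses: owner/repo@ref` lines as they appear in GitHub Actions workflows.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s@#]+)@([^\s#]+)")
# A full 40-character commit SHA is immutable; tags and branch names are not.
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def find_mutable_refs(workflow_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, action, ref) for each `uses:` not pinned to a full commit SHA."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if match and not FULL_SHA_RE.match(match.group(2)):
            findings.append((lineno, match.group(1), match.group(2)))
    return findings


# Hypothetical workflow fragment: the first step follows a tag, the second a commit.
EXAMPLE = (
    "steps:\n"
    "  - uses: example-org/scanner-action@v1\n"
    "  - uses: example-org/scanner-action@" + "a" * 40 + "\n"
)
```

Calling find_mutable_refs(EXAMPLE) flags only the step that follows the v1 tag. Pinning to a commit SHA does not make the pinned code trustworthy. It only guarantees that the code cannot change silently after review.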

The poisoned packages contained a multi-stage credential stealer that activated automatically on every Python process startup via a .pth file mechanism. The malware harvested AWS, GCP, and Azure tokens, SSH keys, Kubernetes configurations, database credentials, and API keys from every environment where the compromised version was installed.
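
The .pth mechanism is a documented feature of Python's site module: at interpreter startup, any line in a site-packages .pth file that begins with "import" followed by a space or tab is executed as code rather than treated as a path. A minimal audit sketch under that assumption follows. The function names are illustrative, and this is not the detection logic used by any vendor named in this article.

```python
from pathlib import Path


def executable_pth_lines(pth_text: str) -> list[str]:
    """Return the lines that the site module would execute instead of adding to sys.path."""
    return [
        line
        for line in pth_text.splitlines()
        if line.startswith(("import ", "import\t"))
    ]


def audit_site_dir(site_dir: Path) -> dict[str, list[str]]:
    """Map each .pth file in site_dir to its executable lines, omitting files with none."""
    report = {}
    for pth in sorted(site_dir.glob("*.pth")):
        text = pth.read_text(encoding="utf-8", errors="replace")
        lines = executable_pth_lines(text)
        if lines:
            report[pth.name] = lines
    return report
```

In practice this would be run over each directory returned by site.getsitepackages(). Some legitimate tools also ship executable .pth lines, so a hit is a prompt for manual review, not proof of compromise.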

LiteLLM is downloaded approximately 95 million times per month and is present in an estimated 36% of cloud environments, according to Wiz Research. Mercor was one of thousands of organizations that pulled the compromised package during the three-hour window before PyPI quarantined it. The stolen credentials gave attackers access to Mercor's internal systems, from which they exfiltrated the four-terabyte dataset.

Security firm Halborn's post-mortem confirmed the attackers used the harvested credentials to access Mercor's Tailscale VPN, then moved laterally through internal systems to reach storage buckets containing the most sensitive data: contractor identity documents and AI training artifacts.

What Was Stolen

The scope of the Mercor breach extends far beyond typical corporate data theft.

The 211GB user database contains personal information for over 40,000 contractors, including names, email addresses, phone numbers, and account credentials. But the bulk of the stolen data, roughly three terabytes, consists of storage buckets containing video interviews recorded during Mercor's contractor vetting process, passport scans and government ID documents submitted for identity verification, Social Security numbers collected for tax reporting, and work samples from AI training tasks.

The 939GB of source code includes Mercor's proprietary platform architecture, internal tooling, and potentially the methodologies used to generate and evaluate AI training data. For competitors, this represents a significant intelligence windfall. For Mercor's clients, particularly Meta, OpenAI, Anthropic, and Google, it raises questions about whether their AI training methodologies, evaluation criteria, or proprietary approaches were embedded in Mercor's code or contractor communications.

Meta's response was immediate: an indefinite pause on all AI data work with Mercor. The company signed a $27 billion AI infrastructure deal with Nebius Group in March 2026 and has forecast capital expenditures of up to $135 billion for the year. With that much committed to AI, protecting its training pipeline is strategically critical.

Why This Breach Matters

AI Training Data Is the New Crown Jewel

The Mercor breach is the first major attack to treat AI training data as a primary target. Previous supply chain attacks focused on stealing cloud credentials, deploying ransomware, or harvesting financial data. The Mercor attack went after something more strategically valuable: the human-generated data that teaches frontier AI models how to reason, code, write, and make decisions.

When a company like Mercor is breached, the compromise ripples outward in ways that traditional data breaches do not. The contractor who submitted a passport scan for identity verification did not consent to having that document auctioned on the dark web. The AI training tasks those contractors completed may contain proprietary information about how their client companies evaluate model quality. The video interviews may reveal evaluation methodologies that competitors could exploit.

The Supply Chain Multiplier Effect

LiteLLM's presence in 36% of cloud environments means the TeamPCP supply chain attack created a blast radius that extends far beyond Mercor. The three-hour window during which the compromised package was available produced an estimated 40,000 downloads. Each download potentially compromised every credential accessible from the affected environment.

When I was building the CIAM platform that served over a billion users, we learned that identity infrastructure is only as secure as its weakest dependency. A single compromised library in the dependency chain can expose every secret that flows through the systems where it is installed. LiteLLM, by design, sits between applications and multiple AI model providers. It typically holds API keys for OpenAI, Anthropic, Google, Azure, and whatever other services the application integrates with. Compromising LiteLLM does not just expose one set of credentials. It exposes every AI provider credential in the environment.

The practical defense that separated victims from survivors in this attack was straightforward: organizations that installed from lockfiles such as poetry.lock or uv.lock, with pinned versions and cryptographic hash verification, were completely protected. The malicious packages never matched the expected hashes, so they were never installed. This is not a sophisticated security measure. It is basic dependency hygiene that most AI development teams still do not practice.
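
The hash check itself is simple. A minimal sketch, assuming the lockfile was generated before the malicious release and records SHA-256 digests of the reviewed artifacts; the function names and byte strings are illustrative and do not reproduce any installer's internals:

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_locked_hash(artifact: bytes, locked_hashes: set[str]) -> bool:
    """Accept an artifact only if its digest is one the lockfile already recorded."""
    return sha256_hex(artifact) in locked_hashes


# Stand-in bytes for a wheel that was reviewed when the lockfile was generated.
reviewed = b"original wheel bytes"
locked = {sha256_hex(reviewed)}
```

Any artifact whose bytes differ from the reviewed one produces a different digest, so matches_locked_hash returns False for it and the install stops before any code from the package runs.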

The Class Action Signal

Five contractor lawsuits have been filed against Mercor, with at least one naming BerriAI (LiteLLM's creator) and Delve Technologies as co-defendants. The legal argument is straightforward: contractors submitted sensitive personal information including passport scans and Social Security numbers, trusting that Mercor would protect that data. The supply chain attack compromised that trust.

The class action signal matters for the broader AI ecosystem because it establishes that AI companies bear responsibility for the security of their supply chain dependencies. The defense that "we were one of thousands of companies affected" does not remove the obligation to protect contractor PII. The question the courts will ultimately address is whether organizations that fail to pin dependencies, verify package integrity, or isolate sensitive data from publicly accessible credential stores have met a reasonable standard of care.

What Organizations Should Do

Pin every dependency with hash verification. This is the single highest-impact action. Organizations using lockfiles with cryptographic hashes were completely protected from the LiteLLM attack. This should be standard practice for every AI development team.
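
One way to enforce this in review is to reject any requirements file that contains an entry without both an exact version and a hash. A rough sketch, assuming pip-style requirements syntax with --hash=sha256: options; the package names in the example are hypothetical, and the parser deliberately ignores less common syntax such as URL requirements:

```python
import re

# An exact pin looks like `name==version`, optionally with extras such as name[extra].
PIN_RE = re.compile(r"^[A-Za-z0-9._\[\],-]+\s*==\s*[^\s;]+")


def logical_lines(text: str) -> list[str]:
    """Join backslash-continued lines, then drop blank lines and comments."""
    joined = re.sub(r"\\\r?\n", " ", text)
    lines = []
    for raw in joined.splitlines():
        stripped = raw.strip()
        if stripped and not stripped.startswith("#"):
            lines.append(stripped)
    return lines


def unverified_requirements(text: str) -> list[str]:
    """Return requirement lines that lack an exact `==` pin or a sha256 hash option."""
    problems = []
    for line in logical_lines(text):
        if line.startswith("-"):  # global option lines such as --index-url
            continue
        if not PIN_RE.match(line) or "--hash=sha256:" not in line:
            problems.append(line)
    return problems


# Hypothetical file: the first entry is pinned and hashed, the second is neither.
EXAMPLE = (
    "example-gateway==1.0.0 \\\n"
    "    --hash=sha256:" + "0" * 64 + "\n"
    "example-http>=2.0\n"
)
```

For EXAMPLE, unverified_requirements returns only the example-http line. pip offers a comparable install-time guarantee through its --require-hashes mode, which refuses to install anything that is not hash-pinned.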

Isolate credential stores from AI gateway libraries. LiteLLM and similar AI gateway tools should not have direct access to cloud provider credentials, database passwords, or infrastructure secrets. Use dedicated secrets management (HashiCorp Vault, AWS Secrets Manager) with scoped, short-lived credentials rather than environment variables.
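
One narrow piece of this can be sketched in a few lines: start the gateway process with an allow-listed environment instead of the full parent environment. The variable names below are hypothetical. This limits only what is readable from environment variables and does nothing for credential files on disk, which is why the process should also run under a separate, scoped identity.

```python
def scoped_env(source: dict[str, str], allowed: set[str]) -> dict[str, str]:
    """Return only the allow-listed variables from the source environment."""
    return {name: value for name, value in source.items() if name in allowed}


# Hypothetical parent environment holding more secrets than the gateway needs.
parent_env = {
    "PATH": "/usr/bin",
    "GATEWAY_PROVIDER_KEY": "provider-key",
    "AWS_SECRET_ACCESS_KEY": "cloud-secret",
    "DATABASE_URL": "postgres://user:password@db.internal/app",
}

gateway_env = scoped_env(parent_env, {"PATH", "GATEWAY_PROVIDER_KEY"})
# Pass the result as the `env` argument of subprocess.run when launching the gateway.
```

Here gateway_env keeps PATH and the single provider key, and drops the cloud and database secrets.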

Segment AI training data from general infrastructure. Contractor PII, passport scans, and AI training artifacts should be stored in isolated environments with separate access controls, not in general-purpose storage buckets accessible from development systems. The zero trust principle applies: training data environments should authenticate every access request independently.

Implement an AI supply chain risk assessment. Map every open-source dependency in your AI stack. Identify which libraries have access to credentials, which handle sensitive data, and which are maintained by small teams or individual developers. The machine identity governance framework applies to AI dependencies as much as it does to service accounts and API keys.
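
A starting point for the mapping step is to ask which installed packages depend, directly or transitively, on a given library. A minimal sketch using the standard library's importlib.metadata; the helper names are illustrative, and because optional (extras) requirements are included, the result may overstate the real dependency graph:

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a project name so that Foo_Bar and foo-bar compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement: str) -> str:
    """Extract the normalized project name from a requirement string."""
    match = re.match(r"[A-Za-z0-9._-]+", requirement.strip())
    return normalize(match.group(0)) if match else ""


def installed_dependency_map() -> dict[str, set[str]]:
    """Map each installed distribution to the names it declares as requirements."""
    deps = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            deps[normalize(name)] = {requirement_name(r) for r in (dist.requires or [])}
    return deps


def dependents_of(target: str, deps: dict[str, set[str]]) -> set[str]:
    """Return every package that depends on target, directly or transitively."""
    found: set[str] = set()
    frontier = {target}
    while frontier:
        frontier = {pkg for pkg, reqs in deps.items() if reqs & frontier} - found
        found |= frontier
    return found
```

Calling dependents_of("litellm", installed_dependency_map()) in each environment lists the packages whose installation pulls the gateway in. Cross-referencing that list with the credentials each environment can reach gives a first cut of the risk map.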

Monitor for compromised credentials continuously. The TeamPCP credentials were available in criminal databases before they were used to publish the malicious LiteLLM packages. Credential monitoring services that watch for your organization's domains and API keys in dark web marketplaces can provide early warning before stolen credentials are operationalized.

Key Takeaways

Mercor, a $10B AI training startup serving Meta, OpenAI, Anthropic, and Google, confirmed on March 31, 2026 that 4TB of data had been stolen through the LiteLLM supply chain attack
Stolen data includes 939GB of source code, a 211GB user database, and roughly 3TB of contractor passport scans, SSNs, video interviews, and AI training artifacts
The attack originated from TeamPCP poisoning LiteLLM on PyPI; the library is present in an estimated 36% of cloud environments
Meta indefinitely paused all AI data work with Mercor; five class action lawsuits filed by affected contractors
Organizations using dependency lockfiles with hash verification were completely protected from the attack
This is the first major breach targeting AI training data as a primary objective, raising questions about the security of frontier model training pipelines
AI gateway libraries like LiteLLM create outsized blast radius because they hold credentials for multiple AI providers simultaneously
