Scaling to a Billion Users
Crossing the first million in ARR felt like reaching the summit. Then we looked up and realized the mountain was much taller. Scaling LoginRadius from a promising CIAM platform to a system handling over one billion identities across 180+ countries introduced an entirely new set of challenges - architectural, organizational, and operational.
This chapter covers the decisions, trade-offs, and hard lessons of growing a security platform to that scale.
Architecture Decisions That Defined Scale
When you are managing identities for over a billion users, the decisions you made at 10,000 users can either carry you or sink you. Several early architectural choices proved to be the difference between scaling gracefully and hitting walls.
Multi-Tenant Architecture
The first critical decision was multi-tenancy. We chose a shared infrastructure with isolated data stores per customer - not a separate deployment per customer.
LoginRadius Multi-Tenant Architecture
========================================
 Customer A    Customer B    Customer C
     |             |             |
     v             v             v
+--------------------------------------+
|       Shared Application Layer       |
|   (Authentication, Authorization,    |
|   Session Management, MFA Engine)    |
+--------------------------------------+
     |             |             |
     v             v             v
+----------+  +----------+  +----------+
|   Data   |  |   Data   |  |   Data   |
| Store A  |  | Store B  |  | Store C  |
| (Region) |  | (Region) |  | (Region) |
+----------+  +----------+  +----------+
Why multi-tenant: Single-tenant deployments are simpler to reason about but impossible to scale economically to hundreds of customers. Multi-tenancy let us amortize infrastructure costs, deploy updates once instead of per-customer, and maintain a single codebase.
The trade-off: Multi-tenancy in security is harder than in other SaaS categories. A vulnerability in the shared layer affects every customer. Data isolation must be absolute - a bug that leaks one customer's user data to another customer is a career-ending event. We invested heavily in tenant isolation testing, including automated tests that attempted cross-tenant data access on every deployment.
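To make the idea of a cross-tenant access test concrete, here is a minimal sketch. Everything in it is hypothetical: `TenantScopedStore`, `CrossTenantAccessError`, and `find_cross_tenant_leaks` are illustrative names, and the in-memory dictionary stands in for real per-customer data stores. It shows the shape of such a test, not the LoginRadius implementation.

```python
class CrossTenantAccessError(Exception):
    """Raised when a caller reaches for another tenant's data."""


class TenantScopedStore:
    """Toy stand-in for per-customer data stores behind a shared layer."""

    def __init__(self):
        self._stores = {}  # tenant_id -> {user_id: profile}

    def put(self, tenant_id, user_id, profile):
        self._stores.setdefault(tenant_id, {})[user_id] = profile

    def get(self, caller_tenant_id, tenant_id, user_id):
        # Every read carries the caller's tenant and is checked against the target.
        if caller_tenant_id != tenant_id:
            raise CrossTenantAccessError(f"{caller_tenant_id} -> {tenant_id}")
        return self._stores[tenant_id][user_id]


def find_cross_tenant_leaks(store, known_users):
    """Try every cross-tenant read; return the (caller, target) pairs that succeeded."""
    leaks = []
    for caller in known_users:
        for target, user_id in known_users.items():
            if caller == target:
                continue
            try:
                store.get(caller, target, user_id)
            except CrossTenantAccessError:
                continue  # expected outcome: access denied
            leaks.append((caller, target))
    return leaks


store = TenantScopedStore()
store.put("tenant-a", "u1", {"email": "a@example.com"})
store.put("tenant-b", "u2", {"email": "b@example.com"})
known = {"tenant-a": "u1", "tenant-b": "u2"}

# A deployment gate would fail the release if this list were non-empty.
assert find_cross_tenant_leaks(store, known) == []
```

The useful property is that the test attacks the system from the outside, as a tenant would, rather than asserting on internal flags.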
Global Data Residency
As we expanded internationally, data residency became non-negotiable. GDPR tightly restricted moving European personal data out of the EU, which for us meant keeping it in Europe. Customer contracts specified data processing locations. Some countries had outright data localization laws.
| Region | Data Residency Requirement | Our Solution |
|---|---|---|
| European Union | GDPR - data processing in EU | Frankfurt and Dublin data centers |
| United States | Various state laws, customer preference | US East and US West data centers |
| India | Data localization for certain data types | Mumbai data center |
| Canada | PIPEDA, provincial requirements | Toronto data center |
| Australia | Privacy Act, data localization | Sydney data center |
We deployed in multiple regions early, and it was one of the best decisions we made. Companies that wait until a customer demands data residency find themselves in a months-long infrastructure project when they should be closing a deal.
Build for data residency before your customers demand it. The cost of deploying in additional regions proactively is a fraction of the cost of emergency deployment under customer pressure. In security, the question is not whether you will need regional data residency - it is when.
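One way to keep residency rules from being bypassed by accident is to encode them as data and check every placement against them. The sketch below assumes a hypothetical mapping built from the regions in the table above. The region codes, `ResidencyError`, and the function names are illustrative, not a real LoginRadius configuration.

```python
# Region -> permitted data centers, taken from the table above.
# The two-letter keys are an assumption made for this example.
DATA_CENTERS = {
    "EU": ["Frankfurt", "Dublin"],
    "US": ["US East", "US West"],
    "IN": ["Mumbai"],
    "CA": ["Toronto"],
    "AU": ["Sydney"],
}


class ResidencyError(Exception):
    """Raised when data would be placed outside its permitted region."""


def allowed_data_centers(tenant_region):
    """Return the data centers a tenant's data may live in."""
    try:
        return DATA_CENTERS[tenant_region]
    except KeyError:
        # Fail closed: an unknown region never falls back to a default location.
        raise ResidencyError(f"no data centers configured for {tenant_region!r}") from None


def check_placement(tenant_region, data_center):
    """Raise ResidencyError unless data_center is permitted for the tenant."""
    if data_center not in allowed_data_centers(tenant_region):
        raise ResidencyError(f"{data_center!r} is outside region {tenant_region!r}")


check_placement("EU", "Frankfurt")  # permitted, returns silently
```

Failing closed on an unknown region is the important design choice here: a missing entry should block a write, not route it somewhere convenient.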
Performance at Scale
At one billion identities, every millisecond matters. Our authentication system processes up to 150,000 login requests per second at peak. The performance architecture that enabled this:
Performance Architecture
=========================
Request Flow:
User login request
|
v
Global CDN / Edge Network
(TLS termination, DDoS mitigation)
|
v
Load Balancer (regional)
|
v
Authentication Service Cluster
(horizontally scaled, stateless)
|
v
Cache Layer (distributed)
(session data, user profiles,
configuration, rate limits)
|
v
Database Cluster (regional)
(only on cache miss or write)
Key Metrics:
- P50 latency: <50ms
- P99 latency: <200ms
- Peak throughput: 150,000 req/sec
- Uptime SLA: 99.99%
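To put those numbers in perspective, the short calculation below works out what a 99.99% uptime target allows per year and how many login requests a brief outage touches at peak throughput. The function names are illustrative, and the calculation assumes a 365-day year and sustained peak load, so treat the second figure as an upper bound.

```python
PEAK_REQUESTS_PER_SECOND = 150_000
UPTIME_SLA = 0.9999


def downtime_budget_minutes(sla, days=365):
    """Minutes of downtime per period that still meet the SLA."""
    return (1 - sla) * days * 24 * 60


def requests_affected(outage_seconds, requests_per_second=PEAK_REQUESTS_PER_SECOND):
    """Requests arriving during an outage at the given request rate."""
    return outage_seconds * requests_per_second


print(round(downtime_budget_minutes(UPTIME_SLA), 2))  # 52.56
print(requests_affected(5 * 60))                      # 45000000
```

Under those assumptions, a single 5-minute outage at peak uses roughly a tenth of the annual downtime budget and affects up to 45 million login attempts.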
Stateless services. Every authentication service instance is stateless. User sessions and state live in a distributed cache layer. This lets us scale horizontally by adding instances without coordination.
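A minimal sketch of what "stateless" means in practice, assuming a hypothetical `SessionStore` (an in-memory stand-in for the distributed cache) and a hypothetical `AuthService`. Neither name comes from the LoginRadius codebase.

```python
import secrets


class SessionStore:
    """In-memory stand-in for the distributed cache layer."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, session):
        self._data[session_id] = session

    def load(self, session_id):
        return self._data.get(session_id)


class AuthService:
    """Keeps no per-user state of its own, so any instance can serve any request."""

    def __init__(self, store):
        self._store = store

    def login(self, user_id):
        session_id = secrets.token_urlsafe(32)
        self._store.save(session_id, {"user_id": user_id})
        return session_id

    def whoami(self, session_id):
        session = self._store.load(session_id)
        return session["user_id"] if session else None


shared = SessionStore()
instance_a, instance_b = AuthService(shared), AuthService(shared)

session_id = instance_a.login("user-42")
# A different instance resolves the session because the state lives in the store.
assert instance_b.whoami(session_id) == "user-42"
```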
Aggressive caching. User profile data, configuration, and session data are cached with carefully tuned TTLs. At our scale, the difference between a cache hit and a database read is the difference between 5ms and 50ms - and that compounds across millions of requests.
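The pattern described here is commonly called cache-aside with a TTL. The sketch below is a simplified single-process version under stated assumptions: `TTLCache` and `expected_latency_ms` are hypothetical names, and the 5 ms and 50 ms defaults are the figures quoted above rather than measurements.

```python
import time


class TTLCache:
    """Cache-aside: serve fresh entries from memory, otherwise load and store."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit
        value = loader(key)  # cache miss or expired entry: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value


def expected_latency_ms(hit_ratio, hit_ms=5.0, miss_ms=50.0):
    """Average read latency for a given cache hit ratio."""
    return hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms
```

With these figures, a 95% hit ratio gives an average of about 7.25 ms per read, while a 50% hit ratio gives 27.5 ms. That gap is why TTL tuning matters: a TTL that is too short pushes traffic back onto the database, and one that is too long serves stale profiles.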
Read-heavy optimization. Authentication traffic is overwhelmingly reads (logins) rather than writes (registrations, profile updates). We optimized the read path aggressively with read replicas, caching, and denormalized data models.
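As an illustration of the read-replica part, the sketch below routes writes to a primary and spreads reads across replicas in round-robin order. `ReplicaRouter` and the connection labels are hypothetical. A real router would also need health checks and a policy for replication lag, which this sketch leaves out.

```python
import itertools


class ReplicaRouter:
    """Send writes to the primary and rotate reads across replicas."""

    def __init__(self, primary, replicas):
        self._primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._reads = itertools.cycle(replicas or [primary])

    def for_write(self):
        return self._primary

    def for_read(self):
        return next(self._reads)


router = ReplicaRouter("primary", ["replica-1", "replica-2"])
assert router.for_write() == "primary"
assert [router.for_read() for _ in range(3)] == ["replica-1", "replica-2", "replica-1"]
```

One caveat worth stating: a read issued immediately after a write may reach a replica that has not yet caught up, so flows such as "register, then log in" often need to read from the primary.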
Building the Team
Scaling from a small founding team to an organization capable of managing a billion-user platform required deliberate team building. The skills needed at 100 customers are different from the skills needed at 10,000.
The Hiring Evolution
| Stage | Team Size | Key Hires | Why |
|---|---|---|---|
| 0-100 customers | 5-10 | Generalist engineers | Need people who can build anything |
| 100-500 customers | 15-30 | SRE, Security engineer, first AE | Reliability and sales become critical |
| 500-2000 customers | 30-60 | VP Engineering, CS team, DevOps | Need leadership and customer operations |
| 2000+ customers | 60-100+ | CISO, compliance team, regional leads | Governance, global operations |
The Hardest Hires in Security
Security engineers who can also build product. The intersection of security expertise and product engineering skills is tiny. Most security engineers want to break things, not build products. Finding people who can do both is one of the hardest hiring challenges in the industry.
Enterprise account executives who understand security. Enterprise sales in security requires AEs who can hold technical conversations with CISOs, navigate security evaluations, and speak the language of risk. These people are rare and expensive.
Customer success managers with security domain knowledge. A CSM who does not understand authentication, identity federation, or compliance requirements cannot effectively support security customers. We had to train CSMs extensively on our domain.
The biggest hiring mistake we made was hiring for generic SaaS experience instead of security domain expertise. A VP of Sales who crushed it at a marketing SaaS company struggled in security because the buying process, the stakeholder map, and the value proposition are fundamentally different. Domain expertise trumps general experience in security.
Operations at 1B+ Identities
Running a platform that manages over a billion identities is an operational discipline as much as a technical one. Several operational capabilities became critical as we scaled.
Incident Response
At billion-user scale, every incident is amplified. A 5-minute outage affects millions of login attempts. A security vulnerability could expose billions of records. We built incident response as a core competency:
Incident Response Framework
==============================
Severity Levels:
  SEV 1: Platform-wide outage or security breach
    Response: Immediate, all hands
    Comms: Customer notification within 1 hour
    Postmortem: Mandatory within 24 hours
  SEV 2: Degraded performance or partial outage
    Response: On-call team, escalation to leads
    Comms: Status page update within 30 minutes
    Postmortem: Mandatory within 48 hours
  SEV 3: Minor issue, no customer impact
    Response: On-call team, fix during business hours
    Comms: Internal only
    Postmortem: Optional
Key Metrics:
- MTTD (Mean Time to Detect): <5 minutes
- Mean Time to Respond: <15 minutes
- MTTR (Mean Time to Resolve): <2 hours (SEV 1)
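A framework like this is easier to enforce when it is encoded as data that tooling can read. The sketch below is one possible encoding. `SeverityPolicy`, `POLICIES`, and `deadlines` are hypothetical names, and it assumes the communication and postmortem clocks both start at detection time, which the framework above does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    response: str
    comms: str
    comms_within: Optional[timedelta]       # None: no external deadline
    postmortem_within: Optional[timedelta]  # None: postmortem is optional


POLICIES = {
    1: SeverityPolicy("Immediate, all hands", "Customer notification",
                      timedelta(hours=1), timedelta(hours=24)),
    2: SeverityPolicy("On-call team, escalation to leads", "Status page update",
                      timedelta(minutes=30), timedelta(hours=48)),
    3: SeverityPolicy("On-call team, fix during business hours", "Internal only",
                      None, None),
}


def _due(start, within):
    return start + within if within is not None else None


def deadlines(severity, detected_at):
    """Compute due times for an incident, measured from detection (an assumption)."""
    policy = POLICIES[severity]
    return {
        "comms_due": _due(detected_at, policy.comms_within),
        "postmortem_due": _due(detected_at, policy.postmortem_within),
    }


detected = datetime(2024, 1, 1, 12, 0)
assert deadlines(1, detected)["comms_due"] == datetime(2024, 1, 1, 13, 0)
```

An on-call tool could use `deadlines` to open reminders automatically when an incident is declared, so the commitments do not depend on someone remembering them mid-incident.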
Compliance at Scale
As our customer base grew, so did our compliance requirements. Different customers required different certifications and audit evidence:
| Certification | What It Covers | Effort to Maintain |
|---|---|---|
| SOC 2 Type II | Security, availability, processing integrity | Annual audit, continuous controls |
| ISO 27001 | Information security management | Annual audit, management review |
| GDPR compliance | EU data protection | DPO, data processing agreements, privacy impact assessments |
| HIPAA | Healthcare data protection | Annual risk assessment, BAA with customers |
| PCI DSS | Payment card data security | Quarterly scans, annual assessment |
The compliance burden grows non-linearly. Each new certification adds ongoing maintenance, audit preparation, and documentation requirements. At some point, you need a dedicated compliance team - not security engineers doubling as compliance managers.
The CTO-to-CISO Dual Role
Technical founders of security companies often find themselves wearing two hats: CTO (building the product) and de facto CISO (ensuring the security of their own infrastructure). These roles have fundamentally different objectives:
| Dimension | CTO Perspective | CISO Perspective |
|---|---|---|
| Speed | Ship fast, iterate quickly | Move carefully, evaluate risks |
| Features | Add capabilities | Minimize attack surface |
| Architecture | Optimize for performance | Optimize for security |
| Access | Enable developer productivity | Restrict access to need-only |
| Dependencies | Use best tools available | Minimize third-party risk |
These perspectives often conflict. The CTO wants to adopt the latest database technology. The CISO wants to use battle-tested, audited solutions. The CTO wants developers to have broad access for debugging. The CISO wants least-privilege access with full audit trails.
If you are a technical founder running a security company, eventually you must split the CTO and CISO roles. One person cannot effectively optimize for both speed and security. The conflicts between these roles require separate decision-makers with equal organizational authority. The longer the split is delayed, the more tension the unresolved conflict creates across the engineering organization.
International Expansion Challenges
Expanding LoginRadius internationally introduced challenges beyond data residency:
Regulatory fragmentation. Every country has different privacy laws, data protection requirements, and authentication standards. What is compliant in the US may be illegal in Germany. What is acceptable in India may not satisfy Canadian requirements.
Localization of security features. Authentication experiences need localization beyond language translation. Phone number formats for SMS OTP, national ID verification requirements, and regional social login preferences (WeChat in China, LINE in Japan) all require country-specific implementation.
Support across time zones. Enterprise security customers expect responsive support. A customer experiencing an authentication outage at 3 AM their time cannot wait until your engineering team wakes up. We built follow-the-sun support before we could comfortably afford it.
Currency and pricing. Enterprise pricing in different markets requires understanding local purchasing power, competitive dynamics, and procurement norms. A pricing structure that works in the US may be non-competitive in India or over-discounted in Northern Europe.
| Expansion Challenge | Our Approach |
|---|---|
| Regulatory compliance | Hired regional compliance advisors in EU, India, and Australia |
| Data residency | Pre-deployed infrastructure in 5 regions |
| Localization | Built extensible authentication UX supporting 30+ languages |
| Time zone support | Follow-the-sun support team across US, India, and Australia |
| Pricing | Regional pricing tiers based on market analysis |
The Lessons of Scale
Scaling to a billion users taught us lessons that could not be learned at smaller scale:
Reliability is a feature. At small scale, an outage is an inconvenience. At billion-user scale, an outage is a crisis that makes headlines. We invested more in reliability engineering than in new features for extended periods. That felt wrong at the time but was exactly right.
Compliance is a competitive advantage. Companies often view compliance as a cost center. At scale, our comprehensive compliance portfolio became one of our strongest sales differentiators. Enterprise customers chose us specifically because we had the certifications their compliance teams required.
Simplicity scales, complexity breaks. The architectural decisions that scaled best were the simplest ones. Stateless services, clean data isolation, and aggressive caching are not clever - they are boring and reliable. The clever architectural decisions were the ones that caused the most incidents.
Culture determines operational quality. We could not hire enough engineers to personally monitor every system. Culture - specifically a culture where every engineer felt personally responsible for the reliability and security of the platform - was what kept the system running. Blameless postmortems, shared on-call responsibilities, and celebrating reliability metrics built that culture.
For a deep dive into authentication architecture at scale, see The Complete Guide to Authentication Implementation for Modern Applications.
The next chapter covers the unique dynamics of selling security to enterprises - where your product IS the thing they are evaluating for security.