Deploying Agents to Production
You built an agent. It works on your laptop. Your demo wowed the team. Now comes the part nobody warns you about: making it work reliably for thousands of users, 24/7, without burning through your entire cloud budget in a week.
The gap between a demo agent and a production agent is enormous - arguably bigger than the gap for traditional software. Traditional apps have predictable execution paths. Agents don't. They make decisions, call tools in unpredictable orders, and sometimes go on expensive tangents. Deploying them requires a fundamentally different mindset.
This chapter is your field guide to crossing that gap. We'll cover architecture, scaling, observability, cost control, and everything else you need to sleep soundly while your agent handles real traffic.
The Demo-to-Production Gap
Let's be honest about what your demo agent is missing:
| Aspect | Demo | Production |
|---|---|---|
| Concurrency | 1 user at a time | Hundreds or thousands simultaneous |
| Error handling | try/catch, maybe a retry | Dead letter queues, circuit breakers, graceful degradation |
| State management | In-memory dict | Redis, PostgreSQL, distributed state |
| Cost control | "We'll optimize later" | Per-request budgets, model routing, caching |
| Observability | print() statements | Structured logging, distributed tracing, alerting |
| Latency | "It's fast enough" | P95 latency targets, streaming, async execution |
| Security | Hardcoded API keys | Secrets management, network isolation, auth layers |
| Testing | Manual "does it look right?" | Automated eval suites, regression tests, canary deployments |
If you're reading this list and thinking "we only need half of these," you're probably wrong. Production has a way of punishing optimism.
Architecture for Production Agents
A production agent system is not a single process. It's a distributed system with several distinct layers:
┌─────────────────────────────────────────────────────┐
│ API Gateway │
│ (Authentication, Rate Limiting) │
├─────────────────────────────────────────────────────┤
│ API Layer │
│ (Request validation, session management) │
├─────────────────────────────────────────────────────┤
│ Task Queue (Redis / SQS) │
├──────────┬──────────┬──────────┬────────────────────┤
│ Worker 1 │ Worker 2 │ Worker 3 │ ... Worker N │
│ (Agent) │ (Agent) │ (Agent) │ (Agent) │
├──────────┴──────────┴──────────┴────────────────────┤
│ Shared State (Redis + PostgreSQL) │
├─────────────────────────────────────────────────────┤
│ Tool Services (APIs, DBs, External Services) │
└─────────────────────────────────────────────────────┘
The API Layer receives requests, authenticates users, validates input, and enqueues work. It should never run agent logic directly - that's a recipe for timeouts and resource exhaustion.
The Task Queue decouples request acceptance from processing. When a user sends a message, the API layer returns immediately with a task ID. The agent work happens asynchronously in workers.
Workers are the processes that actually run your agents. They pull tasks from the queue, execute the agent loop, and store results. Each worker runs one agent task at a time (usually), which makes scaling straightforward.
Shared State is where conversation history, tool results, and agent memory live. This must be accessible to any worker, because you can't guarantee which worker will handle which request.
Start with a simple architecture. Many teams over-engineer from day one. If you have fewer than 100 concurrent users, a single API server with a Redis queue and a few workers is plenty. Scale when you have data showing you need to.
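To make the "accept now, process later" split concrete, here is a minimal sketch of what the API layer does with an incoming message. It uses an in-process `asyncio.Queue` as a stand-in for Redis or SQS so it runs anywhere; the function name `enqueue_task` and the payload fields are illustrative assumptions, not part of any framework.

```python
import asyncio
import json
import uuid


async def enqueue_task(queue: asyncio.Queue, session_id: str, message: str) -> dict:
    """Accept a request, enqueue it, and return immediately with a task ID.

    No agent logic runs here; a worker picks the payload up later.
    """
    task_id = str(uuid.uuid4())
    payload = {"task_id": task_id, "session_id": session_id, "message": message}
    # In production this would be an LPUSH to Redis or a SendMessage to SQS.
    await queue.put(json.dumps(payload))
    return {"task_id": task_id, "status": "queued"}


async def demo() -> tuple[dict, dict]:
    """Enqueue one message, then read it back the way a worker would."""
    queue: asyncio.Queue = asyncio.Queue()
    accepted = await enqueue_task(queue, "sess-1", "Where is my order?")
    queued = json.loads(await queue.get())
    return accepted, queued
```

The caller gets the task ID back right away and can poll for the result or subscribe to a stream, while the expensive work happens in a worker.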
Containerization: Packaging Agents with Docker
Agents have more dependencies than typical web apps - model client libraries, tool-specific packages, possibly system-level binaries for things like PDF parsing or browser automation. Docker is the sane way to manage this.
Here's a production-grade Dockerfile for a Python-based agent:
# Stage 1: Build dependencies
FROM python:3.12-slim AS builder
WORKDIR /app
# Install system dependencies needed for building
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Runtime
FROM python:3.12-slim AS runtime
WORKDIR /app
# Install only runtime system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
&& rm -rf /var/lib/apt/lists/*
# Copy installed Python packages from builder
COPY --from=builder /install /usr/local
# Copy application code
COPY src/ ./src/
COPY config/ ./config/
# Create non-root user
RUN useradd -m -r agentuser && chown -R agentuser:agentuser /app
USER agentuser
# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENTRYPOINT ["python", "-m", "src.worker"]
A few things worth noting:
- Multi-stage build keeps the final image small. Build tools don't ship to production.
- Non-root user is non-negotiable for security. Never run agents as root.
- Health check lets your orchestrator know when a worker is stuck. Agents can hang - an LLM call might never return, a tool might deadlock. Health checks catch this.
- PYTHONUNBUFFERED ensures logs appear immediately, which matters enormously when debugging agent behavior.
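A health check only catches a hung agent if the `/health` endpoint reflects whether the agent loop is still making progress. One way to do that, sketched below under assumptions, is a heartbeat: the worker loop calls `beat()` on every iteration, and the health endpoint reports unhealthy once the last beat is too old. The `Heartbeat` class and its parameter names are hypothetical, and the worker would also need to serve `/health` on port 8080 for the Dockerfile's `HEALTHCHECK` to reach it.

```python
import time


class Heartbeat:
    """Tracks when the worker loop last made progress."""

    def __init__(self, max_age_seconds: float, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_beat = clock()

    def beat(self) -> None:
        """Call once per loop iteration (for example, after each BRPOP returns)."""
        self._last_beat = self._clock()

    def is_healthy(self) -> bool:
        """True while the last beat is recent enough."""
        return (self._clock() - self._last_beat) <= self.max_age_seconds
```

Pick `max_age_seconds` larger than your longest legitimate agent run, otherwise the orchestrator will kill workers that are merely busy.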
For the API server, you'd have a separate Dockerfile (or a different entrypoint in the same image):
ENTRYPOINT ["uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "8080"]
If your agent uses browser automation tools (Playwright, Selenium), you'll need a significantly larger base image with Chromium. Consider running browser tools as a separate microservice to keep your main agent image lean.
Scaling Strategies
Agents are weird to scale because their resource usage is bursty and unpredictable. A simple question might take 2 seconds and one LLM call. A complex task might take 90 seconds and fifteen LLM calls with tool use.
Horizontal Scaling with Queue-Based Processing
The most reliable pattern is queue-based processing with auto-scaling workers:
# worker.py - simplified production worker
import asyncio
import json
import redis.asyncio as redis
from agent import Agent
from config import settings
# store_result, store_error, and maybe_retry are application helpers
# defined elsewhere in the codebase.
pool = redis.ConnectionPool.from_url(settings.REDIS_URL)
async def process_task(task_data: dict):
    agent = Agent(
        model=task_data.get("model", "claude-sonnet-4-20250514"),
        max_steps=task_data.get("max_steps", 25),
        timeout=task_data.get("timeout", 120),
    )
    try:
        result = await agent.run(
            messages=task_data["messages"],
            tools=task_data["tools"],
            session_id=task_data["session_id"],
        )
        await store_result(task_data["task_id"], result)
    except Exception as e:
        # Record the error on the task so the retry / dead letter path can report it
        task_data["last_error"] = repr(e)
        await store_error(task_data["task_id"], e)
        await maybe_retry(task_data)
async def worker_loop():
    r = redis.Redis(connection_pool=pool)
    while True:
        # BRPOP blocks until a task is available and returns None on timeout
        item = await r.brpop("agent:tasks", timeout=5)
        if item is None:
            continue
        _, raw = item
        task = json.loads(raw)
        await process_task(task)
if __name__ == "__main__":
    asyncio.run(worker_loop())
Scale workers based on queue depth. Most cloud platforms support this natively:
- AWS: ECS tasks scaling on SQS queue depth via CloudWatch alarms
- GCP: Cloud Run jobs scaling on Pub/Sub message count
- Kubernetes: KEDA (Kubernetes Event-Driven Autoscaling) watching Redis list length
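Whichever platform you use, the scaling rule underneath is the same arithmetic: divide the backlog by how many queued tasks you want each worker to absorb, then clamp to your min and max. The sketch below shows that rule; `desired_workers` and `tasks_per_worker` are illustrative names, and real autoscalers add cooldowns and smoothing on top.

```python
import math


def desired_workers(
    queue_depth: int,
    tasks_per_worker: int,
    min_workers: int,
    max_workers: int,
) -> int:
    """Target worker count for a given queue depth, clamped to [min, max]."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil(queue_depth / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))
```

With a target of 10 queued tasks per worker and bounds of 2 to 50, a backlog of 95 tasks asks for 10 workers, an empty queue stays at 2, and a flood of 10,000 tasks is capped at 50.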
Load Balancing Considerations
Standard round-robin load balancing works for the API layer, but think carefully about sticky sessions if your agent maintains conversational state. Options:
- Stateless workers + shared state (recommended): Any worker can handle any request. State lives in Redis/PostgreSQL.
- Sticky sessions: Route all requests from one conversation to the same worker. Simpler code, but harder to scale and recover from failures.
- Hybrid: Keep hot state (current turn) in-worker, persist cold state (history) to shared storage.
State Management in Production
Agents are stateful creatures. They maintain conversation history, tool results, memory, and sometimes long-running context. Here's how to manage this:
| State Type | Storage | TTL | Example |
|---|---|---|---|
| Conversation history | PostgreSQL | Days to months | Full message history for a session |
| Current turn context | Redis | Minutes to hours | Active agent loop state, tool call results |
| Agent memory | PostgreSQL + vector DB | Long-term | Learned facts, user preferences |
| Task queue | Redis / SQS | Minutes | Pending agent tasks |
| Rate limiting | Redis | Seconds to minutes | Per-user request counts |
| Cache | Redis | Hours to days | Repeated query results, embeddings |
# State management pattern
import json
class AgentStateManager:
    def __init__(self, redis_client, pg_pool):
        # redis_client: an async Redis client; pg_pool: an asyncpg-style pool
        self.redis = redis_client
        self.pg = pg_pool
    async def save_turn(self, session_id: str, messages: list):
        """Save current turn to Redis for fast access (1 hour TTL)."""
        key = f"session:{session_id}:current"
        await self.redis.setex(key, 3600, json.dumps(messages))
    async def persist_history(self, session_id: str, messages: list):
        """Persist full history to PostgreSQL (upsert by session_id)."""
        async with self.pg.acquire() as conn:
            await conn.execute(
                """INSERT INTO conversation_history
                   (session_id, messages, updated_at)
                   VALUES ($1, $2, NOW())
                   ON CONFLICT (session_id)
                   DO UPDATE SET messages = $2, updated_at = NOW()""",
                session_id, json.dumps(messages)
            )
    async def get_context(self, session_id: str) -> list:
        """Try Redis first, fall back to PostgreSQL."""
        cached = await self.redis.get(f"session:{session_id}:current")
        if cached:
            return json.loads(cached)
        async with self.pg.acquire() as conn:
            row = await conn.fetchrow(
                "SELECT messages FROM conversation_history WHERE session_id = $1",
                session_id
            )
            return json.loads(row["messages"]) if row else []
Observability: Seeing What Your Agent Does
Traditional apps log request/response. Agents need much more. Each agent run is a tree of decisions, tool calls, and model invocations. You need to see all of it.
Structured Logging
Every log line should be a JSON object with consistent fields:
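import structlog
# Render every event as a single JSON object
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger()
# count_tokens, llm, and calculate_cost are application helpers defined elsewhere
async def run_agent_step(session_id, step_num, message):
    logger.info(
        "agent_step_start",
        session_id=session_id,
        step=step_num,
        input_tokens=count_tokens(message),
    )
    result = await llm.generate(message)
    logger.info(
        "agent_step_complete",
        session_id=session_id,
        step=step_num,
        output_tokens=result.usage.output_tokens,
        model=result.model,
        tool_calls=len(result.tool_calls),
        latency_ms=result.latency_ms,
        cost_usd=calculate_cost(result.usage),
    )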
Distributed Tracing
Use OpenTelemetry to trace the full lifecycle of an agent run. Each tool call, each LLM invocation, each database query becomes a span in a trace:
from opentelemetry import trace
tracer = trace.get_tracer("agent-service")
# max_steps, messages, call_llm, and execute_tool come from the surrounding agent code
async def run_agent(session_id: str, user_message: str):
    with tracer.start_as_current_span("agent_run") as span:
        span.set_attribute("session_id", session_id)
        for step in range(max_steps):
            with tracer.start_as_current_span(f"step_{step}"):
                response = await call_llm(messages)
                if response.tool_calls:
                    for tool_call in response.tool_calls:
                        with tracer.start_as_current_span(f"tool_{tool_call.name}"):
                            result = await execute_tool(tool_call)
This gives you a waterfall view of every agent run - invaluable when debugging why a request took 45 seconds or cost $0.50.
Key Metrics Dashboard
Build a dashboard with these metrics from day one:
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| P50/P95/P99 latency | User experience | P95 > 30s |
| Cost per request | Budget control | > $0.10/request |
| Steps per completion | Agent efficiency | Avg > 10 steps |
| Tool call success rate | Tool reliability | < 95% |
| LLM error rate | Provider reliability | > 2% |
| Queue depth | Capacity planning | > 100 pending |
| Task completion rate | Agent effectiveness | < 90% |
| Tokens per request | Cost trending | Sudden spikes |
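If your metrics backend doesn't compute percentiles for you, the P95 alert in the table comes down to a few lines. This is a minimal sketch using the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values); `percentile` and `latency_alert` are illustrative names, and the 30-second default mirrors the threshold in the table above.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(latencies_s: list[float], p95_threshold_s: float = 30.0) -> bool:
    """True when the P95 latency breaches the alert threshold."""
    return percentile(latencies_s, 95) > p95_threshold_s
```

Note how a single 100-second outlier among 100 requests does not trip the P95 alert. That is the point of alerting on percentiles instead of maximums.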
Error Handling at Scale
Agents fail in creative ways. The LLM hallucinates a tool name. An API returns a 500. The model provider rate-limits you. A tool execution hangs forever. You need defense in depth.
Retry Strategy
Not all errors are retryable. Be specific:
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception
# RateLimitError and APIConnectionError come from your model client library
# (the Anthropic and OpenAI Python SDKs both export classes with these names).
def is_retryable(exception):
    """Only retry transient errors."""
    if isinstance(exception, RateLimitError):
        return True
    if isinstance(exception, APIConnectionError):
        return True
    if isinstance(exception, TimeoutError):
        return True
    # Don't retry validation errors, auth errors, etc.
    return False
@retry(
    retry=retry_if_exception(is_retryable),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
)
async def call_llm_with_retry(messages, model):
    return await llm.generate(messages=messages, model=model)
Dead Letter Queues
When a task fails all retries, don't lose it. Send it to a dead letter queue for human review:
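import json
from datetime import datetime, timezone
import redis.asyncio as redis
from config import settings
MAX_RETRIES = 3
# Use a client instance, not the redis module itself
r = redis.Redis.from_url(settings.REDIS_URL)
async def maybe_retry(task_data: dict):
    retries = task_data.get("retry_count", 0)
    if retries < MAX_RETRIES:
        task_data["retry_count"] = retries + 1
        await r.lpush("agent:tasks", json.dumps(task_data))
    else:
        # Send to dead letter queue
        await r.lpush("agent:dead_letter", json.dumps({
            **task_data,
            "failed_at": datetime.now(timezone.utc).isoformat(),
            "last_error": task_data.get("last_error"),
        }))
        # alert_ops_team is an application helper (pager, Slack, email)
        await alert_ops_team(task_data)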
Graceful Degradation
When things go wrong, degrade gracefully instead of failing completely:
- Model fallback: If Claude Opus times out, retry with Sonnet. If Sonnet is down, fall back to Haiku with a simplified prompt.
- Tool fallback: If the primary search API is down, use a cached result or skip the search step and tell the user.
- Budget exhaustion: If a request exceeds its cost budget, stop the agent loop and return the best partial result.
The best production agents have a "minimum viable response" - the simplest useful answer they can give when everything else fails. Even if all tools are down and the expensive model is unavailable, returning a helpful message beats returning a 500 error.
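The fallback chain and the "minimum viable response" fit together naturally: try each model in order with a timeout, and if every one fails, return the canned answer instead of an error. The sketch below assumes each model is wrapped in an async callable that takes a prompt and returns text; `call_with_fallback` and the stub models are hypothetical stand-ins for your real client calls.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

ModelCall = Callable[[str], Awaitable[str]]


async def call_with_fallback(
    prompt: str,
    models: Sequence[tuple[str, ModelCall]],
    per_model_timeout: float,
    minimum_response: str,
) -> tuple[str, str]:
    """Try models in order; return (model_name, text) from the first that succeeds."""
    for name, call in models:
        try:
            text = await asyncio.wait_for(call(prompt), timeout=per_model_timeout)
            return name, text
        except Exception:
            # Sketch only: real code should log the failure and catch narrower errors.
            continue
    return "none", minimum_response


# Stub models for demonstration: one hangs, one errors, one answers.
async def slow_model(prompt: str) -> str:
    await asyncio.sleep(1)
    return "slow answer"


async def broken_model(prompt: str) -> str:
    raise RuntimeError("provider unavailable")


async def small_model(prompt: str) -> str:
    return "ok: " + prompt
```

The returned model name is worth logging: a rising share of responses served by the last fallback is an early warning that your primary provider is degrading.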
Cost Optimization
LLM costs can be staggering at scale. A single complex agent run might cost $0.50 or more. Multiply that by thousands of daily requests and you're looking at serious money.
Model Routing
The single most impactful optimization. Not every request needs your most powerful model:
class ModelRouter:
    async def select_model(self, message: str, context: dict) -> str:
        complexity = await self.estimate_complexity(message)
        if complexity == "simple":
            # Greetings, FAQ, simple lookups
            return "claude-haiku"
        elif complexity == "moderate":
            # Most tool-use tasks, standard queries
            return "claude-sonnet"
        else:
            # Multi-step reasoning, complex analysis
            return "claude-opus"
    async def estimate_complexity(self, message: str) -> str:
        # Use a fast classifier - could be a small model,
        # keyword heuristics, or even regex
        indicators = {
            "simple": ["hello", "thanks", "what is", "define"],
            "complex": ["analyze", "compare", "debug", "refactor",
                        "write a comprehensive", "step by step"],
        }
        # ... classification logic
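Real-world impact depends on the price gap between model tiers. If the smaller model costs about one fifth as much per token, routing 60% of traffic to Haiku instead of Sonnet cuts costs by about 80% on those requests, or roughly 48% of the total bill assuming similar token counts per request, usually with minimal quality loss on simple queries.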
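The router leaves the classification logic open. As one possible starting point, here is a keyword-heuristic sketch of the complexity estimate. It is a plain synchronous function rather than the async method above, the phrase lists are copied from the router, and the rules themselves (complex phrases win, and "simple" only applies to short messages of at most 12 words) are assumptions you should tune against your own traffic.

```python
INDICATORS = {
    "simple": ["hello", "thanks", "what is", "define"],
    "complex": ["analyze", "compare", "debug", "refactor",
                "write a comprehensive", "step by step"],
}

# Assumed cutoff: longer messages are rarely "simple" even if they say hello.
MAX_SIMPLE_WORDS = 12


def estimate_complexity(message: str) -> str:
    """Classify a message as 'simple', 'moderate', or 'complex' by keyword match."""
    text = message.lower()
    # Complex indicators take priority: "what is wrong, debug this" is not simple.
    if any(phrase in text for phrase in INDICATORS["complex"]):
        return "complex"
    is_short = len(text.split()) <= MAX_SIMPLE_WORDS
    if is_short and any(phrase in text for phrase in INDICATORS["simple"]):
        return "simple"
    return "moderate"
```

Defaulting to "moderate" is deliberate: when the heuristic is unsure, the mid-tier model is the safer bet than either extreme.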
Response Caching
Cache at multiple levels:
import json
# redis and vector_db are async client instances created at startup.
# The caller computes prompt_embedding once and passes it in.
async def get_or_generate(prompt_hash: str, prompt_embedding: list, generate_fn):
    # L1: Check exact match cache
    cached = await redis.get(f"cache:exact:{prompt_hash}")
    if cached:
        return json.loads(cached)
    # L2: Check semantic cache (for similar queries)
    similar = await vector_db.search(prompt_embedding, threshold=0.95)
    if similar:
        return similar[0].response
    # Cache miss: generate and cache
    result = await generate_fn()
    await redis.setex(f"cache:exact:{prompt_hash}", 3600, json.dumps(result))
    return result
Batching
If you're processing many independent requests, batch them:
# Instead of N individual API calls for embeddings:
embeddings = await client.embeddings.create(
    input=all_texts,  # Send all at once
    model="text-embedding-3-small",
)
Latency Optimization
Users hate waiting. Here's how to make agents feel fast:
Streaming Responses
Stream the agent's final response as it's generated. Don't wait for the full response:
async def stream_agent_response(session_id: str, message: str):
    agent = Agent(model="claude-sonnet-4-20250514")
    async for event in agent.run_stream(message):
        if event.type == "thinking":
            yield {"type": "status", "text": "Analyzing your request..."}
        elif event.type == "tool_call":
            yield {"type": "status", "text": f"Using {event.tool_name}..."}
        elif event.type == "text_delta":
            yield {"type": "content", "text": event.text}
This transforms a 20-second wait into a responsive experience where the user sees progress the entire time.
Parallel Tool Execution
When the agent calls multiple tools that don't depend on each other, run them in parallel:
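import asyncio
async def execute_tools(tool_calls: list) -> list:
    # Start every independent call at once
    tasks = [execute_single_tool(tc) for tc in tool_calls]
    # return_exceptions=True: a failed tool yields its exception instead of aborting the batch
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results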
This alone can cut agent latency by 40-60% in multi-tool scenarios.
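Running tools in parallel also means handling the results carefully: with `return_exceptions=True`, failures come back as exception objects mixed in with the successes, and one hung tool can still stall the whole batch. The sketch below adds a per-tool timeout and normalizes every outcome; `run_tools_in_parallel` and the stub tools are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable


async def run_tools_in_parallel(
    calls: dict[str, Callable[[], Awaitable[object]]],
    timeout_s: float,
) -> dict[str, dict]:
    """Run independent tools concurrently, each with its own timeout."""

    async def guarded(fn: Callable[[], Awaitable[object]]) -> object:
        return await asyncio.wait_for(fn(), timeout=timeout_s)

    names = list(calls)
    outcomes = await asyncio.gather(
        *(guarded(calls[name]) for name in names),
        return_exceptions=True,
    )
    results: dict[str, dict] = {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            results[name] = {"ok": False, "error": type(outcome).__name__}
        else:
            results[name] = {"ok": True, "value": outcome}
    return results


# Stub tools for demonstration.
async def weather_tool() -> str:
    await asyncio.sleep(0.01)
    return "sunny"


async def failing_search_tool() -> str:
    raise ValueError("bad query")


async def hanging_tool() -> str:
    await asyncio.sleep(5)
    return "never returned in time"
```

The agent can then pass the failures back to the model as tool errors, which usually lets it recover or explain the gap to the user.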
CI/CD for Agents
Agents need a different testing and deployment strategy than traditional software.
Testing Strategy
Unit Tests → Individual tool functions, prompt formatting
Integration Tests → Agent + real tools against test fixtures
Eval Suite → Agent against a benchmark of 50-200 test cases with automated scoring (accuracy, cost, latency)
Shadow Mode → Agent runs against production traffic but results are logged, not served
Canary Deployment → 5% of traffic goes to new version, compare metrics against baseline
Full Rollout → Gradual ramp: 5% → 25% → 50% → 100%
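The eval suite is the stage teams most often skip, so here is a minimal sketch of what "automated scoring" can mean at its simplest. It assumes the agent can be called as a plain function from prompt to text, and it scores by substring match, which is crude; real suites often use rubric or model-based grading. `EvalCase`, `run_eval`, and the fake agent are illustrative names, and the 0.90 gate matches the threshold in the deployment checklist later in this chapter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected_substring: str


def run_eval(
    agent_fn: Callable[[str], str],
    cases: list[EvalCase],
    min_accuracy: float = 0.90,
) -> dict:
    """Run every case through the agent and report accuracy against a gate."""
    if not cases:
        raise ValueError("eval suite has no cases")
    passed = sum(
        1 for case in cases
        if case.expected_substring.lower() in agent_fn(case.prompt).lower()
    )
    accuracy = passed / len(cases)
    return {
        "passed": passed,
        "total": len(cases),
        "accuracy": accuracy,
        "gate_ok": accuracy >= min_accuracy,
    }


def fake_agent(prompt: str) -> str:
    """Stand-in agent for demonstration: only knows about refunds."""
    if "refund" in prompt.lower():
        return "Your refund was issued yesterday."
    return "I am not sure."
```

Wire `gate_ok` into CI so a prompt change that drops accuracy below the gate fails the build before it ever reaches shadow mode.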
Canary Deployments
# Kubernetes canary with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: agent-worker
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 30m }
        - analysis:
            templates:
              - templateName: agent-success-rate
        - setWeight: 25
        - pause: { duration: 1h }
        - setWeight: 50
        - pause: { duration: 2h }
        - setWeight: 100
The analysis step is critical. It checks whether the canary's success rate, latency, and cost are within acceptable bounds before proceeding.
Rollback Triggers
Auto-rollback when:
- Task completion rate drops below 85%
- P95 latency exceeds 2x baseline
- Average cost per request exceeds 1.5x baseline
- Error rate exceeds 5%
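These four triggers translate directly into a check you can run against canary and baseline metrics. A minimal sketch, with the thresholds taken from the list above and the `Metrics` shape assumed for illustration:

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    completion_rate: float    # fraction of tasks completed, 0.0 to 1.0
    p95_latency_s: float
    cost_per_request: float   # USD
    error_rate: float         # fraction of requests that errored, 0.0 to 1.0


def rollback_reasons(canary: Metrics, baseline: Metrics) -> list[str]:
    """Return every rollback trigger the canary has tripped (empty list = healthy)."""
    reasons = []
    if canary.completion_rate < 0.85:
        reasons.append("completion rate below 85%")
    if canary.p95_latency_s > 2 * baseline.p95_latency_s:
        reasons.append("P95 latency above 2x baseline")
    if canary.cost_per_request > 1.5 * baseline.cost_per_request:
        reasons.append("cost per request above 1.5x baseline")
    if canary.error_rate > 0.05:
        reasons.append("error rate above 5%")
    return reasons
```

Returning the full list of reasons, not just a boolean, makes the rollback alert far more useful to whoever is on call.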
Infrastructure Comparison
Choosing the right compute platform matters:
| Factor | Serverless (Lambda/Cloud Functions) | Containers (ECS/Cloud Run) | VMs (EC2/Compute Engine) |
|---|---|---|---|
| Cold start | 1-10s (problematic for agents) | Minimal with min instances | None |
| Max execution time | 15 min (Lambda) | Unlimited (ECS); 60 min per request (Cloud Run services) | Unlimited |
| Scaling speed | Seconds | 30s-2min | Minutes |
| Cost at low volume | Very low | Low-moderate | High (always running) |
| Cost at high volume | Can be expensive | Moderate | Lowest per-unit |
| Complexity | Low | Moderate | High |
| State management | External only | External recommended | Local possible |
| Best for agents? | Simple, short tasks | Most agent workloads | Heavy, long-running agents |
Containers (ECS/Cloud Run/Kubernetes) are the sweet spot for most agent workloads. They handle the long execution times agents need, scale reasonably fast, and don't have the cold start problems of serverless. Use serverless only for lightweight agent tasks that complete in under a minute.
Real-World Deployment: Customer Support Agent on AWS
Let's walk through deploying a customer support agent on AWS. This isn't theoretical - it reflects a pattern I've seen work at scale.
Architecture
Route 53 (DNS)
→ ALB (Application Load Balancer)
→ ECS Service: API (Fargate, 2-10 tasks)
→ SQS Queue (agent tasks)
→ ECS Service: Workers (Fargate, 2-50 tasks)
→ ElastiCache Redis (state + cache)
→ RDS PostgreSQL (history + analytics)
→ Secrets Manager (API keys)
→ CloudWatch (logs + metrics)
→ S3 (conversation archives)
Key Configuration
# config.py - production settings
class ProductionConfig:
    # Model settings
    PRIMARY_MODEL = "claude-sonnet-4-20250514"
    FALLBACK_MODEL = "claude-haiku"
    MAX_STEPS = 15
    STEP_TIMEOUT = 30  # seconds per step
    TOTAL_TIMEOUT = 180  # seconds per request
    MAX_COST_PER_REQUEST = 0.25  # USD
    # Scaling
    MIN_WORKERS = 2
    MAX_WORKERS = 50
    SCALE_UP_THRESHOLD = 10  # queue depth
    SCALE_DOWN_THRESHOLD = 2
    # Reliability
    MAX_RETRIES = 3
    CIRCUIT_BREAKER_THRESHOLD = 5  # failures before opening
    CIRCUIT_BREAKER_TIMEOUT = 60  # seconds before half-open
    # Caching
    RESPONSE_CACHE_TTL = 3600
    EMBEDDING_CACHE_TTL = 86400
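The two circuit breaker settings above only make sense once you see the state machine behind them. Here is a minimal sketch: the breaker opens after a run of consecutive failures, rejects calls while open, and lets a trial request through (the "half-open" state) once the timeout has passed. The class and method names are illustrative, and a production breaker would also need to be safe under concurrency and shared across workers.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # CIRCUIT_BREAKER_THRESHOLD
        self.reset_timeout = reset_timeout          # CIRCUIT_BREAKER_TIMEOUT
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._failures = 0
        self._opened_at = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: reject until the timeout elapses, then allow a trial (half-open)
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial
```

Wrap each external dependency (model provider, search API, database) in its own breaker, and check `allow_request()` before the call so a failing dependency gets skipped quickly instead of eating your step timeout.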
What This Looks Like Running
With this setup, a typical request flows like this:
- User sends message via WebSocket or HTTP
- API server authenticates, validates, enqueues to SQS (50ms)
- Worker picks up task, loads conversation history from Redis (10ms)
- Agent runs 3-5 steps with tool calls (5-15 seconds)
- Response streamed back via WebSocket or returned via HTTP
- Conversation history persisted to PostgreSQL (async, 20ms)
- Metrics emitted to CloudWatch
Cost at moderate scale (10,000 conversations/day): roughly $500-1,500/month in LLM costs plus $200-400 in infrastructure.
Security in Production
Agent security deserves its own chapter (and it gets one earlier in this book), but here are the production-specific concerns:
API Authentication
Every request to your agent API must be authenticated. Use short-lived tokens, not API keys:
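import jwt  # PyJWT
from fastapi import Depends, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
security = HTTPBearer()
# SECRET_KEY is loaded at startup from your secrets manager, never hardcoded
async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    token = credentials.credentials
    try:
        # Require an exp claim so tokens without an expiry are rejected
        payload = jwt.decode(
            token, SECRET_KEY, algorithms=["HS256"], options={"require": ["exp"]}
        )
        return payload
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")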
Secrets Management
Never hardcode API keys. Never put them in environment variables in your Dockerfile. Use your cloud provider's secrets manager:
import json
import boto3
def get_api_key(secret_name: str) -> str:
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])["api_key"]
# Fetch once at startup and keep in memory - don't fetch on every request
ANTHROPIC_API_KEY = get_api_key("prod/anthropic-api-key")
Network Isolation
Your agent workers should live in a private subnet with no direct internet access. They reach external APIs through a NAT gateway or VPC endpoints. This limits the blast radius if an agent is compromised.
VPC
├── Public Subnet
│ ├── ALB
│ └── NAT Gateway
├── Private Subnet
│ ├── API Service (ECS)
│ ├── Worker Service (ECS)
│ ├── ElastiCache
│ └── RDS
Agents that execute code or use browser tools present elevated security risks. Run these tools in isolated sandboxes - separate containers, restricted network access, ephemeral filesystems. Treat every tool execution as potentially hostile input.
Deployment Checklist
Before you push to production, run through this checklist:
- All API keys in secrets manager, not environment variables or code
- Health checks configured for all services
- Auto-scaling configured with sensible min/max bounds
- Dead letter queue configured for failed tasks
- Structured logging with correlation IDs (session_id, task_id)
- Cost alerts set at 50%, 80%, and 100% of budget
- Latency alerts on P95
- Error rate alerts with auto-rollback
- Eval suite passing at >= 90% accuracy
- Rate limiting configured per user and globally
- Network isolation configured (private subnets for workers)
- Backup and restore tested for state databases
- Runbook written for common failure scenarios
- On-call rotation established
Production deployment isn't a one-time event. It's an ongoing practice. Your agent will behave differently at scale than it did in testing, and you'll need to iterate on prompts, tools, and architecture as you learn from real traffic. The infrastructure in this chapter gives you the foundation to do that iteration safely.
In the next chapter, we'll look at real-world use cases - how teams across different industries have built and deployed agents successfully, and what lessons they've learned along the way.