Skip to content

Deploying Agents to Production

You built an agent. It works on your laptop. Your demo wowed the team. Now comes the part nobody warns you about: making it work reliably for thousands of users, 24/7, without burning through your entire cloud budget in a week.

The gap between a demo agent and a production agent is enormous - arguably bigger than the gap for traditional software. Traditional apps have predictable execution paths. Agents don't. They make decisions, call tools in unpredictable orders, and sometimes go on expensive tangents. Deploying them requires a fundamentally different mindset.

This chapter is your field guide to crossing that gap. We'll cover architecture, scaling, observability, cost control, and everything else you need to sleep soundly while your agent handles real traffic.

The Demo-to-Production Gap

Let's be honest about what your demo agent is missing:

Aspect Demo Production
Concurrency 1 user at a time Hundreds or thousands simultaneous
Error handling try/catch, maybe a retry Dead letter queues, circuit breakers, graceful degradation
State management In-memory dict Redis, PostgreSQL, distributed state
Cost control "We'll optimize later" Per-request budgets, model routing, caching
Observability print() statements Structured logging, distributed tracing, alerting
Latency "It's fast enough" P95 latency targets, streaming, async execution
Security Hardcoded API keys Secrets management, network isolation, auth layers
Testing Manual "does it look right?" Automated eval suites, regression tests, canary deployments

If you're reading this list and thinking "we only need half of these," you're probably wrong. Production has a way of punishing optimism.

Architecture for Production Agents

A production agent system is not a single process. It's a distributed system with several distinct layers:

┌─────────────────────────────────────────────────────┐
│                   API Gateway                        │
│         (Authentication, Rate Limiting)              │
├─────────────────────────────────────────────────────┤
│                  API Layer                           │
│     (Request validation, session management)         │
├─────────────────────────────────────────────────────┤
│              Task Queue (Redis / SQS)                │
├──────────┬──────────┬──────────┬────────────────────┤
│ Worker 1 │ Worker 2 │ Worker 3 │  ... Worker N      │
│ (Agent)  │ (Agent)  │ (Agent)  │  (Agent)           │
├──────────┴──────────┴──────────┴────────────────────┤
│           Shared State (Redis + PostgreSQL)          │
├─────────────────────────────────────────────────────┤
│     Tool Services (APIs, DBs, External Services)    │
└─────────────────────────────────────────────────────┘

The API Layer receives requests, authenticates users, validates input, and enqueues work. It should never run agent logic directly - that's a recipe for timeouts and resource exhaustion.

The Task Queue decouples request acceptance from processing. When a user sends a message, the API layer returns immediately with a task ID. The agent work happens asynchronously in workers.

Workers are the processes that actually run your agents. They pull tasks from the queue, execute the agent loop, and store results. Each worker runs one agent task at a time (usually), which makes scaling straightforward.

Shared State is where conversation history, tool results, and agent memory live. This must be accessible to any worker, because you can't guarantee which worker will handle which request.

Tip

Start with a simple architecture. Many teams over-engineer from day one. If you have fewer than 100 concurrent users, a single API server with a Redis queue and a few workers is plenty. Scale when you have data showing you need to.

Containerization: Packaging Agents with Docker

Agents have more dependencies than typical web apps - model client libraries, tool-specific packages, possibly system-level binaries for things like PDF parsing or browser automation. Docker is the sane way to manage this.

Here's a production-grade Dockerfile for a Python-based agent:

# Stage 1: Build dependencies
FROM python:3.12-slim AS builder

WORKDIR /app

# Install system dependencies needed for building
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: Runtime
FROM python:3.12-slim AS runtime

WORKDIR /app

# Install only runtime system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy installed Python packages from builder
COPY --from=builder /install /usr/local

# Copy application code
COPY src/ ./src/
COPY config/ ./config/

# Create non-root user
RUN useradd -m -r agentuser && chown -R agentuser:agentuser /app
USER agentuser

# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

ENTRYPOINT ["python", "-m", "src.worker"]

A few things worth noting:

  • Multi-stage build keeps the final image small. Build tools don't ship to production.
  • Non-root user is non-negotiable for security. Never run agents as root.
  • Health check lets your orchestrator know when a worker is stuck. Agents can hang - an LLM call might never return, a tool might deadlock. Health checks catch this.
  • PYTHONUNBUFFERED ensures logs appear immediately, which matters enormously when debugging agent behavior.

For the API server, you'd have a separate Dockerfile (or a different entrypoint in the same image):

ENTRYPOINT ["uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "8080"]
Warning

If your agent uses browser automation tools (Playwright, Selenium), you'll need a significantly larger base image with Chromium. Consider running browser tools as a separate microservice to keep your main agent image lean.

Scaling Strategies

Agents are weird to scale because their resource usage is bursty and unpredictable. A simple question might take 2 seconds and one LLM call. A complex task might take 90 seconds and fifteen LLM calls with tool use.

Horizontal Scaling with Queue-Based Processing

The most reliable pattern is queue-based processing with auto-scaling workers:

# worker.py - simplified production worker
import asyncio
import redis.asyncio as redis
import json
from agent import Agent
from config import settings

pool = redis.ConnectionPool.from_url(settings.REDIS_URL)

async def process_task(task_data: dict):
    agent = Agent(
        model=task_data.get("model", "claude-sonnet-4-20250514"),
        max_steps=task_data.get("max_steps", 25),
        timeout=task_data.get("timeout", 120),
    )

    try:
        result = await agent.run(
            messages=task_data["messages"],
            tools=task_data["tools"],
            session_id=task_data["session_id"],
        )
        await store_result(task_data["task_id"], result)
    except Exception as e:
        await store_error(task_data["task_id"], e)
        await maybe_retry(task_data)

async def worker_loop():
    r = redis.Redis(connection_pool=pool)
    while True:
        # BRPOP blocks until a task is available (or timeout)
        _, raw = await r.brpop("agent:tasks", timeout=5)
        if raw:
            task = json.loads(raw)
            await process_task(task)

if __name__ == "__main__":
    asyncio.run(worker_loop())

Scale workers based on queue depth. Most cloud platforms support this natively:

  • AWS: ECS tasks scaling on SQS queue depth via CloudWatch alarms
  • GCP: Cloud Run jobs scaling on Pub/Sub message count
  • Kubernetes: KEDA (Kubernetes Event-Driven Autoscaling) watching Redis list length

Load Balancing Considerations

Standard round-robin load balancing works for the API layer, but think carefully about sticky sessions if your agent maintains conversational state. Options:

  1. Stateless workers + shared state (recommended): Any worker can handle any request. State lives in Redis/PostgreSQL.
  2. Sticky sessions: Route all requests from one conversation to the same worker. Simpler code, but harder to scale and recover from failures.
  3. Hybrid: Keep hot state (current turn) in-worker, persist cold state (history) to shared storage.

State Management in Production

Agents are stateful creatures. They maintain conversation history, tool results, memory, and sometimes long-running context. Here's how to manage this:

State Type Storage TTL Example
Conversation history PostgreSQL Days to months Full message history for a session
Current turn context Redis Minutes to hours Active agent loop state, tool call results
Agent memory PostgreSQL + vector DB Long-term Learned facts, user preferences
Task queue Redis / SQS Minutes Pending agent tasks
Rate limiting Redis Seconds to minutes Per-user request counts
Cache Redis Hours to days Repeated query results, embeddings
# State management pattern
class AgentStateManager:
    def __init__(self, redis_client, pg_pool):
        self.redis = redis_client
        self.pg = pg_pool

    async def save_turn(self, session_id: str, messages: list):
        """Save current turn to Redis for fast access."""
        key = f"session:{session_id}:current"
        await self.redis.setex(key, 3600, json.dumps(messages))

    async def persist_history(self, session_id: str, messages: list):
        """Persist full history to PostgreSQL."""
        async with self.pg.acquire() as conn:
            await conn.execute(
                """INSERT INTO conversation_history
                   (session_id, messages, updated_at)
                   VALUES ($1, $2, NOW())
                   ON CONFLICT (session_id)
                   DO UPDATE SET messages = $2, updated_at = NOW()""",
                session_id, json.dumps(messages)
            )

    async def get_context(self, session_id: str) -> list:
        """Try Redis first, fall back to PostgreSQL."""
        cached = await self.redis.get(f"session:{session_id}:current")
        if cached:
            return json.loads(cached)

        async with self.pg.acquire() as conn:
            row = await conn.fetchrow(
                "SELECT messages FROM conversation_history WHERE session_id = $1",
                session_id
            )
            return json.loads(row["messages"]) if row else []

Observability: Seeing What Your Agent Does

Traditional apps log request/response. Agents need much more. Each agent run is a tree of decisions, tool calls, and model invocations. You need to see all of it.

Structured Logging

Every log line should be a JSON object with consistent fields:

import structlog

logger = structlog.get_logger()

async def run_agent_step(session_id, step_num, message):
    logger.info(
        "agent_step_start",
        session_id=session_id,
        step=step_num,
        input_tokens=count_tokens(message),
    )

    result = await llm.generate(message)

    logger.info(
        "agent_step_complete",
        session_id=session_id,
        step=step_num,
        output_tokens=result.usage.output_tokens,
        model=result.model,
        tool_calls=len(result.tool_calls),
        latency_ms=result.latency_ms,
        cost_usd=calculate_cost(result.usage),
    )

Distributed Tracing

Use OpenTelemetry to trace the full lifecycle of an agent run. Each tool call, each LLM invocation, each database query becomes a span in a trace:

from opentelemetry import trace

tracer = trace.get_tracer("agent-service")

async def run_agent(session_id: str, user_message: str):
    with tracer.start_as_current_span("agent_run") as span:
        span.set_attribute("session_id", session_id)

        for step in range(max_steps):
            with tracer.start_as_current_span(f"step_{step}"):
                response = await call_llm(messages)

                if response.tool_calls:
                    for tool_call in response.tool_calls:
                        with tracer.start_as_current_span(f"tool_{tool_call.name}"):
                            result = await execute_tool(tool_call)

This gives you a waterfall view of every agent run - invaluable when debugging why a request took 45 seconds or cost $0.50.

Key Metrics Dashboard

Build a dashboard with these metrics from day one:

Metric Why It Matters Alert Threshold
P50/P95/P99 latency User experience P95 > 30s
Cost per request Budget control > $0.10/request
Steps per completion Agent efficiency Avg > 10 steps
Tool call success rate Tool reliability < 95%
LLM error rate Provider reliability > 2%
Queue depth Capacity planning > 100 pending
Task completion rate Agent effectiveness < 90%
Tokens per request Cost trending Sudden spikes

Error Handling at Scale

Agents fail in creative ways. The LLM hallucinates a tool name. An API returns a 500. The model provider rate-limits you. A tool execution hangs forever. You need defense in depth.

Retry Strategy

Not all errors are retryable. Be specific:

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception

def is_retryable(exception):
    """Only retry transient errors."""
    if isinstance(exception, RateLimitError):
        return True
    if isinstance(exception, APIConnectionError):
        return True
    if isinstance(exception, TimeoutError):
        return True
    # Don't retry validation errors, auth errors, etc.
    return False

@retry(
    retry=retry_if_exception(is_retryable),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
)
async def call_llm_with_retry(messages, model):
    return await llm.generate(messages=messages, model=model)

Dead Letter Queues

When a task fails all retries, don't lose it. Send it to a dead letter queue for human review:

async def maybe_retry(task_data: dict):
    retries = task_data.get("retry_count", 0)
    if retries < MAX_RETRIES:
        task_data["retry_count"] = retries + 1
        await redis.lpush("agent:tasks", json.dumps(task_data))
    else:
        # Send to dead letter queue
        await redis.lpush("agent:dead_letter", json.dumps({
            **task_data,
            "failed_at": datetime.utcnow().isoformat(),
            "last_error": task_data.get("last_error"),
        }))
        await alert_ops_team(task_data)

Graceful Degradation

When things go wrong, degrade gracefully instead of failing completely:

  • Model fallback: If Claude Opus times out, retry with Sonnet. If Sonnet is down, fall back to Haiku with a simplified prompt.
  • Tool fallback: If the primary search API is down, use a cached result or skip the search step and tell the user.
  • Budget exhaustion: If a request exceeds its cost budget, stop the agent loop and return the best partial result.
Note

The best production agents have a "minimum viable response" - the simplest useful answer they can give when everything else fails. Even if all tools are down and the expensive model is unavailable, returning a helpful message beats returning a 500 error.

Cost Optimization

LLM costs can be staggering at scale. A single complex agent run might cost $0.50 or more. Multiply that by thousands of daily requests and you're looking at serious money.

Model Routing

The single most impactful optimization. Not every request needs your most powerful model:

class ModelRouter:
    async def select_model(self, message: str, context: dict) -> str:
        complexity = await self.estimate_complexity(message)

        if complexity == "simple":
            # Greetings, FAQ, simple lookups
            return "claude-haiku"
        elif complexity == "moderate":
            # Most tool-use tasks, standard queries
            return "claude-sonnet"
        else:
            # Multi-step reasoning, complex analysis
            return "claude-opus"

    async def estimate_complexity(self, message: str) -> str:
        # Use a fast classifier - could be a small model,
        # keyword heuristics, or even regex
        indicators = {
            "simple": ["hello", "thanks", "what is", "define"],
            "complex": ["analyze", "compare", "debug", "refactor",
                       "write a comprehensive", "step by step"],
        }
        # ... classification logic

Real-world impact: routing 60% of traffic to Haiku instead of Sonnet can cut costs by 80% on those requests with minimal quality loss.

Response Caching

Cache at multiple levels:

async def get_or_generate(prompt_hash: str, generate_fn):
    # L1: Check exact match cache
    cached = await redis.get(f"cache:exact:{prompt_hash}")
    if cached:
        return json.loads(cached)

    # L2: Check semantic cache (for similar queries)
    similar = await vector_db.search(prompt_embedding, threshold=0.95)
    if similar:
        return similar[0].response

    # Cache miss: generate and cache
    result = await generate_fn()
    await redis.setex(f"cache:exact:{prompt_hash}", 3600, json.dumps(result))
    return result

Batching

If you're processing many independent requests, batch them:

# Instead of N individual API calls for embeddings:
embeddings = await client.embeddings.create(
    input=all_texts,  # Send all at once
    model="text-embedding-3-small"
)

Latency Optimization

Users hate waiting. Here's how to make agents feel fast:

Streaming Responses

Stream the agent's final response as it's generated. Don't wait for the full response:

async def stream_agent_response(session_id: str, message: str):
    agent = Agent(model="claude-sonnet-4-20250514")

    async for event in agent.run_stream(message):
        if event.type == "thinking":
            yield {"type": "status", "text": "Analyzing your request..."}
        elif event.type == "tool_call":
            yield {"type": "status", "text": f"Using {event.tool_name}..."}
        elif event.type == "text_delta":
            yield {"type": "content", "text": event.text}

This transforms a 20-second wait into a responsive experience where the user sees progress the entire time.

Parallel Tool Execution

When the agent calls multiple tools that don't depend on each other, run them in parallel:

async def execute_tools(tool_calls: list) -> list:
    # Group independent calls
    tasks = [execute_single_tool(tc) for tc in tool_calls]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

This alone can cut agent latency by 40-60% in multi-tool scenarios.

CI/CD for Agents

Agents need a different testing and deployment strategy than traditional software.

Testing Strategy

Unit Tests         → Individual tool functions, prompt formatting
Integration Tests  → Agent + real tools against test fixtures
Eval Suite         → Agent against a benchmark of 50-200 test cases
                     with automated scoring (accuracy, cost, latency)
Shadow Mode        → Agent runs against production traffic but
                     results are logged, not served
Canary Deployment  → 5% of traffic goes to new version,
                     compare metrics against baseline
Full Rollout       → Gradual ramp: 5% → 25% → 50% → 100%

Canary Deployments

# Kubernetes canary with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: agent-worker
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 30m }
        - analysis:
            templates:
              - templateName: agent-success-rate
        - setWeight: 25
        - pause: { duration: 1h }
        - setWeight: 50
        - pause: { duration: 2h }
        - setWeight: 100

The analysis step is critical. It checks whether the canary's success rate, latency, and cost are within acceptable bounds before proceeding.

Rollback Triggers

Auto-rollback when:

  • Task completion rate drops below 85%
  • P95 latency exceeds 2x baseline
  • Average cost per request exceeds 1.5x baseline
  • Error rate exceeds 5%

Infrastructure Comparison

Choosing the right compute platform matters:

Factor Serverless (Lambda/Cloud Functions) Containers (ECS/Cloud Run) VMs (EC2/Compute Engine)
Cold start 1-10s (problematic for agents) Minimal with min instances None
Max execution time 15 min (Lambda) Unlimited Unlimited
Scaling speed Seconds 30s-2min Minutes
Cost at low volume Very low Low-moderate High (always running)
Cost at high volume Can be expensive Moderate Lowest per-unit
Complexity Low Moderate High
State management External only External recommended Local possible
Best for agents? Simple, short tasks Most agent workloads Heavy, long-running agents
Tip

Containers (ECS/Cloud Run/Kubernetes) are the sweet spot for most agent workloads. They handle the long execution times agents need, scale reasonably fast, and don't have the cold start problems of serverless. Use serverless only for lightweight agent tasks that complete in under a minute.

Real-World Deployment: Customer Support Agent on AWS

Let's walk through deploying a customer support agent on AWS. This isn't theoretical - it reflects a pattern I've seen work at scale.

Architecture

Route 53 (DNS)
    → ALB (Application Load Balancer)
        → ECS Service: API (Fargate, 2-10 tasks)
            → SQS Queue (agent tasks)
                → ECS Service: Workers (Fargate, 2-50 tasks)
                    → ElastiCache Redis (state + cache)
                    → RDS PostgreSQL (history + analytics)
                    → Secrets Manager (API keys)
                    → CloudWatch (logs + metrics)
                    → S3 (conversation archives)

Key Configuration

# config.py - production settings
class ProductionConfig:
    # Model settings
    PRIMARY_MODEL = "claude-sonnet-4-20250514"
    FALLBACK_MODEL = "claude-haiku"
    MAX_STEPS = 15
    STEP_TIMEOUT = 30  # seconds per step
    TOTAL_TIMEOUT = 180  # seconds per request
    MAX_COST_PER_REQUEST = 0.25  # USD

    # Scaling
    MIN_WORKERS = 2
    MAX_WORKERS = 50
    SCALE_UP_THRESHOLD = 10  # queue depth
    SCALE_DOWN_THRESHOLD = 2

    # Reliability
    MAX_RETRIES = 3
    CIRCUIT_BREAKER_THRESHOLD = 5  # failures before opening
    CIRCUIT_BREAKER_TIMEOUT = 60  # seconds before half-open

    # Caching
    RESPONSE_CACHE_TTL = 3600
    EMBEDDING_CACHE_TTL = 86400

What This Looks Like Running

With this setup, a typical request flow:

  1. User sends message via WebSocket or HTTP
  2. API server authenticates, validates, enqueues to SQS (50ms)
  3. Worker picks up task, loads conversation history from Redis (10ms)
  4. Agent runs 3-5 steps with tool calls (5-15 seconds)
  5. Response streamed back via WebSocket or returned via HTTP
  6. Conversation history persisted to PostgreSQL (async, 20ms)
  7. Metrics emitted to CloudWatch

Cost at moderate scale (10,000 conversations/day): roughly $500-1,500/month in LLM costs plus $200-400 in infrastructure.

Security in Production

Agent security deserves its own chapter (and it gets one earlier in this book), but here are the production-specific concerns:

API Authentication

Every request to your agent API must be authenticated. Use short-lived tokens, not API keys:

from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer

security = HTTPBearer()

async def verify_token(credentials = Depends(security)):
    token = credentials.credentials
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
        return payload
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

Secrets Management

Never hardcode API keys. Never put them in environment variables in your Dockerfile. Use your cloud provider's secrets manager:

import boto3

def get_api_key(secret_name: str) -> str:
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])["api_key"]

# Cache the secret in memory - don't fetch on every request
ANTHROPIC_API_KEY = get_api_key("prod/anthropic-api-key")

Network Isolation

Your agent workers should live in a private subnet with no direct internet access. They reach external APIs through a NAT gateway or VPC endpoints. This limits the blast radius if an agent is compromised.

VPC
├── Public Subnet
│   ├── ALB
│   └── NAT Gateway
├── Private Subnet
│   ├── API Service (ECS)
│   ├── Worker Service (ECS)
│   ├── ElastiCache
│   └── RDS
Warning

Agents that execute code or use browser tools present elevated security risks. Run these tools in isolated sandboxes - separate containers, restricted network access, ephemeral filesystems. Treat every tool execution as potentially hostile input.

Deployment Checklist

Before you push to production, run through this checklist:

  • All API keys in secrets manager, not environment variables or code
  • Health checks configured for all services
  • Auto-scaling configured with sensible min/max bounds
  • Dead letter queue configured for failed tasks
  • Structured logging with correlation IDs (session_id, task_id)
  • Cost alerts set at 50%, 80%, and 100% of budget
  • Latency alerts on P95
  • Error rate alerts with auto-rollback
  • Eval suite passing at >= 90% accuracy
  • Rate limiting configured per user and globally
  • Network isolation configured (private subnets for workers)
  • Backup and restore tested for state databases
  • Runbook written for common failure scenarios
  • On-call rotation established

Production deployment isn't a one-time event. It's an ongoing practice. Your agent will behave differently at scale than it did in testing, and you'll need to iterate on prompts, tools, and architecture as you learn from real traffic. The infrastructure in this chapter gives you the foundation to do that iteration safely.

In the next chapter, we'll look at real-world use cases - how teams across different industries have built and deployed agents successfully, and what lessons they've learned along the way.