Deepak Gupta

Deploying Agents to Production

You built an agent. It works on your laptop. Your demo wowed the team. Now comes the part nobody warns you about: making it work reliably for thousands of users, 24/7, without burning through your entire cloud budget in a week.

The gap between a demo agent and a production agent is enormous - arguably bigger than the gap for traditional software. Traditional apps have predictable execution paths. Agents don't. They make decisions, call tools in unpredictable orders, and sometimes go on expensive tangents. Deploying them requires a fundamentally different mindset.

This chapter is your field guide to crossing that gap. We'll cover architecture, scaling, observability, cost control, and everything else you need to sleep soundly while your agent handles real traffic.

The Demo-to-Production Gap

Let's be honest about what your demo agent is missing:

Aspect	Demo	Production
Concurrency	1 user at a time	Hundreds or thousands simultaneous
Error handling	`try/catch`, maybe a retry	Dead letter queues, circuit breakers, graceful degradation
State management	In-memory dict	Redis, PostgreSQL, distributed state
Cost control	"We'll optimize later"	Per-request budgets, model routing, caching
Observability	`print()` statements	Structured logging, distributed tracing, alerting
Latency	"It's fast enough"	P95 latency targets, streaming, async execution
Security	Hardcoded API keys	Secrets management, network isolation, auth layers
Testing	Manual "does it look right?"	Automated eval suites, regression tests, canary deployments

If you're reading this list and thinking "we only need half of these," you're probably wrong. Production has a way of punishing optimism.

Architecture for Production Agents

A production agent system is not a single process. It's a distributed system with several distinct layers:

┌─────────────────────────────────────────────────────┐
│                   API Gateway                        │
│         (Authentication, Rate Limiting)              │
├─────────────────────────────────────────────────────┤
│                  API Layer                           │
│     (Request validation, session management)         │
├─────────────────────────────────────────────────────┤
│              Task Queue (Redis / SQS)                │
├──────────┬──────────┬──────────┬────────────────────┤
│ Worker 1 │ Worker 2 │ Worker 3 │  ... Worker N      │
│ (Agent)  │ (Agent)  │ (Agent)  │  (Agent)           │
├──────────┴──────────┴──────────┴────────────────────┤
│           Shared State (Redis + PostgreSQL)          │
├─────────────────────────────────────────────────────┤
│     Tool Services (APIs, DBs, External Services)    │
└─────────────────────────────────────────────────────┘

The API Layer receives requests, authenticates users, validates input, and enqueues work. It should never run agent logic directly - that's a recipe for timeouts and resource exhaustion.

The Task Queue decouples request acceptance from processing. When a user sends a message, the API layer returns immediately with a task ID. The agent work happens asynchronously in workers.

Workers are the processes that actually run your agents. They pull tasks from the queue, execute the agent loop, and store results. Each worker runs one agent task at a time (usually), which makes scaling straightforward.

Shared State is where conversation history, tool results, and agent memory live. This must be accessible to any worker, because you can't guarantee which worker will handle which request.

Tip

Start with a simple architecture. Many teams over-engineer from day one. If you have fewer than 100 concurrent users, a single API server with a Redis queue and a few workers is plenty. Scale when you have data showing you need to.

Containerization: Packaging Agents with Docker

Agents have more dependencies than typical web apps - model client libraries, tool-specific packages, possibly system-level binaries for things like PDF parsing or browser automation. Docker is the sane way to manage this.

Here's a production-grade Dockerfile for a Python-based agent:

# Stage 1: Build dependencies
FROM python:3.12-slim AS builder

WORKDIR /app

# Install system dependencies needed for building
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: Runtime
FROM python:3.12-slim AS runtime

WORKDIR /app

# Install only runtime system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy installed Python packages from builder
COPY --from=builder /install /usr/local

# Copy application code
COPY src/ ./src/
COPY config/ ./config/

# Create non-root user
RUN useradd -m -r agentuser && chown -R agentuser:agentuser /app
USER agentuser

# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

ENTRYPOINT ["python", "-m", "src.worker"]

A few things worth noting:

Multi-stage build keeps the final image small. Build tools don't ship to production.
Non-root user is non-negotiable for security. Never run agents as root.
Health check lets your orchestrator know when a worker is stuck. Agents can hang - an LLM call might never return, a tool might deadlock. Health checks catch this.
PYTHONUNBUFFERED ensures logs appear immediately, which matters enormously when debugging agent behavior.

For the API server, you'd have a separate Dockerfile (or a different entrypoint in the same image):

ENTRYPOINT ["uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "8080"]

Warning

If your agent uses browser automation tools (Playwright, Selenium), you'll need a significantly larger base image with Chromium. Consider running browser tools as a separate microservice to keep your main agent image lean.

Scaling Strategies

Agents are weird to scale because their resource usage is bursty and unpredictable. A simple question might take 2 seconds and one LLM call. A complex task might take 90 seconds and fifteen LLM calls with tool use.

Horizontal Scaling with Queue-Based Processing

The most reliable pattern is queue-based processing with auto-scaling workers:

# worker.py - simplified production worker
import asyncio
import redis.asyncio as redis
import json
from agent import Agent
from config import settings

pool = redis.ConnectionPool.from_url(settings.REDIS_URL)

async def process_task(task_data: dict):
    agent = Agent(
        model=task_data.get("model", "claude-sonnet-4-20250514"),
        max_steps=task_data.get("max_steps", 25),
        timeout=task_data.get("timeout", 120),
    )

    try:
        result = await agent.run(
            messages=task_data["messages"],
            tools=task_data["tools"],
            session_id=task_data["session_id"],
        )
        await store_result(task_data["task_id"], result)
    except Exception as e:
        await store_error(task_data["task_id"], e)
        await maybe_retry(task_data)

async def worker_loop():
    r = redis.Redis(connection_pool=pool)
    while True:
        # BRPOP blocks until a task is available (or timeout)
        _, raw = await r.brpop("agent:tasks", timeout=5)
        if raw:
            task = json.loads(raw)
            await process_task(task)

if __name__ == "__main__":
    asyncio.run(worker_loop())

Scale workers based on queue depth. Most cloud platforms support this natively:

AWS: ECS tasks scaling on SQS queue depth via CloudWatch alarms
GCP: Cloud Run jobs scaling on Pub/Sub message count
Kubernetes: KEDA (Kubernetes Event-Driven Autoscaling) watching Redis list length

Load Balancing Considerations

Standard round-robin load balancing works for the API layer, but think carefully about sticky sessions if your agent maintains conversational state. Options:

Stateless workers + shared state (recommended): Any worker can handle any request. State lives in Redis/PostgreSQL.
Sticky sessions: Route all requests from one conversation to the same worker. Simpler code, but harder to scale and recover from failures.
Hybrid: Keep hot state (current turn) in-worker, persist cold state (history) to shared storage.

State Management in Production

Agents are stateful creatures. They maintain conversation history, tool results, memory, and sometimes long-running context. Here's how to manage this:

State Type	Storage	TTL	Example
Conversation history	PostgreSQL	Days to months	Full message history for a session
Current turn context	Redis	Minutes to hours	Active agent loop state, tool call results
Agent memory	PostgreSQL + vector DB	Long-term	Learned facts, user preferences
Task queue	Redis / SQS	Minutes	Pending agent tasks
Rate limiting	Redis	Seconds to minutes	Per-user request counts
Cache	Redis	Hours to days	Repeated query results, embeddings

# State management pattern
class AgentStateManager:
    def __init__(self, redis_client, pg_pool):
        self.redis = redis_client
        self.pg = pg_pool

    async def save_turn(self, session_id: str, messages: list):
        """Save current turn to Redis for fast access."""
        key = f"session:{session_id}:current"
        await self.redis.setex(key, 3600, json.dumps(messages))

    async def persist_history(self, session_id: str, messages: list):
        """Persist full history to PostgreSQL."""
        async with self.pg.acquire() as conn:
            await conn.execute(
                """INSERT INTO conversation_history
                   (session_id, messages, updated_at)
                   VALUES ($1, $2, NOW())
                   ON CONFLICT (session_id)
                   DO UPDATE SET messages = $2, updated_at = NOW()""",
                session_id, json.dumps(messages)
            )

    async def get_context(self, session_id: str) -> list:
        """Try Redis first, fall back to PostgreSQL."""
        cached = await self.redis.get(f"session:{session_id}:current")
        if cached:
            return json.loads(cached)

        async with self.pg.acquire() as conn:
            row = await conn.fetchrow(
                "SELECT messages FROM conversation_history WHERE session_id = $1",
                session_id
            )
            return json.loads(row["messages"]) if row else []

Observability: Seeing What Your Agent Does

Traditional apps log request/response. Agents need much more. Each agent run is a tree of decisions, tool calls, and model invocations. You need to see all of it.

Structured Logging

Every log line should be a JSON object with consistent fields:

import structlog

logger = structlog.get_logger()

async def run_agent_step(session_id, step_num, message):
    logger.info(
        "agent_step_start",
        session_id=session_id,
        step=step_num,
        input_tokens=count_tokens(message),
    )

    result = await llm.generate(message)

    logger.info(
        "agent_step_complete",
        session_id=session_id,
        step=step_num,
        output_tokens=result.usage.output_tokens,
        model=result.model,
        tool_calls=len(result.tool_calls),
        latency_ms=result.latency_ms,
        cost_usd=calculate_cost(result.usage),
    )

Distributed Tracing

Use OpenTelemetry to trace the full lifecycle of an agent run. Each tool call, each LLM invocation, each database query becomes a span in a trace:

from opentelemetry import trace

tracer = trace.get_tracer("agent-service")

async def run_agent(session_id: str, user_message: str):
    with tracer.start_as_current_span("agent_run") as span:
        span.set_attribute("session_id", session_id)

        for step in range(max_steps):
            with tracer.start_as_current_span(f"step_{step}"):
                response = await call_llm(messages)

                if response.tool_calls:
                    for tool_call in response.tool_calls:
                        with tracer.start_as_current_span(f"tool_{tool_call.name}"):
                            result = await execute_tool(tool_call)

This gives you a waterfall view of every agent run - invaluable when debugging why a request took 45 seconds or cost $0.50.

Key Metrics Dashboard

Build a dashboard with these metrics from day one:

Metric	Why It Matters	Alert Threshold
P50/P95/P99 latency	User experience	P95 > 30s
Cost per request	Budget control	> $0.10/request
Steps per completion	Agent efficiency	Avg > 10 steps
Tool call success rate	Tool reliability	< 95%
LLM error rate	Provider reliability	> 2%
Queue depth	Capacity planning	> 100 pending
Task completion rate	Agent effectiveness	< 90%
Tokens per request	Cost trending	Sudden spikes

Error Handling at Scale

Agents fail in creative ways. The LLM hallucinates a tool name. An API returns a 500. The model provider rate-limits you. A tool execution hangs forever. You need defense in depth.

Retry Strategy

Not all errors are retryable. Be specific:

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception

def is_retryable(exception):
    """Only retry transient errors."""
    if isinstance(exception, RateLimitError):
        return True
    if isinstance(exception, APIConnectionError):
        return True
    if isinstance(exception, TimeoutError):
        return True
    # Don't retry validation errors, auth errors, etc.
    return False

@retry(
    retry=retry_if_exception(is_retryable),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
)
async def call_llm_with_retry(messages, model):
    return await llm.generate(messages=messages, model=model)

Dead Letter Queues

When a task fails all retries, don't lose it. Send it to a dead letter queue for human review:

async def maybe_retry(task_data: dict):
    retries = task_data.get("retry_count", 0)
    if retries < MAX_RETRIES:
        task_data["retry_count"] = retries + 1
        await redis.lpush("agent:tasks", json.dumps(task_data))
    else:
        # Send to dead letter queue
        await redis.lpush("agent:dead_letter", json.dumps({
            **task_data,
            "failed_at": datetime.utcnow().isoformat(),
            "last_error": task_data.get("last_error"),
        }))
        await alert_ops_team(task_data)

Graceful Degradation

When things go wrong, degrade gracefully instead of failing completely:

Model fallback: If Claude Opus times out, retry with Sonnet. If Sonnet is down, fall back to Haiku with a simplified prompt.
Tool fallback: If the primary search API is down, use a cached result or skip the search step and tell the user.
Budget exhaustion: If a request exceeds its cost budget, stop the agent loop and return the best partial result.

Note

The best production agents have a "minimum viable response" - the simplest useful answer they can give when everything else fails. Even if all tools are down and the expensive model is unavailable, returning a helpful message beats returning a 500 error.

Cost Optimization

LLM costs can be staggering at scale. A single complex agent run might cost $0.50 or more. Multiply that by thousands of daily requests and you're looking at serious money.

Model Routing

The single most impactful optimization. Not every request needs your most powerful model:

class ModelRouter:
    async def select_model(self, message: str, context: dict) -> str:
        complexity = await self.estimate_complexity(message)

        if complexity == "simple":
            # Greetings, FAQ, simple lookups
            return "claude-haiku"
        elif complexity == "moderate":
            # Most tool-use tasks, standard queries
            return "claude-sonnet"
        else:
            # Multi-step reasoning, complex analysis
            return "claude-opus"

    async def estimate_complexity(self, message: str) -> str:
        # Use a fast classifier - could be a small model,
        # keyword heuristics, or even regex
        indicators = {
            "simple": ["hello", "thanks", "what is", "define"],
            "complex": ["analyze", "compare", "debug", "refactor",
                       "write a comprehensive", "step by step"],
        }
        # ... classification logic

Real-world impact: routing 60% of traffic to Haiku instead of Sonnet can cut costs by 80% on those requests with minimal quality loss.

Response Caching

Cache at multiple levels:

async def get_or_generate(prompt_hash: str, generate_fn):
    # L1: Check exact match cache
    cached = await redis.get(f"cache:exact:{prompt_hash}")
    if cached:
        return json.loads(cached)

    # L2: Check semantic cache (for similar queries)
    similar = await vector_db.search(prompt_embedding, threshold=0.95)
    if similar:
        return similar[0].response

    # Cache miss: generate and cache
    result = await generate_fn()
    await redis.setex(f"cache:exact:{prompt_hash}", 3600, json.dumps(result))
    return result

Batching

If you're processing many independent requests, batch them:

# Instead of N individual API calls for embeddings:
embeddings = await client.embeddings.create(
    input=all_texts,  # Send all at once
    model="text-embedding-3-small"
)

Latency Optimization

Users hate waiting. Here's how to make agents feel fast:

Streaming Responses

Stream the agent's final response as it's generated. Don't wait for the full response:

async def stream_agent_response(session_id: str, message: str):
    agent = Agent(model="claude-sonnet-4-20250514")

    async for event in agent.run_stream(message):
        if event.type == "thinking":
            yield {"type": "status", "text": "Analyzing your request..."}
        elif event.type == "tool_call":
            yield {"type": "status", "text": f"Using {event.tool_name}..."}
        elif event.type == "text_delta":
            yield {"type": "content", "text": event.text}

This transforms a 20-second wait into a responsive experience where the user sees progress the entire time.

Parallel Tool Execution

When the agent calls multiple tools that don't depend on each other, run them in parallel:

async def execute_tools(tool_calls: list) -> list:
    # Group independent calls
    tasks = [execute_single_tool(tc) for tc in tool_calls]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

This alone can cut agent latency by 40-60% in multi-tool scenarios.

CI/CD for Agents

Agents need a different testing and deployment strategy than traditional software.

Testing Strategy

Unit Tests         → Individual tool functions, prompt formatting
Integration Tests  → Agent + real tools against test fixtures
Eval Suite         → Agent against a benchmark of 50-200 test cases
                     with automated scoring (accuracy, cost, latency)
Shadow Mode        → Agent runs against production traffic but
                     results are logged, not served
Canary Deployment  → 5% of traffic goes to new version,
                     compare metrics against baseline
Full Rollout       → Gradual ramp: 5% → 25% → 50% → 100%

Canary Deployments

# Kubernetes canary with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: agent-worker
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 30m }
        - analysis:
            templates:
              - templateName: agent-success-rate
        - setWeight: 25
        - pause: { duration: 1h }
        - setWeight: 50
        - pause: { duration: 2h }
        - setWeight: 100

The analysis step is critical. It checks whether the canary's success rate, latency, and cost are within acceptable bounds before proceeding.

Rollback Triggers

Auto-rollback when:

Task completion rate drops below 85%
P95 latency exceeds 2x baseline
Average cost per request exceeds 1.5x baseline
Error rate exceeds 5%

Infrastructure Comparison

Choosing the right compute platform matters:

Factor	Serverless (Lambda/Cloud Functions)	Containers (ECS/Cloud Run)	VMs (EC2/Compute Engine)
Cold start	1-10s (problematic for agents)	Minimal with min instances	None
Max execution time	15 min (Lambda)	Unlimited	Unlimited
Scaling speed	Seconds	30s-2min	Minutes
Cost at low volume	Very low	Low-moderate	High (always running)
Cost at high volume	Can be expensive	Moderate	Lowest per-unit
Complexity	Low	Moderate	High
State management	External only	External recommended	Local possible
Best for agents?	Simple, short tasks	Most agent workloads	Heavy, long-running agents

Tip

Containers (ECS/Cloud Run/Kubernetes) are the sweet spot for most agent workloads. They handle the long execution times agents need, scale reasonably fast, and don't have the cold start problems of serverless. Use serverless only for lightweight agent tasks that complete in under a minute.

Real-World Deployment: Customer Support Agent on AWS

Let's walk through deploying a customer support agent on AWS. This isn't theoretical - it reflects a pattern I've seen work at scale.

Architecture

Route 53 (DNS)
    → ALB (Application Load Balancer)
        → ECS Service: API (Fargate, 2-10 tasks)
            → SQS Queue (agent tasks)
                → ECS Service: Workers (Fargate, 2-50 tasks)
                    → ElastiCache Redis (state + cache)
                    → RDS PostgreSQL (history + analytics)
                    → Secrets Manager (API keys)
                    → CloudWatch (logs + metrics)
                    → S3 (conversation archives)

Key Configuration

# config.py - production settings
class ProductionConfig:
    # Model settings
    PRIMARY_MODEL = "claude-sonnet-4-20250514"
    FALLBACK_MODEL = "claude-haiku"
    MAX_STEPS = 15
    STEP_TIMEOUT = 30  # seconds per step
    TOTAL_TIMEOUT = 180  # seconds per request
    MAX_COST_PER_REQUEST = 0.25  # USD

    # Scaling
    MIN_WORKERS = 2
    MAX_WORKERS = 50
    SCALE_UP_THRESHOLD = 10  # queue depth
    SCALE_DOWN_THRESHOLD = 2

    # Reliability
    MAX_RETRIES = 3
    CIRCUIT_BREAKER_THRESHOLD = 5  # failures before opening
    CIRCUIT_BREAKER_TIMEOUT = 60  # seconds before half-open

    # Caching
    RESPONSE_CACHE_TTL = 3600
    EMBEDDING_CACHE_TTL = 86400

What This Looks Like Running

With this setup, a typical request flow:

User sends message via WebSocket or HTTP
API server authenticates, validates, enqueues to SQS (50ms)
Worker picks up task, loads conversation history from Redis (10ms)
Agent runs 3-5 steps with tool calls (5-15 seconds)
Response streamed back via WebSocket or returned via HTTP
Conversation history persisted to PostgreSQL (async, 20ms)
Metrics emitted to CloudWatch

Cost at moderate scale (10,000 conversations/day): roughly $500-1,500/month in LLM costs plus $200-400 in infrastructure.

Security in Production

Agent security deserves its own chapter (and it gets one earlier in this book), but here are the production-specific concerns:

API Authentication

Every request to your agent API must be authenticated. Use short-lived tokens, not API keys:

from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer

security = HTTPBearer()

async def verify_token(credentials = Depends(security)):
    token = credentials.credentials
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
        return payload
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

Secrets Management

Never hardcode API keys. Never put them in environment variables in your Dockerfile. Use your cloud provider's secrets manager:

import boto3

def get_api_key(secret_name: str) -> str:
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])["api_key"]

# Cache the secret in memory - don't fetch on every request
ANTHROPIC_API_KEY = get_api_key("prod/anthropic-api-key")

Network Isolation

Your agent workers should live in a private subnet with no direct internet access. They reach external APIs through a NAT gateway or VPC endpoints. This limits the blast radius if an agent is compromised.

VPC
├── Public Subnet
│   ├── ALB
│   └── NAT Gateway
├── Private Subnet
│   ├── API Service (ECS)
│   ├── Worker Service (ECS)
│   ├── ElastiCache
│   └── RDS

Warning

Agents that execute code or use browser tools present elevated security risks. Run these tools in isolated sandboxes - separate containers, restricted network access, ephemeral filesystems. Treat every tool execution as potentially hostile input.

Deployment Checklist

Before you push to production, run through this checklist:

Production deployment isn't a one-time event. It's an ongoing practice. Your agent will behave differently at scale than it did in testing, and you'll need to iterate on prompts, tools, and architecture as you learn from real traffic. The infrastructure in this chapter gives you the foundation to do that iteration safely.

In the next chapter, we'll look at real-world use cases - how teams across different industries have built and deployed agents successfully, and what lessons they've learned along the way.