
Real-World Use Cases and Case Studies

Theory is useful. Working code is better. But nothing teaches like seeing what happens when real teams deploy agents to solve real problems - the unexpected wins, the painful failures, and the lessons that only come from production.

This chapter is a collection of use cases drawn from how organizations are actually using AI agents today. For each one, we'll look at the architecture, how it works in practice, and - most importantly - what the teams learned that they wish they'd known at the start.

Software Engineering Agents

This is the use case that's moved the fastest, and for good reason: developers building tools for developers creates a tight feedback loop.

The Landscape

The software engineering agent space has stratified into clear tiers:

| Tool | Type | Approach | Best For |
| --- | --- | --- | --- |
| Claude Code | CLI agent | Autonomous coding in terminal, full repo access | Complex multi-file changes, refactoring |
| Cursor | IDE-integrated agent | Inline suggestions + agent mode in editor | Day-to-day coding with context |
| GitHub Copilot Workspace | Cloud agent | Issue-to-PR automation | Well-defined issues with clear specs |
| Aider | CLI agent | Git-aware pair programming | Open source, local-first workflows |
| Devin | Autonomous agent | Full SDLC from planning to PR | Delegating complete tasks |

How a Code Review Agent Works

Let's look at a concrete architecture - an agent that reviews pull requests:

GitHub Webhook (PR opened/updated)
    → API Gateway
        → Review Queue (SQS)
            → Review Worker
                ├── Fetch diff from GitHub API
                ├── Fetch relevant files for context
                ├── Analyze changes with LLM
                │   ├── Check for bugs
                │   ├── Check for security issues
                │   ├── Check style/conventions
                │   └── Check test coverage
                ├── Format review comments
                └── Post review via GitHub API

A simplified version of the review worker:

async def review_pull_request(pr_data: dict):
    # Fetch the diff and surrounding context
    diff = await github.get_pr_diff(pr_data["repo"], pr_data["number"])
    changed_files = parse_diff(diff)

    # Get full file contents for changed files (need context)
    file_contexts = {}
    for file in changed_files:
        file_contexts[file.path] = await github.get_file_content(
            pr_data["repo"], file.path, pr_data["base_sha"]
        )

    # Build review prompt with structured output
    review = await agent.run(
        system="""You are an expert code reviewer. Analyze the PR diff
        and provide specific, actionable feedback. Focus on:
        1. Bugs and logic errors
        2. Security vulnerabilities
        3. Performance issues
        4. Code style and maintainability

        For each issue, cite the specific file and line number.
        Rate severity as: critical, warning, or suggestion.""",
        messages=[{
            "role": "user",
            "content": format_review_context(diff, file_contexts)
        }],
        output_schema=ReviewSchema,
    )

    # Post comments on the PR
    for comment in review.comments:
        await github.create_review_comment(
            repo=pr_data["repo"],
            pr_number=pr_data["number"],
            body=comment.message,
            path=comment.file,
            line=comment.line,
        )
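
The `ReviewSchema` passed as `output_schema` is not defined in this chapter. One plausible shape is sketched below with dataclasses. The field names mirror how the loop above reads `comment.file`, `comment.line`, and `comment.message`; the `severity` field follows the system prompt. The `postable_comments` helper is an assumption added for illustration: it drops comments the model attached to files outside the diff and orders the rest most-severe first.

```python
from dataclasses import dataclass, field

# Severity labels come from the system prompt; the ordering is an assumption.
SEVERITY_ORDER = {"critical": 0, "warning": 1, "suggestion": 2}


@dataclass
class ReviewComment:
    file: str       # path the comment refers to
    line: int       # line number cited by the model
    severity: str   # "critical", "warning", or "suggestion"
    message: str


@dataclass
class ReviewSchema:
    comments: list[ReviewComment] = field(default_factory=list)


def postable_comments(
    review: ReviewSchema, changed_paths: set[str]
) -> list[ReviewComment]:
    """Keep comments on changed files with a known severity and a positive
    line number, sorted most-severe first."""
    valid = [
        c for c in review.comments
        if c.file in changed_paths
        and c.severity in SEVERITY_ORDER
        and c.line > 0
    ]
    return sorted(valid, key=lambda c: SEVERITY_ORDER[c.severity])
```

Filtering before posting matters because a model can cite a file or line that is not part of the PR, and a review API will typically reject such a comment.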

Key Learnings from Software Engineering Agents

  1. Context is everything. The same model that writes brilliant code with full repo context writes garbage with just a function signature. Invest heavily in context retrieval - file dependency graphs, recent git history, coding conventions.

  2. Agents are better at specific tasks than general coding. A "write any code" agent underperforms compared to specialized agents for testing, refactoring, debugging, and documentation.

  3. Human review remains essential. Even the best coding agents introduce subtle bugs approximately 5-15% of the time. The workflow should be agent-generates, human-reviews - not agent-deploys.
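
As an illustration of the "file dependency graph" idea in the first point, here is a minimal sketch for a Python repository. It uses the standard-library `ast` module to find which in-repo modules a changed file imports directly. The function names and the `{module_name: source}` input format are assumptions, and a real retriever would also need to handle packages, relative imports, and other languages.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a Python source string.
    Relative imports (from . import x) are skipped in this sketch."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def context_files(changed: str, repo: dict[str, str]) -> list[str]:
    """Given a changed module and a {module_name: source} map, list the
    in-repo modules it imports directly, as candidates for extra context."""
    deps = imported_modules(repo[changed])
    return sorted(name for name in deps if name in repo and name != changed)
```

The output is a candidate list only; how many of those files fit in the prompt is a separate budgeting decision.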

Tip

The biggest ROI in software engineering agents isn't code generation - it's code review and test writing. These are tasks developers avoid, so the bar for "good enough" is lower, and the productivity gain is enormous.

Customer Support Agents

Customer support was one of the first domains where agents delivered clear ROI. The economics are straightforward: a human support agent costs $15-25/hour, while an AI agent handles a ticket for $0.05-0.50.

Architecture

Customer Message (chat/email/ticket)
    → Classification Agent
        ├── Simple Query → Direct Response Agent
        │   └── Response (no human needed)
        ├── Known Issue → Resolution Agent
        │   ├── Execute resolution steps (refund, reset, etc.)
        │   └── Confirm with customer
        ├── Complex Issue → Escalation Agent
        │   ├── Gather context and diagnostics
        │   ├── Summarize for human agent
        │   └── Route to appropriate team
        └── Sensitive Issue → Direct to Human
            (billing disputes, legal, angry customers)

The Three-Tier Model

The most successful support agent deployments use three tiers:

Tier 1 - Fully Automated (40-60% of tickets): Password resets, order status, FAQ answers, simple account changes. The agent handles these end-to-end with no human involvement.

Tier 2 - Agent-Assisted (25-35% of tickets): The agent diagnoses the issue, gathers relevant information, drafts a response, and presents it to a human agent for approval. The human reviews and sends - or edits and sends.

Tier 3 - Human with Context (10-20% of tickets): Complex, sensitive, or novel issues where the agent can't help directly. But even here, the agent adds value by pulling up the customer's history, summarizing the issue, and suggesting relevant knowledge base articles.

class TicketClassifier:
    async def classify(self, ticket: dict) -> TicketRoute:
        # Check for sensitive topics first (rule-based, not LLM)
        if self.contains_sensitive_topic(ticket["message"]):
            return TicketRoute.HUMAN_DIRECT

        # Use LLM for nuanced classification
        classification = await self.agent.run(
            system="Classify this support ticket.",
            messages=[{"role": "user", "content": ticket["message"]}],
            output_schema=TicketClassification,
        )

        if classification.confidence > 0.9 and classification.tier == "simple":
            return TicketRoute.AUTOMATED
        elif classification.confidence > 0.7:
            return TicketRoute.AGENT_ASSISTED
        else:
            return TicketRoute.HUMAN_WITH_CONTEXT
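
The classifier above relies on a rule-based `contains_sensitive_topic` check that the chapter does not show. A minimal sketch follows. The keyword patterns are illustrative assumptions based on the "billing disputes, legal" categories in the architecture diagram, and the `TicketRoute` enum values are likewise assumed. The `route` function repeats the classifier's decision order with the LLM output passed in as plain arguments, so the thresholds can be unit-tested without a model call.

```python
import re
from enum import Enum


class TicketRoute(Enum):
    AUTOMATED = "automated"
    AGENT_ASSISTED = "agent_assisted"
    HUMAN_WITH_CONTEXT = "human_with_context"
    HUMAN_DIRECT = "human_direct"


# Illustrative patterns only; a real list would come from support and legal teams.
SENSITIVE_PATTERNS = [
    r"\bchargeback\b",
    r"\bdispute\b",
    r"\blawyer\b",
    r"\battorney\b",
    r"\blawsuit\b",
    r"\bsue\b",
]
_SENSITIVE_RE = re.compile("|".join(SENSITIVE_PATTERNS), re.IGNORECASE)


def contains_sensitive_topic(message: str) -> bool:
    """Rule-based check: true if any sensitive keyword appears as a whole word."""
    return _SENSITIVE_RE.search(message) is not None


def route(message: str, tier: str, confidence: float) -> TicketRoute:
    """Same decision order as the classifier above: sensitive topics first,
    then the 0.9 and 0.7 confidence thresholds."""
    if contains_sensitive_topic(message):
        return TicketRoute.HUMAN_DIRECT
    if confidence > 0.9 and tier == "simple":
        return TicketRoute.AUTOMATED
    if confidence > 0.7:
        return TicketRoute.AGENT_ASSISTED
    return TicketRoute.HUMAN_WITH_CONTEXT
```

Keeping the sensitive-topic check outside the LLM means it behaves the same way on every ticket and can be audited line by line.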

Key Learnings

  • Start with deflection, not resolution. Before building an agent that fixes problems, build one that answers questions. FAQ deflection has the highest ROI and lowest risk.
  • Tone matters more than accuracy. Customers will forgive a slightly imperfect answer delivered empathetically. They won't forgive a technically correct answer that sounds robotic.
  • Escalation is a feature, not a failure. The best support agents know when to hand off. Building great escalation paths (with full context transfer) matters more than pushing automation rates higher.

Data Analysis Agents

Data analysis agents translate natural language into SQL, generate reports, and surface anomalies. They're popular because they democratize data access - anyone can query the database without learning SQL.

How Natural Language to SQL Works

class DataAnalysisAgent:
    def __init__(self, db_schema: str, examples: list):
        self.system_prompt = f"""You are a data analyst. Convert natural
        language questions into SQL queries for this database:

        {db_schema}

        Rules:
        - Always use table aliases
        - Limit results to 1000 rows unless asked for more
        - Use CTEs for complex queries
        - Never run DELETE, UPDATE, INSERT, or DROP statements

        Example queries:
        {self.format_examples(examples)}"""

    async def analyze(self, question: str) -> AnalysisResult:
        # Step 1: Generate SQL
        sql = await self.generate_sql(question)

        # Step 2: Validate SQL (parse tree analysis, not execution)
        validation = self.validate_sql(sql)
        if not validation.safe:
            return AnalysisResult(error="Query contains unsafe operations")

        # Step 3: Execute with timeout and row limit
        results = await self.execute_with_guard(sql, timeout=30, max_rows=1000)

        # Step 4: Generate natural language summary
        summary = await self.summarize_results(question, sql, results)

        return AnalysisResult(sql=sql, data=results, summary=summary)

Warning

Never let an agent execute arbitrary SQL against your production database. Always use a read-only replica, enforce query timeouts, restrict to SELECT statements, and limit result sizes. A single bad query can take down your database.
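
Step 2 of `analyze` calls a `validate_sql` method that is not shown. The chapter describes it as parse-tree analysis; the sketch below swaps in a simpler, deliberately conservative string-level check that uses only the standard library. It is an assumption about what such a guard could look like, not the author's implementation, and it complements the read-only replica and timeouts in the warning above. It does not replace them. The `Validation` type and the keyword list are hypothetical.

```python
import re
from dataclasses import dataclass

# Keywords that should never appear in a read-only analytics query.
FORBIDDEN = {
    "insert", "update", "delete", "drop", "alter", "create",
    "truncate", "grant", "revoke", "merge", "call", "copy",
}

# Strings, quoted identifiers, and comments are matched in one left-to-right
# pass, so a "--" inside a string literal is not mistaken for a comment.
_SKIP = re.compile(
    "|".join([
        r"'(?:[^']|'')*'",   # string literal
        r'"(?:[^"]|"")*"',   # quoted identifier
        r"--[^\n]*",         # line comment
        r"/\*.*?\*/",        # block comment
    ]),
    re.DOTALL,
)


@dataclass
class Validation:
    safe: bool
    reason: str = ""


def validate_sql(sql: str) -> Validation:
    """Accept a single SELECT/WITH statement that contains no forbidden keyword."""
    cleaned = _SKIP.sub(" ", sql).lower()
    statements = [s.strip() for s in cleaned.split(";") if s.strip()]
    if len(statements) != 1:
        return Validation(False, "expected exactly one statement")
    words = re.findall(r"[a-z_][a-z0-9_]*", statements[0])
    if not words or words[0] not in {"select", "with"}:
        return Validation(False, "query must start with SELECT or WITH")
    banned = FORBIDDEN.intersection(words)
    if banned:
        return Validation(False, f"forbidden keyword: {sorted(banned)[0]}")
    return Validation(True)
```

The check errs toward rejection: a legitimate query that selects a column literally named `update` would be refused. For a guard in front of a database, that trade-off is usually acceptable.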

Anomaly Detection Pattern

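# metric and timeframe are interpolated into the SQL string below, so only
# accept values from fixed allowlists. These sets are illustrative.
ALLOWED_METRICS = {"revenue", "signups", "error_rate"}
ALLOWED_TIMEFRAMES = {"7 days", "30 days", "90 days"}

async def detect_anomalies(metric: str, timeframe: str):
    if metric not in ALLOWED_METRICS or timeframe not in ALLOWED_TIMEFRAMES:
        raise ValueError(f"Unsupported metric or timeframe: {metric!r}, {timeframe!r}")

    # Fetch historical data
    history = await db.query(
        f"SELECT date, {metric} FROM metrics WHERE date > NOW() - INTERVAL '{timeframe}'"
    )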

    # Ask the agent to analyze
    analysis = await agent.run(
        system="You are a data analyst. Identify anomalies and explain them.",
        messages=[{
            "role": "user",
            "content": f"Analyze this {metric} data for anomalies:\n{format_data(history)}"
        }],
        tools=[query_tool, chart_tool, compare_tool],
    )
    return analysis

Key Learnings

  • Schema descriptions matter enormously. Column names like crt_dt and usr_id are meaningless to an LLM. Add natural language descriptions to your schema: "crt_dt: the date the customer account was created."
  • Few-shot examples are your best investment. Five well-chosen example question-SQL pairs improve accuracy more than any amount of prompt engineering.
  • Results interpretation is as important as SQL generation. A correct query with a wrong interpretation is worse than no answer at all.
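
The first two points can be sketched as a pair of small formatting helpers. `DataAnalysisAgent` above calls a `format_examples` method that is never shown; the version here is an assumption about its shape, as is `describe_schema` and its `{table: {column: description}}` input. The `usr_id` description in the usage example is made up for illustration.

```python
def describe_schema(tables: dict[str, dict[str, str]]) -> str:
    """Render {table: {column: description}} as text for the system prompt,
    pairing every cryptic column name with a plain-language description."""
    lines = []
    for table, columns in tables.items():
        lines.append(f"TABLE {table}")
        for column, description in columns.items():
            lines.append(f"  {column}  -- {description}")
    return "\n".join(lines)


def format_examples(examples: list[tuple[str, str]]) -> str:
    """Render (question, sql) pairs as few-shot examples for the prompt."""
    return "\n\n".join(f"Q: {question}\nSQL: {sql}" for question, sql in examples)


schema_text = describe_schema({
    "customers": {
        "crt_dt": "the date the customer account was created",
        "usr_id": "unique customer identifier",  # illustrative description
    }
})
examples_text = format_examples([
    ("How many customers do we have?", "SELECT COUNT(*) FROM customers c"),
])
```

Keeping descriptions in a data structure rather than inside the prompt string makes them easier to review and update when the schema changes.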

Sales and Marketing Agents

Sales agents handle lead qualification, personalized outreach, and content creation. The key insight: these agents don't replace salespeople - they multiply them.

Lead Qualification Agent

class LeadQualificationAgent:
    """Scores incoming leads and routes them to the right sales motion."""

    async def qualify(self, lead: dict) -> QualificationResult:
        # Enrich lead data from external sources
        enriched = await self.enrich_lead(lead)

        result = await self.agent.run(
            system="""Score this lead on the BANT framework:
            - Budget: Can they afford our solution?
            - Authority: Is this a decision maker?
            - Need: Do they have a problem we solve?
            - Timeline: Are they buying soon?

            Score each dimension 1-5. Total >= 16: hot lead.
            Total 10-15: warm. Total < 10: nurture.""",
            messages=[{"role": "user", "content": json.dumps(enriched)}],
            output_schema=BANTScore,
        )

        await self.route_lead(lead, result)
        return result
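
The `BANTScore` output schema is not defined in the chapter. A minimal sketch is below, assuming four integer fields. The thresholds are copied from the system prompt above; computing the total and tier in code, rather than asking the model to add the numbers, is a design choice of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BANTScore:
    budget: int
    authority: int
    need: int
    timeline: int

    def __post_init__(self):
        # Reject out-of-range scores instead of silently routing on them.
        for name in ("budget", "authority", "need", "timeline"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def total(self) -> int:
        return self.budget + self.authority + self.need + self.timeline

    @property
    def tier(self) -> str:
        # Thresholds from the prompt: >= 16 hot, 10-15 warm, < 10 nurture.
        if self.total >= 16:
            return "hot"
        if self.total >= 10:
            return "warm"
        return "nurture"
```

With this shape, `route_lead` can branch on `result.tier` and the model only has to produce the four per-dimension scores.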

Key Learnings

  • Personalization at scale is the killer feature. An agent that writes 100 genuinely personalized emails outperforms 1,000 template emails every time.
  • Compliance is non-negotiable. Email agents must respect CAN-SPAM, GDPR, and platform-specific rules. Build compliance checks into the agent, not around it.
  • Measure downstream metrics. Open rates and click rates are vanity metrics. Measure meetings booked and deals closed.

Security Agents

Security is a natural fit for agents because the work is high-volume, time-sensitive, and follows analyzable patterns.

Threat Detection and Incident Response

class SecurityAgent:
    async def triage_alert(self, alert: dict) -> TriageResult:
        # Gather context from multiple sources
        context = await asyncio.gather(
            self.get_asset_info(alert["source_ip"]),
            self.get_recent_alerts(alert["source_ip"], hours=24),
            self.get_threat_intel(alert["indicators"]),
            self.get_user_context(alert.get("user_id")),
        )

        result = await self.agent.run(
            system="""You are a SOC analyst. Triage this security alert.
            Determine: true positive, false positive, or needs investigation.
            If true positive, recommend immediate containment actions.
            Cite specific evidence for your determination.""",
            messages=[{
                "role": "user",
                "content": format_alert_context(alert, context)
            }],
            tools=[
                lookup_ioc_tool,
                check_vuln_db_tool,
                query_siem_tool,
                isolate_host_tool,  # requires human approval
            ],
        )
        return result

Key Learnings

  • False positive reduction is the biggest win. SOC teams are drowning in alerts. An agent that correctly dismisses 60% of false positives saves more analyst time than one that investigates true positives.
  • Never automate containment without human approval. Isolating a host or blocking an IP can cause outages. The agent recommends; the human approves.
  • Speed of enrichment matters most. An analyst who gets pre-enriched context (IP reputation, asset ownership, recent activity) resolves alerts 3-5x faster even without the agent making the decision.
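
The "agent recommends; the human approves" rule can be enforced in code rather than only in a comment next to `isolate_host_tool`. The sketch below wraps a containment action in an approval gate. Everything here is hypothetical: `require_approval`, the `ask_human` callback (which in practice might post to a chat channel or ticketing system), and the placeholder `isolate_host`.

```python
import asyncio
from typing import Awaitable, Callable


def require_approval(
    action: Callable[[str], Awaitable[str]],
    ask_human: Callable[[str], Awaitable[bool]],
) -> Callable[[str], Awaitable[str]]:
    """Wrap a containment action so it only runs after a human says yes."""

    async def gated(target: str) -> str:
        approved = await ask_human(f"Agent requests: {action.__name__}({target!r})")
        if not approved:
            return f"DENIED: {action.__name__} on {target} was not approved"
        return await action(target)

    return gated


async def isolate_host(host: str) -> str:
    # Placeholder for a real EDR or network-control call.
    return f"isolated {host}"


async def demo() -> tuple[str, str]:
    async def approve(_request: str) -> bool:
        return True

    async def reject(_request: str) -> bool:
        return False

    allowed = await require_approval(isolate_host, approve)("10.0.0.5")
    blocked = await require_approval(isolate_host, reject)("10.0.0.5")
    return allowed, blocked
```

Only the gated version would be registered in the agent's tool list, so the model never has a path to the raw action.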

DevOps Agents

DevOps agents handle infrastructure management, deployment automation, and incident triage. They're particularly effective because DevOps work involves many routine, well-documented procedures.

Incident Triage Agent

class IncidentTriageAgent:
    """Runs initial investigation when PagerDuty alerts fire."""

    tools = [
        check_service_health,    # HTTP health check
        query_metrics,           # Prometheus/Datadog query
        read_recent_logs,        # CloudWatch/ELK query
        list_recent_deployments, # GitHub/ArgoCD query
        check_dependencies,      # Upstream service status
        run_diagnostic,          # Pre-approved diagnostic commands
    ]

    async def triage(self, incident: dict) -> TriageReport:
        report = await self.agent.run(
            system="""You are an SRE investigating a production incident.
            Your job is to gather data, identify the likely root cause,
            and recommend remediation steps.

            Check in this order:
            1. Recent deployments (most common cause)
            2. Dependency health
            3. Resource utilization (CPU, memory, disk)
            4. Error logs
            5. Traffic patterns

            DO NOT take any remediation action. Only investigate and report.""",
            messages=[{
                "role": "user",
                "content": f"Incident: {incident['title']}\n"
                          f"Service: {incident['service']}\n"
                          f"Severity: {incident['severity']}\n"
                          f"Alert details: {incident['details']}"
            }],
            tools=self.tools,
        )
        return report

Key Learnings

  • Investigation is safer to automate than remediation. Let the agent pull data from 10 sources in parallel while the on-call engineer is waking up. By the time the human is ready, they have a complete picture.
  • Runbooks are agent food. If you have documented runbooks, they convert almost directly into agent instructions. The agent becomes an automated runbook executor.
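
Pulling data from many sources in parallel, as the first point describes, can be sketched with `asyncio.gather`. The `gather_evidence` helper and the fake checks in `demo` are assumptions for illustration. The detail that matters is that a slow or failing source is recorded in the report instead of aborting the whole investigation.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_evidence(
    checks: dict[str, Callable[[], Awaitable[str]]],
    timeout: float = 5.0,
) -> dict[str, str]:
    """Run all read-only checks concurrently, each with its own timeout."""

    async def run_one(check: Callable[[], Awaitable[str]]) -> str:
        try:
            return await asyncio.wait_for(check(), timeout)
        except asyncio.TimeoutError:
            return "unavailable: timed out"
        except Exception as exc:  # a broken source is itself useful evidence
            return f"unavailable: {exc}"

    names = list(checks)
    results = await asyncio.gather(*(run_one(checks[name]) for name in names))
    return dict(zip(names, results))


async def demo() -> dict[str, str]:
    async def deployments() -> str:
        return "deploy 12 minutes before alert"

    async def logs() -> str:
        raise RuntimeError("log backend down")

    async def metrics() -> str:
        await asyncio.sleep(1)  # simulates a hung metrics backend
        return "cpu normal"

    return await gather_evidence(
        {"deployments": deployments, "logs": logs, "metrics": metrics},
        timeout=0.05,
    )
```

During an incident the monitoring stack is often degraded too, which is why each check gets its own timeout rather than one timeout for the whole batch.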

Research Agents

Research agents automate literature review, competitive analysis, and market research. They shine at synthesizing large volumes of information.

Competitive Analysis Agent

class CompetitiveAnalysisAgent:
    tools = [
        web_search,           # Search the web
        fetch_webpage,        # Read a webpage
        search_news,          # Recent news about a company
        query_crunchbase,     # Funding and company data
        analyze_pricing,      # Extract pricing from a page
    ]

    async def analyze_competitor(self, company: str) -> CompetitorReport:
        return await self.agent.run(
            system="""Research this company thoroughly. Find:
            1. Product offerings and positioning
            2. Pricing model and tiers
            3. Recent funding and financials
            4. Key customers and case studies
            5. Recent product launches and strategy shifts
            6. Strengths and weaknesses vs our product

            Cite your sources. Distinguish facts from inferences.""",
            messages=[{"role": "user", "content": f"Analyze: {company}"}],
            tools=self.tools,
        )

Key Learnings

  • Citation is critical. Research agents that don't cite sources produce unreliable output. Build citation into the output schema.
  • Freshness matters. Research agents need access to current data. A model's training cutoff makes it unreliable for recent events - web search tools are essential.
  • Synthesis > summarization. The real value isn't summarizing five articles - it's synthesizing insights across them that no single article contains.
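
"Build citation into the output schema" could look like the sketch below. The `Finding` type and the `uncited_facts` check are assumptions; the chapter's `CompetitorReport` is not defined, so this shows only the citation-bearing part. The `kind` field reflects the prompt's instruction to distinguish facts from inferences.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    claim: str
    kind: str  # "fact" or "inference"
    sources: list[str] = field(default_factory=list)  # URLs or document titles


def uncited_facts(findings: list[Finding]) -> list[Finding]:
    """Return findings labelled as facts that carry no source, so they can be
    sent back to the agent or flagged for a human reviewer."""
    return [f for f in findings if f.kind == "fact" and not f.sources]
```

Inferences are allowed to be unsourced here, since by definition they go beyond any single source; a stricter variant could require them to list the facts they build on.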

Personal Productivity Agents

These agents manage email, scheduling, and task prioritization. They're compelling but face the hardest trust barrier - they need access to your most personal data.

Email Triage Agent

The most effective pattern is triage, not autonomous response:

class EmailTriageAgent:
    async def process_inbox(self, emails: list) -> list[TriagedEmail]:
        results = []
        for email in emails:
            triage = await self.agent.run(
                system="""Categorize this email and suggest an action:
                - urgent_response: Needs reply within 1 hour
                - normal_response: Needs reply today
                - fyi: Read but no response needed
                - delegate: Forward to someone else (suggest who)
                - archive: Not important

                For urgent and normal, draft a response.""",
                messages=[{"role": "user", "content": format_email(email)}],
            )
            results.append(triage)
        return results

Key Learnings

  • Start with read-only. Let the agent categorize and suggest before you let it send or schedule anything.
  • Privacy is the make-or-break. Users won't adopt an email agent that sends their correspondence to a third-party API. On-device or private cloud processing is often required.

Use Case Comparison

| Use Case | Architecture | Complexity | Time to Value | ROI Potential |
| --- | --- | --- | --- | --- |
| Code review agent | Event-driven, async | Medium | 2-4 weeks | High |
| Customer support | Multi-tier, queue-based | High | 4-8 weeks | Very high |
| Data analysis | Request-response | Medium | 2-3 weeks | High |
| Sales outreach | Batch + real-time | Medium | 3-5 weeks | High |
| Security triage | Event-driven, real-time | High | 6-10 weeks | Very high |
| DevOps/incident | Event-driven, real-time | High | 4-8 weeks | High |
| Research | Batch, async | Low-medium | 1-2 weeks | Medium |
| Personal productivity | Real-time | Medium | 2-4 weeks | Medium |

Lessons Learned Across All Use Cases

After studying dozens of agent deployments across these domains, several patterns emerge:

What Makes Agent Projects Succeed

  1. Narrow scope, deep execution. The most successful agents do one thing really well. The "do everything" agent always disappoints.

  2. Human-in-the-loop from day one. Every successful deployment started with human oversight and gradually loosened it as trust grew. Nobody went from zero to fully autonomous.

  3. Obsessive measurement. Teams that tracked accuracy, cost, and user satisfaction from day one iterated faster and caught regressions early.

  4. Starting with the most painful task. The best ROI comes from automating the work people actively avoid - code reviews, alert triage, email categorization.

  5. Treating prompts like code. Version control, testing, review processes - teams that applied software engineering practices to their prompts built more reliable agents.

What Makes Agent Projects Fail

  1. Boiling the ocean. Trying to automate an entire workflow at once instead of starting with one step.

  2. Ignoring edge cases. The demo handles the happy path beautifully. Production is 80% edge cases.

  3. No evaluation framework. Without a way to measure agent quality, you can't improve it. You're just guessing.

  4. Over-trusting the model. Assuming the LLM will "figure it out" without careful prompt design, guard rails, and validation.

  5. Underestimating integration work. The agent itself is 20% of the effort. Integrating with existing systems, handling auth, managing state - that's the other 80%.
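
A starting point for the evaluation framework mentioned in point 3 can be very small. The sketch below assumes the agent can be wrapped as a function from input text to a label, and that a hand-labelled list of `(input, expected)` cases exists; both are assumptions, and real evaluations often need fuzzier scoring than exact match.

```python
from typing import Callable


def evaluate(
    predict: Callable[[str], str],
    cases: list[tuple[str, str]],
) -> dict:
    """Run predict over (input, expected) pairs; report accuracy and failures."""
    failures = []
    for text, expected in cases:
        actual = predict(text)
        if actual != expected:
            failures.append({"input": text, "expected": expected, "actual": actual})
    total = len(cases)
    accuracy = (total - len(failures)) / total if total else 0.0
    return {"accuracy": accuracy, "failures": failures}
```

Running this on every prompt change, and keeping the failure list, turns "the agent seems worse" into a specific set of regressed cases.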

ROI Framework for Evaluating Agent Projects

Before building an agent, run it through this framework:

ROI Score = (Time Saved × Hourly Cost × Volume) -
            (Build Cost + Running Cost + Maintenance Cost)

Where (all recurring terms measured over the same evaluation period):
  Time Saved = hours saved per task
  Hourly Cost = cost of the person currently doing this work
  Volume = number of tasks in the period
  Build Cost = engineering hours × engineering hourly rate (one-time)
  Running Cost = LLM API costs + infrastructure costs for the period
  Maintenance Cost = ongoing prompt tuning + monitoring + updates for the period

Example: Customer support ticket triage

  • Current: 3 support agents each spend 2 hours/day triaging (6 person-hours/day)
  • Agent handles 70% of triage automatically
  • Time saved: 4.2 hours/day = 126 hours/month (at 30 days/month)
  • Cost saved: 126 hours × $20/hour = $2,520/month
  • Agent cost: $400/month (LLM + infra)
  • Net savings: $2,520 - $400 = $2,120/month
  • Build cost: 2 engineers × 4 weeks = ~$40,000 one-time
  • Payback period: $40,000 / $2,120 ≈ 19 months... unless volume grows, which it almost always does
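
The payback arithmetic in the example above can be reproduced with a few lines of code, which also makes it easy to try other volumes. The function name and parameters are assumptions. Maintenance cost is left out because the example does not give a figure for it.

```python
import math


def payback_months(
    hours_saved_per_month: float,
    hourly_cost: float,
    running_cost_per_month: float,
    build_cost: float,
) -> float:
    """Months until cumulative net savings cover the one-time build cost."""
    net_monthly = hours_saved_per_month * hourly_cost - running_cost_per_month
    if net_monthly <= 0:
        return math.inf  # the agent never pays for itself at this volume
    return build_cost / net_monthly


# 3 agents x 2 hours/day x 70% automated x 30 days = 126 hours/month
hours_saved = 3 * 2 * 0.70 * 30
months = payback_months(
    hours_saved, hourly_cost=20, running_cost_per_month=400, build_cost=40_000
)

# Same costs with twice the ticket volume (assumes running cost stays flat,
# which is optimistic since LLM spend usually scales with volume).
months_at_double_volume = payback_months(
    2 * hours_saved, hourly_cost=20, running_cost_per_month=400, build_cost=40_000
)
```

With the example's numbers this gives roughly 19 months. If volume doubled and running cost did not change, payback would fall to under 9 months, which is why volume growth matters so much to the outcome.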

Note

The ROI calculation above is conservative. It doesn't account for improved consistency, 24/7 availability, or the compounding effect of the agent getting better over time as you tune it. Real ROI is typically 2-3x the simple calculation.

The best agent projects aren't the ones with the most impressive technology - they're the ones that solve a real, painful problem for a specific group of users, with a clear way to measure success. Start there, and the technology will follow.