The Future of AI Agents

We've spent this book building agents, deploying them, and studying how they work in practice. Now let's look ahead. Where is this technology going? What should you build today to be ready for what's coming? And what responsibilities come with building systems that act autonomously in the world?

This isn't a chapter of wild speculation. It's grounded in trajectories that are already visible - capabilities that exist in research labs today and will be in your hands within months, not years.

Where We Are Today: The State of the Art in 2025

Let's take stock. As of mid-2025, this is what's real (not hype):

What works well:

  • Single-domain agents that follow well-defined workflows (customer support, code review, data analysis)
  • Agents with 5-15 tool calls per task, completing in under 2 minutes
  • Human-in-the-loop systems where agents draft and humans approve
  • Agents operating on structured data with clear success criteria

What mostly works (with caveats):

  • Multi-step coding agents that write, test, and debug across multiple files
  • Research agents that synthesize information from many sources
  • Agents that maintain context across long conversations (10k+ tokens of history)

What's still unreliable:

  • Fully autonomous agents running for hours without supervision
  • Agents that handle truly novel situations they weren't designed for
  • Multi-agent systems that coordinate on complex, open-ended tasks
  • Agents that reliably know when they don't know something

The gap between "works in a demo" and "works in production" has narrowed significantly over the past year, but it hasn't closed. The most important skill right now is knowing where agents are reliable enough to deploy and where they're not.

Emerging Capabilities

Better Reasoning

The reasoning capabilities of foundation models are improving on a steep curve. Models that struggled with multi-step logical reasoning in 2023 now handle complex chains of inference reliably. What this means for agents:

  • Fewer steps needed. Better reasoning means agents can plan more effectively upfront, reducing the back-and-forth tool-calling loops that eat time and money.
  • Better error recovery. When a tool call fails, a reasoning-capable agent can figure out why and try a different approach, rather than blindly retrying.
  • More complex tasks become feasible. Tasks that required 30+ agent steps (and frequently failed) become achievable in 10-15 steps with higher reliability.
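
The difference between blind retrying and real error recovery can be sketched in a few lines. This is a minimal illustration, not a production pattern: `ToolError`, `run_with_recovery`, and the two tool functions are hypothetical names invented for this example, and a real agent would let the model choose the next approach rather than walk a fixed list.

```python
from typing import Callable


class ToolError(Exception):
    """A failed tool call. `retryable` marks transient failures such as timeouts."""

    def __init__(self, message: str, retryable: bool = False):
        super().__init__(message)
        self.retryable = retryable


def run_with_recovery(strategies: list[Callable[[], str]], max_retries: int = 2) -> str:
    """Try each strategy in order, retrying only when the failure is transient."""
    errors: list[str] = []
    for strategy in strategies:
        for attempt in range(max_retries + 1):
            try:
                return strategy()
            except ToolError as exc:
                errors.append(f"{strategy.__name__} (attempt {attempt + 1}): {exc}")
                if not exc.retryable:
                    break  # same call, same failure: move on to a different approach
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


def query_search_index() -> str:
    # Hypothetical tool that fails permanently because the index does not exist.
    raise ToolError("index not found", retryable=False)


def scan_raw_files() -> str:
    # Hypothetical slower tool that reaches the same data another way.
    return "found 3 matching records"


result = run_with_recovery([query_search_index, scan_raw_files])
print(result)
```

The key design choice is that a non-retryable error ends the inner loop immediately, so the agent spends its budget on a different approach instead of repeating a call that cannot succeed.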

Longer Context Windows

Context windows have grown from 4K tokens in early 2023 to 200K+ tokens today, with some models supporting over 1M tokens. This changes agent architecture fundamentally:

Before (4K context):    Agent remembers last 2-3 turns.
                        Needs RAG for everything.
                        Loses track of complex tasks.

Now (200K+ context):    Agent holds entire codebase in context.
                        Full conversation history always available.
                        Complex multi-step tasks stay coherent.

Coming (1M+ reliable):  Agent processes entire document repositories.
                        RAG no longer needed just to supply basic context.
                        Truly persistent working memory across sessions.

Note

Longer context doesn't eliminate the need for retrieval systems - it changes what they're for. Instead of retrieving basic context the model needs to function, retrieval becomes about finding the right information to focus on. The bottleneck shifts from "can the model see this?" to "should the model pay attention to this?"
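
Here is a rough sketch of retrieval as attention selection: rank candidate chunks by relevance and keep only what fits a budget. Everything in it is an assumption made for illustration. The word-overlap `score` stands in for a real embedding or reranker model, and counting one token per word is a crude estimate.

```python
def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


def select_for_context(query: str, chunks: list[str], token_budget: int) -> list[str]:
    """Pick the most relevant chunks that fit in the budget, best first."""
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    selected: list[str] = []
    used = 0
    for chunk in ranked:
        if score(query, chunk) == 0.0:
            break  # everything after this point is irrelevant to the query
        cost = len(chunk.split())  # crude estimate: one token per word
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


chunks = [
    "refund policy allows returns within 30 days",
    "office lunch menu for friday",
    "refund requests need an order number",
]
focused = select_for_context("refund policy", chunks, token_budget=20)
print(focused)
```

Even when all three chunks would fit in the window, the lunch menu is left out. The model could see it, but it has no reason to pay attention to it.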

Multimodal Agents

Agents that can see, hear, and interact with visual interfaces are moving from research demos to practical tools:

  • Vision + code: An agent that looks at a screenshot of a UI bug, reads the source code, and generates a fix. This already works with vision-capable models such as Claude.
  • Vision + browsing: Agents that navigate websites by looking at them, like a human would, rather than parsing HTML.
  • Audio + action: Agents that join a meeting, listen to the discussion, and create action items, update tickets, or draft follow-up emails.

The multimodal shift matters because it dramatically expands what agents can interact with. Today, agents are limited to systems with APIs. Tomorrow, any software with a screen becomes a tool.

Computer Use Agents

This deserves special attention because it represents a paradigm shift in how agents interact with software.

Traditional agents work through APIs - structured, predictable, fast. Computer use agents interact with software the way humans do: they see the screen, move the mouse, type on the keyboard. This sounds absurd until you realize its implications.

Why it matters:

  1. Universal compatibility. Nearly every piece of end-user software has a GUI, and many have no API. Computer use agents can work with any of them.
  2. No integration work. You don't need to build connectors, parse API docs, or handle authentication flows. The agent uses the software the same way you do.
  3. Legacy system automation. That 20-year-old internal tool with no API? An agent can use it.

Current limitations:

  • Speed: clicking through a UI is 10-100x slower than an API call
  • Reliability: UI elements shift, modals appear, layouts change
  • Security: giving an agent control of your screen is a significant trust decision

# Conceptual sketch: a computer use agent filing an expense report in a web app.
# `agent` is a hypothetical UI-automation interface, not a real library.
async def file_expense_report(agent, expenses: list) -> None:
    # Open the app and start a new report, as a person would
    await agent.navigate("https://expenses.company.com")
    await agent.click("New Report")

    # Add one line item per expense through the on-screen form
    for expense in expenses:
        await agent.fill_field("Description", expense.description)
        await agent.fill_field("Amount", str(expense.amount))
        await agent.select_dropdown("Category", expense.category)
        await agent.upload_file("Receipt", expense.receipt_path)
        await agent.click("Add Item")

    # Submit to a human approver; the agent does not approve its own report
    await agent.click("Submit for Approval")

The trajectory is clear: computer use will get faster, more reliable, and more secure. Within two years, it will be a standard capability that agents use as a fallback when no API is available.
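
The fallback pattern itself is simple to sketch: prefer an API handler when one exists, and drop to the GUI only when it doesn't. The names here (`run_task`, `api_handlers`, `ui_fallback`) are hypothetical, and the lambdas stand in for a real API client and a real computer use agent.

```python
from typing import Callable


def run_task(
    task: str,
    api_handlers: dict[str, Callable[[], str]],
    ui_fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Return (channel, result), preferring the fast and predictable API path."""
    handler = api_handlers.get(task)
    if handler is not None:
        return ("api", handler())
    # No API for this task: fall back to driving the GUI, which is slower
    return ("ui", ui_fallback(task))


api_handlers = {"create_invoice": lambda: "invoice created"}
ui_fallback = lambda task: f"completed {task} through the GUI"

modern = run_task("create_invoice", api_handlers, ui_fallback)
legacy = run_task("export_legacy_report", api_handlers, ui_fallback)
print(modern, legacy)
```

Returning the channel alongside the result makes it easy to log how often the slow path is taken, which tells you which integrations are worth building properly.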

Agent-to-Agent Communication

Today, most agents are isolated - a customer support agent can't ask a billing agent to look something up. That's starting to change.

Standardized Protocols

The industry is converging on standard ways for agents to discover and communicate with each other:

Agent A (Customer Support)
    │
    ├── "I need billing information for customer X"
    │
    ├── Discovery: What agents/tools are available?
    │   └── Registry: billing-agent, inventory-agent, shipping-agent
    │
    ├── Request: POST /agents/billing/query
    │   {
    │       "action": "get_customer_billing",
    │       "customer_id": "X",
    │       "requesting_agent": "support-agent",
    │       "auth_context": { ... }
    │   }
    │
    └── Response: Customer billing summary

The key challenges being worked on:

  • Trust and authorization: Which agents can talk to which? What data can they share?
  • Schema standardization: How do agents describe their capabilities to other agents?
  • Error handling: What happens when Agent B is down or returns garbage?
  • Conversation context: How much context from the original conversation should Agent A share with Agent B?
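
To make these challenges concrete, here is a small in-process sketch of discovery, authorization, and defensive error handling between two agents. It is not any real protocol: `AgentRegistry`, its methods, and the response shape with a `summary` key are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentRegistry:
    """In-process stand-in for a discovery service plus an authorization policy."""

    handlers: dict[str, Callable[[dict], dict]] = field(default_factory=dict)
    allowed: set[tuple[str, str, str]] = field(default_factory=set)

    def register(self, name: str, handler: Callable[[dict], dict]) -> None:
        self.handlers[name] = handler

    def allow(self, caller: str, target: str, action: str) -> None:
        self.allowed.add((caller, target, action))

    def discover(self) -> list[str]:
        return sorted(self.handlers)

    def send(self, caller: str, target: str, action: str, payload: dict) -> dict:
        # Trust and authorization: deny anything not explicitly allowed
        if (caller, target, action) not in self.allowed:
            return {"status": "denied"}
        handler = self.handlers.get(target)
        if handler is None:
            return {"status": "unavailable"}  # target agent is down or unknown
        request = {"action": action, "requesting_agent": caller, **payload}
        try:
            response = handler(request)
        except Exception as exc:
            return {"status": "error", "detail": str(exc)}
        # Error handling: never pass an unvalidated response back to the caller
        if not isinstance(response, dict) or "summary" not in response:
            return {"status": "error", "detail": "malformed response"}
        return {"status": "ok", "summary": response["summary"]}


def billing_agent(request: dict) -> dict:
    return {"summary": f"customer {request['customer_id']}: balance 0.00"}


registry = AgentRegistry()
registry.register("billing-agent", billing_agent)
registry.allow("support-agent", "billing-agent", "get_customer_billing")

reply = registry.send(
    "support-agent", "billing-agent", "get_customer_billing", {"customer_id": "X"}
)
print(reply)
```

Note that the payload carries only the customer ID, not the support conversation. Sharing the minimum context by default is one reasonable answer to the last question in the list.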

Agent Marketplaces

Think app stores, but for agent capabilities. You'll publish an agent that's great at data analysis. Someone else publishes one that's great at visualization. A third party's orchestration agent combines them to create end-to-end reporting.

This is still early, but the economics are compelling. Building everything from scratch is expensive. Composing specialized agents is efficient.

The Model Context Protocol (MCP)

MCP deserves specific attention because it addresses one of the most practical problems in agent development: tool integration.

Today, every agent framework has its own way of defining tools. A tool built for LangChain doesn't work in CrewAI. A tool built for the Anthropic API doesn't work with OpenAI's function calling. This fragmentation wastes enormous amounts of developer effort.

MCP standardizes the interface between agents and tools:

┌─────────────────┐     MCP Protocol     ┌─────────────────┐
│                 │◄────────────────────►│                 │
│   Agent Host    │  • Tool discovery    │   MCP Server    │
│   (any client)  │  • Tool invocation   │   (any tool)    │
│                 │  • Resource access   │                 │
└─────────────────┘  • Prompt templates  └─────────────────┘

What MCP provides:

Capability          Description
Tool Discovery      Agent asks "what tools do you have?" and gets structured schemas
Tool Invocation     Standard request/response format for calling any tool
Resource Access     Standardized way to expose data (files, DB records, API data)
Prompt Templates    Reusable prompt patterns that tools can provide to agents
Sampling            Tools can request LLM completions from the host

The practical impact: you build a tool server once, and it works with Claude, GPT, Gemini, open source models - any agent that speaks MCP. Similarly, you build an agent client once, and it can use any MCP-compatible tool.
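
It helps to see what discovery and invocation look like on the wire. As I understand the specification, MCP messages are JSON-RPC 2.0, with `tools/list` and `tools/call` as the method names and `inputSchema` as the field that carries a tool's JSON Schema. The helper functions and the example tool below are my own illustration, so check the current spec before relying on the exact shapes.

```python
import json


def list_tools_request(request_id: int) -> dict:
    """Discovery: ask the server which tools it exposes."""
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}


def call_tool_request(request_id: int, name: str, arguments: dict) -> dict:
    """Invocation: call one tool by name with JSON arguments."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Illustrative entry a server might return in response to tools/list
example_tool = {
    "name": "query_database",
    "description": "Execute a read-only SQL query",
    "inputSchema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}

call = call_tool_request(2, "query_database", {"sql": "SELECT 1"})
print(json.dumps(call, indent=2))
```

Because the schema travels with the tool, a client needs no tool-specific code: it reads `inputSchema`, hands it to the model, and forwards whatever arguments the model produces.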

# MCP server example - exposing a database as a set of tools.
# Uses FastMCP from the official MCP Python SDK. `db` and `format_results`
# are placeholders for your own database client and result formatter.
import json

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("database-tools")

@mcp.tool()
async def query_database(sql: str) -> str:
    """Execute a read-only SQL query against the analytics database.

    Args:
        sql: A SELECT query to run. No mutations allowed.
    """
    # Coarse first check only: a prefix test does not stop stacked statements
    # such as "SELECT 1; DROP TABLE t". Also connect with a read-only role.
    if not sql.strip().upper().startswith("SELECT"):
        raise ValueError("Only SELECT queries are allowed")

    result = await db.execute(sql)
    return format_results(result)

@mcp.tool()
async def list_tables() -> str:
    """List all available tables and their schemas."""
    tables = await db.get_table_schemas()
    return json.dumps(tables, indent=2)

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default

MCP is already supported by Claude, and adoption is growing rapidly. If you're building tools for agents, building them as MCP servers is the forward-looking choice.

Autonomous Coding: From Copilot to Colleague

The progression in coding agents tells a larger story about how agent autonomy increases:

2022: Autocomplete      → Finish the line I'm typing
2023: Copilot           → Suggest the next block of code
2024: Agent-assisted    → Write this function, I'll review
2025: Semi-autonomous   → Implement this feature across files, tests included
2026+: Autonomous       → Take this issue, ship the PR, I'll merge if it looks good

We're currently in the "semi-autonomous" phase. Tools like Claude Code can make complex multi-file changes, write tests, and debug issues - but they work best with a human guiding the high-level strategy and reviewing the output.

The next phase - fully autonomous software engineering - requires solving several hard problems:

  1. Specification understanding. Correctly interpreting what a vague issue description actually means in the context of a specific codebase.
  2. Architecture decisions. Making good design choices, not just implementing the obvious solution.
  3. Testing adequacy. Knowing when tests are sufficient, not just that they pass.
  4. Failure recovery. When the approach isn't working, recognizing it and trying something fundamentally different.

Tip

Regardless of how autonomous coding agents become, the skill of precisely specifying what you want - writing clear requirements, providing good examples, defining acceptance criteria - will become more valuable, not less. The bottleneck shifts from "can code" to "can specify."

Agent Operating Systems

Here's a concept that's gaining traction: what if an agent wasn't something you invoked for a specific task, but something that ran continuously in the background, managing your digital life?

Imagine an always-on agent that:

  • Monitors your email, Slack, and calendar
  • Notices when a meeting has no agenda and drafts one from recent context
  • Detects when a deadline is at risk and proactively alerts the team
  • Keeps your task list updated based on what happened in meetings
  • Identifies information you need before you know you need it

This isn't science fiction - the individual capabilities exist today. The challenges are:

  1. Cost: An always-on agent making continuous LLM calls is expensive. This requires efficient model routing and smart triggering (only invoke the expensive model when something important happens).
  2. Privacy: This agent sees everything. The security and privacy implications are enormous.
  3. Control: How do you configure preferences for something this broad? How do you correct it when it gets things wrong?
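
The cost point is the easiest of the three to make concrete. Below is a minimal sketch of smart triggering plus model routing: most events never reach a model, and only important ones reach the expensive tier. The keyword list, the tier names, and the per-call prices are all made-up values for illustration, and substring matching is deliberately crude.

```python
URGENT_KEYWORDS = ("outage", "deadline", "urgent", "blocked")


def route(event: dict) -> str:
    """Return the tier that should handle an event."""
    text = event["text"].lower()
    if any(keyword in text for keyword in URGENT_KEYWORDS):
        return "large-model"  # something important happened
    if event.get("mentions_me", False):
        return "small-model"  # needs a response, but a cheap one will do
    return "skip"  # no LLM call at all


def estimated_cost(events: list[dict], price_per_call: dict[str, float]) -> float:
    """Total spend if each event is handled by its routed tier."""
    return sum(price_per_call[route(event)] for event in events)


events = [
    {"text": "Lunch is in the kitchen"},
    {"text": "Can you review my doc?", "mentions_me": True},
    {"text": "Deadline for the launch slipped, we are blocked"},
]
# Hypothetical prices per call, in dollars
prices = {"skip": 0.0, "small-model": 0.001, "large-model": 0.02}

tiers = [route(event) for event in events]
print(tiers, estimated_cost(events, prices))
```

Sending every event to the large model would cost 0.06 under these assumed prices, against roughly 0.021 with routing. The gap grows with volume because most of what an always-on agent sees is noise.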

The first practical implementations will be domain-specific. An "agent OS" for development that monitors your IDE, Git, CI/CD, and project management tools. Or one for sales that monitors CRM, email, and call transcripts. General-purpose agent OSes are further out.

The Trust Problem

The biggest barrier to agent adoption isn't technical - it's trust. And trust is earned, not declared.

How Trust Develops

Phase 1 - Visibility:    "Show me what you would do"
Phase 2 - Suggestion:    "Here's what I recommend, approve?"
Phase 3 - Supervised:    "I'll do it, but you can see everything and override"
Phase 4 - Autonomous:    "I'll handle it, you'll see the results"
Phase 5 - Trusted:       "I've been handling this for months, you stopped checking"

Most organizations are between phases 2 and 3 for most use cases. Moving to phase 4 requires:

  • Audit trails: Every action the agent takes is logged and reviewable.
  • Bounded autonomy: The agent can do X and Y but never Z without asking.
  • Track record: Demonstrable history of correct decisions over weeks/months.
  • Reversibility: Actions can be undone if the agent makes a mistake.
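
Three of these requirements (audit trails, bounded autonomy, reversibility) fit naturally into a single gate that every action passes through. The sketch below is one possible shape, with a hypothetical `BoundedAgent` class and toy actions. A real system would persist the log and route `needs_approval` results to a human.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BoundedAgent:
    allowed_actions: set[str]
    audit_log: list[tuple[str, str]] = field(default_factory=list)
    undo_stack: list[tuple[str, Callable[[], None]]] = field(default_factory=list)

    def act(self, action: str, do: Callable[[], None], undo: Callable[[], None]) -> str:
        # Bounded autonomy: anything outside the allow-list waits for a human
        if action not in self.allowed_actions:
            self.audit_log.append((action, "needs_approval"))
            return "needs_approval"
        do()
        self.undo_stack.append((action, undo))  # reversibility
        self.audit_log.append((action, "executed"))  # audit trail
        return "executed"

    def rollback_last(self) -> str:
        """Undo the most recent executed action and record that it was undone."""
        action, undo = self.undo_stack.pop()
        undo()
        self.audit_log.append((action, "rolled_back"))
        return action


labels: list[str] = []
agent = BoundedAgent(allowed_actions={"add_label"})

first = agent.act("add_label", lambda: labels.append("bug"), lambda: labels.remove("bug"))
second = agent.act("delete_repo", lambda: None, lambda: None)
undone = agent.rollback_last()
print(agent.audit_log)
```

The fourth requirement, track record, falls out of the same data: the share of `executed` entries that were never rolled back is a simple first measure of how often the agent gets it right.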

Verification Mechanisms

The field is developing better tools for verifying agent behavior:

  • Constitutional AI: A training approach in which models learn to follow an explicit, written set of principles, so the intended rules are inspectable rather than implicit.
  • Formal verification: Mathematical proofs that an agent won't violate certain constraints (early research, but promising for high-stakes domains).
  • Red teaming: Adversarial testing where humans try to make the agent misbehave.
  • Continuous evaluation: Running eval suites against production traffic to catch quality regressions.
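
Continuous evaluation, the last item above, reduces to a small loop: run a fixed set of cases, compute a pass rate, and compare it with a stored baseline. The sketch below assumes a hypothetical case format (an `input` plus a `check` function) and uses a trivial stand-in for the agent. The 5-point tolerance is an arbitrary choice.

```python
from typing import Callable


def pass_rate(cases: list[dict], agent_fn: Callable[[str], str]) -> float:
    """Fraction of eval cases whose check accepts the agent's output."""
    passed = sum(1 for case in cases if case["check"](agent_fn(case["input"])))
    return passed / len(cases)


def has_regressed(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a regression only when the drop exceeds normal run-to-run noise."""
    return current < baseline - tolerance


def toy_agent(text: str) -> str:
    # Stand-in for a real agent call
    return text.upper()


cases = [
    {"input": "a", "check": lambda out: out == "A"},
    {"input": "b", "check": lambda out: out == "B"},
    {"input": "c", "check": lambda out: out == "x"},  # deliberately failing case
    {"input": "d", "check": lambda out: out == "D"},
]

current = pass_rate(cases, toy_agent)
print(current, has_regressed(current, baseline=0.90))
```

Run on a schedule against sampled production inputs, a check like this catches the quiet regressions that follow a model or prompt change.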

Economic Impact

Let's be direct about what agents mean for work.

Jobs That Will Be Augmented

These roles will use agents as power tools, becoming more productive:

  • Software engineers: Spend less time on boilerplate, more on architecture and design
  • Data analysts: Ask questions in English, get answers in seconds instead of hours
  • Security analysts: Process 10x more alerts with agent-assisted triage
  • Customer support leads: Manage agent fleets instead of agent teams, focus on complex cases
  • Researchers: Synthesize literature in hours instead of weeks

Jobs That Will Be Substantially Automated

Some roles will see a significant reduction in headcount:

  • Tier-1 customer support: 40-60% reduction within 3 years for text-based support
  • Manual data entry and processing: 70-80% reduction as agents process documents
  • Basic code generation: Junior-level CRUD development becomes agent territory
  • Report generation: Routine reporting becomes fully automated

The New Roles

Agent deployment creates new jobs that don't exist today:

  • Agent architects: Design multi-agent systems and their interaction patterns
  • Prompt engineers / agent behavior designers: Shape how agents communicate and decide
  • Agent operations / reliability engineers: Monitor, tune, and fix production agents
  • AI safety engineers: Ensure agents behave within defined boundaries

Warning

The economic transition will be uneven. Organizations that adopt agents early will gain significant competitive advantages. Those that wait won't have time to catch up gradually - the gap compounds.

Open Source vs. Proprietary

The agent ecosystem is splitting along familiar lines:

Aspect        Open Source                          Proprietary
Models        Llama, Mistral, Qwen                 Claude, GPT, Gemini
Frameworks    LangChain, CrewAI, AutoGen           Proprietary internal frameworks
Tools         MCP servers, community tools         Vendor-specific integrations
Deployment    Self-hosted, full control            Managed services, less ops burden
Cost          Lower per-token, higher ops cost     Higher per-token, lower ops cost
Capability    Closing the gap, but still behind    Leading edge
Privacy       Full data control                    Depends on vendor and deployment

My take: the winning strategy is to build on open standards (MCP, OpenTelemetry) while being model-agnostic. Use the best model for each task today, and be ready to swap when something better comes along. The frameworks and protocols will outlast any specific model.
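
Being model-agnostic mostly comes down to one habit: application code depends on a small interface, never on a vendor SDK directly. Here is a minimal sketch using `typing.Protocol`. The `ChatModel` interface and both model classes are invented for this example; in practice each would be a thin adapter around a real provider client.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface the rest of the application is allowed to use."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    # Stand-in for an adapter around one provider's SDK
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class ShoutModel:
    # Stand-in for an adapter around a different provider's SDK
    def complete(self, prompt: str) -> str:
        return prompt.upper()


class Summarizer:
    def __init__(self, model: ChatModel) -> None:
        self.model = model

    def summarize(self, text: str) -> str:
        return self.model.complete(f"Summarize: {text}")


first = Summarizer(EchoModel()).summarize("q3 report")
second = Summarizer(ShoutModel()).summarize("q3 report")
print(first, second)
```

Swapping models is then a one-line change at construction time, and the same eval suite can be pointed at each candidate before you switch.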

What You Should Build Today

If you're reading this book and wondering "where do I start?", here's a practical roadmap:

Right now (this month):

  1. Build a single-purpose agent for your most painful repetitive task
  2. Set up proper evaluation metrics from day one
  3. Learn one agent framework deeply (I recommend starting with the Anthropic SDK)

Next quarter:

  1. Deploy that agent to production with proper observability
  2. Build a second agent, and integrate the two if it makes sense
  3. Implement MCP for your tool interfaces

This year:

  1. Develop an internal platform for agent development at your organization
  2. Build shared tool libraries that any agent can use
  3. Start experimenting with more autonomous agent workflows

What NOT to build:

  • Don't build a "general purpose AI assistant" - the frontier models already do that
  • Don't build your own agent framework unless your needs are genuinely unique
  • Don't build agents for problems that aren't painful enough to justify the maintenance cost

The Ethical Dimension

Building systems that act autonomously in the world comes with responsibilities that go beyond technical correctness.

Transparency

Users should always know when they're interacting with an agent, not a human. This isn't just ethics - in many jurisdictions, it's law. Beyond disclosure, agents should be able to explain their reasoning when asked.

Accountability

When an agent makes a mistake - sends a bad email, writes incorrect code, gives wrong advice - who is responsible? The answer must be clear before you deploy:

  • The organization deploying the agent is responsible for its behavior
  • "The AI did it" is never an acceptable explanation to a customer
  • Every autonomous action should have a clear accountability chain

Bias and Fairness

Agents inherit the biases of their training data and the systems they interact with. A hiring agent trained on historical hiring data will perpetuate historical biases. A lending agent using biased credit models will make biased decisions. Testing for bias isn't optional - it's a core part of agent evaluation.
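
One simple bias check to start with is comparing selection rates across groups. The sketch below computes each group's approval rate and the ratio of the lowest to the highest. A ratio under 0.8 is a common rule-of-thumb flag for adverse impact (the "four-fifths" heuristic). The data is invented, and a single metric like this is a starting point rather than a full fairness audit.

```python
from collections import defaultdict


def selection_rates(decisions: list[tuple[str, bool]]) -> dict[str, float]:
    """Approval rate per group from (group, approved) pairs."""
    totals: dict[str, int] = defaultdict(int)
    approved_counts: dict[str, int] = defaultdict(int)
    for group, approved in decisions:
        totals[group] += 1
        approved_counts[group] += int(approved)
    return {group: approved_counts[group] / totals[group] for group in totals}


def impact_ratio(rates: dict[str, float]) -> float:
    """Lowest group rate divided by highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


# Invented agent decisions: group A approved 4 of 5, group B approved 2 of 5
decisions = [("A", True)] * 4 + [("A", False)] + [("B", True)] * 2 + [("B", False)] * 3

rates = selection_rates(decisions)
ratio = impact_ratio(rates)
print(rates, ratio, "flag for review" if ratio < 0.8 else "ok")
```

A check like this belongs in the same eval suite as your accuracy metrics, so that a fairness regression blocks a release the same way a quality regression does.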

Data Privacy

Agents process sensitive data - conversations, personal information, business secrets. The principles are clear even if the implementation is complex:

  • Minimize data collection: don't store what you don't need
  • Encrypt data at rest and in transit
  • Respect data residency requirements
  • Give users control over their data
  • Be transparent about what data the agent accesses and retains
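
Data minimization is easiest to enforce at the point where the agent writes to storage. Here is a minimal sketch that keeps only an allow-list of fields and masks obvious identifiers in free text. The two regular expressions are deliberately simple assumptions and will miss many real-world formats. Treat them as a placeholder for a proper PII detection step.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# 13 to 16 digits, optionally separated by single spaces or hyphens
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(text: str) -> str:
    """Mask email addresses and card-like numbers in free text."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)


def minimize(record: dict, keep: set[str]) -> dict:
    """Drop fields that are not needed and redact the string fields that remain."""
    return {
        key: redact(value) if isinstance(value, str) else value
        for key, value in record.items()
        if key in keep
    }


raw = {"user_id": 7, "message": "mail me at a@b.co", "ip": "10.0.0.1"}
stored = minimize(raw, keep={"user_id", "message"})
print(stored)
print(redact("Contact jane.doe@example.com about card 4111 1111 1111 1111"))
```

Using an allow-list rather than a block-list means a newly added field is excluded from storage until someone decides it is needed.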

Predictions for 2026 and Beyond

Based on current trajectories, here's what I expect:

By end of 2026:

  • At least 50% of software companies will have agents in production
  • MCP or a similar protocol will become the dominant standard for agent-tool interaction
  • Computer use agents will be reliable enough for routine business processes
  • The cost of running an agent will drop by 5-10x from today's levels
  • Multi-agent systems will work reliably for well-defined workflows

By 2028:

  • Agents will write and maintain 30-40% of all new code, with human oversight
  • Most knowledge workers will have at least one always-on agent assistant
  • Agent-to-agent commerce will emerge (your purchasing agent negotiates with a vendor's sales agent)
  • The legal and regulatory framework for autonomous agents will take shape
  • Fully autonomous agents will handle end-to-end processes in narrow domains

The big uncertainty: How fast will models improve at genuine reasoning and planning? This is the rate-limiting factor for agent capability. If reasoning takes another leap comparable to what we saw between GPT-3 and Claude 3.5, many of the "2028" predictions will arrive in 2026.

Final Thoughts: Building Responsibly

We're at the beginning of the agent era. The tools are new, the patterns are still being established, and the potential impact - both positive and negative - is enormous.

If you've read this far, you have the technical knowledge to build agents that work. That's the easy part. The harder part is building agents that should be built - systems that genuinely help people, that fail gracefully when they're wrong, and that respect the autonomy and privacy of the humans they serve.

Here's what I've learned from building and deploying agents across a variety of domains:

  1. Start small, learn fast. Your first agent should be embarrassingly simple. Your tenth will be impressively capable. You can't skip the steps in between.

  2. Measure everything. If you can't measure whether your agent is helping, you don't know whether it's helping. Intuition is not sufficient at scale.

  3. Keep humans in the loop. Not because the technology requires it (increasingly, it doesn't), but because trust is built gradually and mistakes are inevitable.

  4. Build on open standards. The specific models and frameworks will change. The protocols, patterns, and evaluation methodologies will endure.

  5. Take responsibility. When your agent acts in the world, its actions are your actions. Build with the same care you'd give to any system that affects people's lives.

The future of AI agents isn't something that will happen to us. It's something we're building, right now, with every design decision, every deployment, and every line of code. Make it count.

Resources for Continued Learning

Frameworks and Tools

Resource                  Description                                            URL
Anthropic SDK             Official Python/TypeScript SDK for Claude              github.com/anthropics/anthropic-sdk-python
Model Context Protocol    Open standard for agent-tool communication             modelcontextprotocol.io
LangChain                 Popular agent framework with extensive integrations    langchain.com
LangSmith                 Observability and evaluation platform for agents       smith.langchain.com
CrewAI                    Multi-agent orchestration framework                    crewai.com
Weights & Biases Weave    LLM monitoring and evaluation                          wandb.ai/weave

Papers and Research

  • "Toolformer" (Schick et al., 2023): Foundational paper on teaching LLMs to use tools
  • "ReAct" (Yao et al., 2022): The reasoning-and-acting framework that underpins most agents
  • "Reflexion" (Shinn et al., 2023): Agents that learn from their mistakes through self-reflection
  • "Voyager" (Wang et al., 2023): Continuous learning agent in open-ended environments
  • "SWE-bench" (Jimenez et al., 2024): Benchmark for evaluating coding agents on real software engineering tasks

Communities

  • Anthropic Discord: Active community of Claude developers and agent builders
  • LangChain Discord: Largest agent developer community
  • r/LocalLLaMA: Open source model community with strong agent focus
  • AI Engineer Summit: Annual conference focused on building with AI (not just research)
  • Latent Space podcast: Excellent coverage of agent development trends

Staying Current

The field moves fast. Here's how to keep up without drowning:

  1. Follow the benchmarks: SWE-bench, GAIA, and AgentBench track agent capabilities over time
  2. Read the changelogs: Model providers publish capability updates - these directly affect what your agents can do
  3. Build side projects: Nothing teaches like building. Pick a small problem, build an agent for it, deploy it, and learn from the real-world behavior
  4. Join a community: The collective knowledge of agent builders is the most valuable resource available. Ask questions, share what you learn, help others

This book is a snapshot in time. The technology will evolve. The principles - measure, iterate, keep humans informed, build with care - will not. Go build something useful.