Multi-Agent Systems

There's a moment in every agent project where you stare at your single-agent implementation and realize it's doing too much. It's researching, writing, fact-checking, formatting, and somehow also trying to manage its own workflow. The result? Mediocre output across the board. Your agent has become the developer who insists on doing frontend, backend, DevOps, and QA all by themselves - technically possible, but nobody's getting their best work.

This is where multi-agent systems come in. Instead of one overloaded agent, you build a team of specialized agents that collaborate, argue, and hold each other accountable. This chapter walks you through the when, why, and how of multi-agent architectures - with enough practical detail to actually build one.

When One Agent Isn't Enough

Before you start splitting your agent into a committee, let's be honest: most tasks don't need multiple agents. A single well-designed agent with good tools can handle an enormous range of work. Multi-agent systems add coordination overhead, debugging complexity, and cost.

So when do you need them?

| Signal | Example | Why Multi-Agent Helps |
| --- | --- | --- |
| Conflicting expertise | A task needs both deep legal knowledge and creative writing | Specialized system prompts and tool sets per agent |
| Quality through verification | High-stakes outputs that need fact-checking | A separate agent can verify without anchoring bias |
| Pipeline workflows | Content goes through research → drafting → editing stages | Each stage has different requirements and evaluation criteria |
| Parallel subtasks | Analyzing 50 documents simultaneously | Fan-out to worker agents, fan-in results |
| Adversarial robustness | Security analysis that benefits from red-team/blue-team thinking | Agents deliberately challenge each other's conclusions |

Tip

A good heuristic: if you find yourself writing a system prompt longer than 800 words with multiple "modes" or "personas," you're probably cramming multiple agents into one. Split them.

Multi-Agent Architectures

There's no single "right" architecture for multi-agent systems. The choice depends on your workflow's shape, your latency tolerance, and how much coordination you need.

Orchestrator-Worker

The most common pattern. One agent (the orchestrator) decomposes tasks and delegates to specialized worker agents. Think of it as a project manager assigning tickets.

         ┌──────────────┐
         │ Orchestrator │
         └──────┬───────┘
        ┌───────┼───────┐
        ▼       ▼       ▼
   [Researcher] [Writer] [Editor]

Best for: Well-defined pipelines, tasks that decompose cleanly into subtasks.

Peer-to-Peer

Agents communicate directly with each other without a central coordinator. Each agent decides when to pass work to another agent.

Best for: Flexible, exploratory workflows where the sequence isn't predetermined.

Risk: Without coordination, agents can enter infinite loops or duplicate work. You need clear termination conditions.
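To make the termination condition concrete, here is a minimal sketch of peer-to-peer handoffs with a hop limit. Everything in it is an assumption for illustration: the `run_peer_to_peer` helper, the `max_hops` parameter, and the `drafter` and `critic` peers are hypothetical, and the peers are plain functions standing in for LLM-backed agents.

```python
from typing import Callable, Optional

# A peer takes the current work item and returns (updated_work, next_peer_name).
# Returning None as the next peer means "I'm done, stop here".
Peer = Callable[[str], tuple[str, Optional[str]]]

def run_peer_to_peer(peers: dict[str, Peer], start: str, work: str,
                     max_hops: int = 6) -> tuple[str, list[str]]:
    """Pass work between peers until one stops or the hop limit is reached."""
    trail: list[str] = []
    current: Optional[str] = start
    while current is not None and len(trail) < max_hops:
        trail.append(current)
        work, current = peers[current](work)
    return work, trail

# Stand-in peers: plain functions instead of LLM-backed agents.
def drafter(work: str) -> tuple[str, Optional[str]]:
    return work + " [drafted]", "critic"

def critic(work: str) -> tuple[str, Optional[str]]:
    # Hand back to the drafter until the draft has been revised twice.
    if work.count("[drafted]") < 2:
        return work + " [needs revision]", "drafter"
    return work + " [approved]", None

final, trail = run_peer_to_peer({"drafter": drafter, "critic": critic},
                                start="drafter", work="outline")
```

The hop limit is the safety net: even if two peers keep handing work back and forth, the loop stops after `max_hops` handoffs.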

Hierarchical

Multiple layers of orchestration. A top-level orchestrator delegates to mid-level orchestrators, which manage their own teams of workers. This is the corporate org chart of agent architectures.

Best for: Very complex tasks with natural groupings - like building an entire application where you have separate teams for frontend, backend, and testing.
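The nesting can be sketched with a small recursive structure. This is an illustration under simplifying assumptions, not a full implementation: the `Team` class is hypothetical, workers are plain functions rather than LLM calls, and every member receives the same task, whereas a real mid-level orchestrator would decompose it first.

```python
from dataclasses import dataclass, field
from typing import Callable, Union

Worker = Callable[[str], str]  # stand-in for an LLM-backed specialist

@dataclass
class Team:
    """An orchestrator node whose members are workers or nested teams."""
    name: str
    members: dict[str, Union["Team", Worker]] = field(default_factory=dict)

    def run(self, task: str) -> dict:
        """Delegate the task to every member and collect a nested report."""
        report = {}
        for member_name, member in self.members.items():
            if isinstance(member, Team):
                report[member_name] = member.run(task)  # mid-level orchestrator
            else:
                report[member_name] = member(task)      # leaf worker
        return report

frontend = Team("frontend", {"ui": lambda t: f"UI plan for {t}"})
backend = Team("backend", {
    "api": lambda t: f"API plan for {t}",
    "db": lambda t: f"Schema for {t}",
})
app = Team("app", {"frontend": frontend, "backend": backend})
report = app.run("todo app")
```

The report mirrors the org chart: one nested dictionary per team, one string per worker.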

Blackboard

All agents read from and write to a shared state (the "blackboard"). Any agent can pick up work when it sees something relevant to its expertise. This is event-driven and decoupled.

Best for: Problems where contributions from different specialists need to be integrated iteratively, like collaborative diagnosis.
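A rough sketch of the blackboard loop follows. The names (`run_blackboard`, `symptom_reader`, `plan_tests`) are hypothetical, the specialists are plain functions rather than LLM-backed agents, and the loop polls each specialist in turn, which is a simplification of a truly event-driven design.

```python
from typing import Callable, Optional

Blackboard = dict[str, str]
# A specialist inspects the board and returns (key, value) to post, or None to pass.
Specialist = Callable[[Blackboard], Optional[tuple[str, str]]]

def run_blackboard(board: Blackboard, specialists: list[Specialist],
                   max_rounds: int = 10) -> Blackboard:
    """Let specialists contribute until a full round adds nothing new."""
    for _ in range(max_rounds):
        progressed = False
        for specialist in specialists:
            contribution = specialist(board)
            if contribution is not None:
                key, value = contribution
                board[key] = value
                progressed = True
        if not progressed:
            break
    return board

def symptom_reader(board: Blackboard) -> Optional[tuple[str, str]]:
    # Acts only when symptoms are posted and no hypothesis exists yet.
    if "symptoms" in board and "hypothesis" not in board:
        return "hypothesis", f"possible cause of {board['symptoms']}"
    return None

def plan_tests(board: Blackboard) -> Optional[tuple[str, str]]:
    # Acts only once another specialist has posted a hypothesis.
    if "hypothesis" in board and "test_plan" not in board:
        return "test_plan", f"check {board['hypothesis']}"
    return None

# Registration order does not matter: each specialist waits for its trigger.
board = run_blackboard({"symptoms": "slow queries"}, [plan_tests, symptom_reader])
```

Notice that no specialist calls another directly. Each one reacts only to what is on the board, which is what keeps the design decoupled.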

| Architecture | Coordination | Flexibility | Complexity | Best Use Case |
| --- | --- | --- | --- | --- |
| Orchestrator-Worker | Centralized | Low-Medium | Medium | Pipelines, task decomposition |
| Peer-to-Peer | Decentralized | High | High | Exploratory, creative tasks |
| Hierarchical | Multi-level | Medium | Very High | Large, complex projects |
| Blackboard | Shared state | High | Medium-High | Iterative, multi-discipline problems |

Communication Protocols

Agents need to talk to each other. How they do it matters more than you'd think.

Message Passing

The simplest approach. Agents send structured messages to each other, typically through the orchestrator.

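from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    content: str
    message_type: str  # "task", "result", "feedback", "error"
    # default_factory gives every message its own dict; a None default
    # would contradict the dict type hint and break code that reads metadata
    metadata: dict[str, Any] = field(default_factory=dict)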

    def to_prompt_context(self) -> str:
        return f"[Message from {self.sender}]: {self.content}"

Shared State

All agents read from and write to a common state object. This is simpler to implement but requires careful management to avoid conflicts.

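from typing import Any

class SharedState:
    """Key-value state shared by all agents, with an audit trail of writes."""

    def __init__(self):
        self._state = {}
        self._history = []  # one entry per update, oldest first

    def update(self, agent_name: str, key: str, value: Any):
        # Record who changed what before overwriting, so conflicts can be traced
        self._history.append({
            "agent": agent_name,
            "key": key,
            "old_value": self._state.get(key),
            "new_value": value
        })
        self._state[key] = value

    def read(self, key: str) -> Any:
        # Returns None for keys that no agent has written yet
        return self._state.get(key)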
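One way to manage conflicts is optimistic versioning: a write succeeds only if nobody else has written the key since the writer last read it. The sketch below is an assumption about how you might do this, not part of the class above. `VersionedState` and its `expected_version` parameter are hypothetical names, and the lock only matters if agents run in separate threads.

```python
import threading
from typing import Any

class VersionedState:
    """Shared state where writers must name the version they last read."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values: dict[str, Any] = {}
        self._versions: dict[str, int] = {}

    def read(self, key: str) -> tuple[Any, int]:
        """Return (value, version); version 0 means the key was never written."""
        with self._lock:
            return self._values.get(key), self._versions.get(key, 0)

    def write(self, key: str, value: Any, expected_version: int) -> bool:
        """Write only if nobody else has written since expected_version."""
        with self._lock:
            if self._versions.get(key, 0) != expected_version:
                return False  # stale read: caller should re-read and retry
            self._values[key] = value
            self._versions[key] = expected_version + 1
            return True

state = VersionedState()
_, version = state.read("summary")
first = state.write("summary", "draft from researcher", version)  # accepted
second = state.write("summary", "draft from writer", version)     # rejected: stale version
```

The rejected writer learns that its view is out of date and can re-read before deciding whether to overwrite, merge, or back off.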

Event-Driven

Agents subscribe to events and react when something relevant happens. This is the most decoupled approach and works well when you don't know the execution order upfront.
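As a rough illustration, an in-process event bus can be sketched in a few lines. The `EventBus` class, the event names, and the `max_events` cap are all hypothetical, and the handlers are lambdas standing in for agents. A production system would more likely sit on a real message queue.

```python
from collections import defaultdict, deque
from typing import Callable

# A handler receives an event payload and returns follow-up (event_type, payload) pairs.
Handler = Callable[[dict], list[tuple[str, dict]]]

class EventBus:
    """Minimal in-process pub/sub: handlers may emit follow-up events."""

    def __init__(self, max_events: int = 100):
        self._handlers: dict[str, list[Handler]] = defaultdict(list)
        self._max_events = max_events  # guards against runaway event chains
        self.log: list[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        """Process the event and everything it triggers, up to max_events."""
        queue = deque([(event_type, payload)])
        processed = 0
        while queue and processed < self._max_events:
            current_type, current_payload = queue.popleft()
            self.log.append(current_type)
            processed += 1
            for handler in self._handlers[current_type]:
                queue.extend(handler(current_payload))

bus = EventBus()
# The "writer" reacts to finished research; the "reviewer" reacts to finished drafts.
bus.subscribe("research_done",
              lambda p: [("draft_done", {"draft": f"Article using {p['notes']}"})])
bus.subscribe("draft_done",
              lambda p: [("review_done", {"approved": "Article" in p["draft"]})])
bus.publish("research_done", {"notes": "three sources"})
```

Nothing here hardcodes the order research, draft, review. The sequence emerges from which events each agent subscribes to.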

Note

In practice, most production systems use a hybrid: an orchestrator does message passing for the main workflow, with shared state for context that all agents need (like the original user request or accumulated results).

The Orchestrator Pattern in Depth

Let's build a real orchestrator. This isn't a toy example - it's the pattern I've used in production systems.

import openai
import json
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    name: str
    system_prompt: str
    model: str = "gpt-4o"
    tools: list = field(default_factory=list)
    max_tokens: int = 4096

class SpecialistAgent:
    def __init__(self, config: AgentConfig):
        self.config = config
        self.client = openai.OpenAI()

    def run(self, task: str, context: str = "") -> str:
        messages = [
            {"role": "system", "content": self.config.system_prompt},
        ]
        if context:
            messages.append({
                "role": "user",
                "content": f"Context from previous agents:\n{context}"
            })
        messages.append({"role": "user", "content": task})

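        request = {
            "model": self.config.model,
            "messages": messages,
            "max_tokens": self.config.max_tokens,
        }
        # Only send `tools` when the agent has some; an explicit None
        # may be rejected by the API
        if self.config.tools:
            request["tools"] = self.config.tools

        response = self.client.chat.completions.create(**request)
        # `content` is None when the model replies with a tool call
        # (this example does not execute tool calls)
        return response.choices[0].message.content or ""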

class Orchestrator:
    def __init__(self, agents: dict[str, SpecialistAgent]):
        self.agents = agents
        self.client = openai.OpenAI()
        self.execution_log = []

    def plan(self, user_request: str) -> list[dict]:
        """Ask the LLM to decompose the task into agent assignments."""
        agent_descriptions = "\n".join(
            f"- {name}: {agent.config.system_prompt[:100]}..."
            for name, agent in self.agents.items()
        )

        planning_prompt = f"""You are a task orchestrator. Given a user request,
decompose it into sequential steps and assign each step to the most
appropriate specialist agent.

Available agents:
{agent_descriptions}

Respond with a JSON object containing a "steps" array:
{{"steps": [{{"agent": "agent_name", "task": "specific task description"}}]}}

User request: {user_request}"""

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": planning_prompt}],
            response_format={"type": "json_object"}
        )

        result = json.loads(response.choices[0].message.content)
        return result.get("steps", result.get("plan", []))

    def execute(self, user_request: str) -> dict:
        """Plan and execute the multi-agent workflow."""
        steps = self.plan(user_request)
        accumulated_context = f"Original request: {user_request}\n\n"
        results = {}

        for i, step in enumerate(steps):
            agent_name = step["agent"]
            task = step["task"]

            if agent_name not in self.agents:
                self.execution_log.append({
                    "step": i, "error": f"Unknown agent: {agent_name}"
                })
                continue

            agent = self.agents[agent_name]
            result = agent.run(task, context=accumulated_context)

            results[f"step_{i}_{agent_name}"] = result
            accumulated_context += f"\n--- Output from {agent_name} ---\n{result}\n"
            self.execution_log.append({
                "step": i,
                "agent": agent_name,
                "task": task,
                "output_length": len(result)
            })

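        # Use the last successful output; `result` would be undefined (or stale)
        # if the plan was empty or the last step named an unknown agent
        final_output = list(results.values())[-1] if results else ""
        return {
            "results": results,
            "final_output": final_output,
            "execution_log": self.execution_log
        }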

The critical design decision here is accumulated context - each agent receives the outputs from all previous agents. This allows downstream agents to build on (or critique) upstream work.

Warning

Accumulated context grows with every step. For pipelines with many agents, you'll blow through context windows fast. Consider summarizing intermediate outputs or only passing relevant portions to each agent.
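One simple mitigation is to keep the original request and only as many of the most recent outputs as fit a budget. The sketch below is one possible approach under stated assumptions: `trim_context` and `max_chars` are hypothetical names, and character count is used as a crude stand-in for tokens. Summarizing older outputs instead of dropping them would be a natural refinement.

```python
def trim_context(original_request: str, outputs: list[tuple[str, str]],
                 max_chars: int = 8000) -> str:
    """Keep the original request plus the most recent outputs that fit the budget."""
    header = f"Original request: {original_request}\n\n"
    budget = max_chars - len(header)
    kept: list[str] = []
    # Walk backwards so the newest outputs are kept first.
    for agent_name, output in reversed(outputs):
        block = f"\n--- Output from {agent_name} ---\n{output}\n"
        if len(block) > budget:
            break
        kept.append(block)
        budget -= len(block)
    # Restore chronological order before joining.
    return header + "".join(reversed(kept))

outputs = [
    ("researcher", "a" * 50),
    ("writer", "b" * 50),
    ("editor", "c" * 50),
]
context = trim_context("topic", outputs, max_chars=200)
```

With a 200-character budget, the oldest output (the researcher's) is dropped while the writer's and editor's outputs survive in their original order.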

Agent Specialization: Why Focused Agents Win

Favoring narrow agents can seem counterintuitive if you've been impressed by how capable a single GPT-4 call can be. But specialization matters for agents just as it matters for software microservices.

A researcher agent with a system prompt laser-focused on finding facts, citing sources, and flagging uncertainty will consistently outperform a generalist agent asked to "research and then write." Why?

  1. System prompts have limited attention. The more instructions you pack in, the more the model "forgets" or deprioritizes some of them.
  2. Tool sets can be tailored. A researcher needs web search and document retrieval. A writer needs... nothing but a clear brief. Giving both agents all tools leads to confusion.
  3. Evaluation is clearer. You can evaluate a researcher on factual accuracy and a writer on prose quality. Evaluating a generalist on both simultaneously is a mess.

"A multi-agent system is really just the single-responsibility principle applied to AI."

Debate and Consensus: Agents That Argue

One of the most powerful multi-agent patterns is adversarial collaboration - agents that deliberately challenge each other.

def debate_round(proposition: str, agents: list[SpecialistAgent],
                 rounds: int = 3) -> str:
    """Run a structured debate between agents."""
    transcript = f"Topic: {proposition}\n\n"

    for round_num in range(rounds):
        for agent in agents:
            response = agent.run(
                task=f"Round {round_num + 1}: Given the debate so far, "
                     f"present your strongest argument. Challenge weak "
                     f"points from other participants.",
                context=transcript
            )
            transcript += f"\n[{agent.config.name}, Round {round_num + 1}]:\n"
            transcript += f"{response}\n"

    # Final synthesis by a judge agent
    judge = SpecialistAgent(AgentConfig(
        name="judge",
        system_prompt="You are an impartial judge. Synthesize the debate "
                      "into a balanced conclusion, noting where agents "
                      "agreed and where legitimate disagreements remain."
    ))

    return judge.run("Provide your final judgment.", context=transcript)

This pattern is especially useful for:

  • Code review: One agent writes code, another reviews it for bugs and security issues
  • Risk assessment: Optimistic and pessimistic agents present cases
  • Fact verification: A claim agent and a skeptic agent go back and forth

Practical Example: Content Creation Pipeline

Let's wire up a real pipeline that takes a topic and produces a polished article.

# Define the specialist agents
researcher = SpecialistAgent(AgentConfig(
    name="researcher",
    system_prompt="""You are a research specialist. Given a topic:
    1. Identify key facts, statistics, and expert opinions
    2. Cite sources with URLs where possible
    3. Flag any claims you're uncertain about with [UNVERIFIED]
    4. Organize findings into clear sections
    Do NOT write prose. Provide structured research notes."""
))

writer = SpecialistAgent(AgentConfig(
    name="writer",
    system_prompt="""You are a skilled technical writer. Given research notes:
    1. Write an engaging, well-structured article
    2. Use clear examples and analogies
    3. Maintain an authoritative but conversational tone
    4. Include section headers, bullet points, and code examples where relevant
    Write for a technical audience. No fluff."""
))

editor = SpecialistAgent(AgentConfig(
    name="editor",
    system_prompt="""You are a senior editor. Review the article for:
    1. Clarity and readability (fix jargon, improve flow)
    2. Logical structure (does the argument build properly?)
    3. Technical accuracy (flag anything that seems wrong)
    4. Grammar and style consistency
    Return the improved article with your changes applied."""
))

fact_checker = SpecialistAgent(AgentConfig(
    name="fact_checker",
    system_prompt="""You are a fact-checker. Review the article against
    the original research notes:
    1. Verify all claims are supported by the research
    2. Flag any statements that were added without evidence
    3. Check that statistics and quotes are accurately represented
    4. Mark issues as [FACTUAL ERROR], [UNSUPPORTED], or [NEEDS CONTEXT]
    Return a fact-check report AND the corrected article."""
))

# Create and run the pipeline
pipeline = Orchestrator({
    "researcher": researcher,
    "writer": writer,
    "editor": editor,
    "fact_checker": fact_checker
})

result = pipeline.execute(
    "Write a 1500-word article about the environmental impact of "
    "large language model training"
)

This pipeline consistently produces better output than asking a single agent to "research and write an article about X." The researcher finds better facts because that's all it's doing. The writer produces cleaner prose because it has structured input. The editor catches issues the writer was blind to. And the fact-checker keeps everyone honest.

Handling Conflicts Between Agents

When agents disagree, you need a resolution strategy. Here are the three I've found most practical:

1. Orchestrator as tiebreaker. The orchestrator reviews conflicting outputs and makes a decision. Simple but adds latency.

2. Voting. Run multiple agents on the same task and take the majority answer. Works well for factual questions, poorly for creative tasks.

3. Escalation to human. When agents can't agree and the stakes are high, surface the disagreement to a human reviewer with both perspectives summarized.

from collections import Counter
from typing import Union

def resolve_conflict(outputs: list[str],
                     strategy: str = "orchestrator") -> Union[str, dict]:
    """Return the resolved answer, or a review request dict when escalating."""
    if strategy == "vote":
        # Simple majority - works best with structured outputs
        votes = Counter(outputs)
        return votes.most_common(1)[0][0]

    elif strategy == "orchestrator":
        resolver = SpecialistAgent(AgentConfig(
            name="resolver",
            system_prompt="You resolve disagreements between agents. "
                          "Analyze each position, identify the strongest "
                          "arguments, and produce a final answer."
        ))
        conflict_summary = "\n\n".join(
            f"Agent {i+1} says:\n{output}" for i, output in enumerate(outputs)
        )
        return resolver.run("Resolve this disagreement.", context=conflict_summary)

    elif strategy == "escalate":
        # Hand every position to a human reviewer instead of picking one
        return {
            "status": "needs_human_review",
            "conflicting_outputs": outputs
        }

    # Fail loudly instead of silently returning None
    raise ValueError(f"Unknown strategy: {strategy}")
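Voting and escalation combine naturally: accept the majority answer when there is one, and escalate when there isn't. The sketch below shows one way to do that. The `vote_or_escalate` name, the `min_share` threshold, and the normalization step are assumptions, not part of the function above.

```python
from collections import Counter

def vote_or_escalate(outputs: list[str], min_share: float = 0.5) -> dict:
    """Return the majority answer, or flag for human review if there is none."""
    if not outputs:
        raise ValueError("need at least one output to vote on")
    # Normalize case and whitespace so trivially different answers match.
    normalized = [" ".join(o.lower().split()) for o in outputs]
    answer, count = Counter(normalized).most_common(1)[0]
    if count / len(outputs) > min_share:
        return {"status": "resolved", "answer": answer, "votes": count}
    return {"status": "needs_human_review", "conflicting_outputs": outputs}

clear_majority = vote_or_escalate(["Paris", "paris ", "Lyon"])
split_vote = vote_or_escalate(["approve", "reject"])
```

Normalization only helps with short, structured answers. For free-form text, two agents can agree in substance and still never produce matching strings.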

Scaling Multi-Agent Systems

Multi-agent systems are expensive. Every agent call is an LLM invocation, and orchestration adds overhead. Here's how to keep costs manageable:

| Strategy | Implementation | Impact |
| --- | --- | --- |
| Use cheaper models for simple agents | GPT-4o-mini for extraction, GPT-4o for reasoning | 5-10x cost reduction |
| Parallelize independent steps | Use asyncio.gather() for agents that don't depend on each other | 2-4x latency reduction |
| Cache intermediate results | Hash inputs and cache agent outputs | Eliminates redundant calls |
| Short-circuit when possible | Skip the fact-checker if the article is purely opinion | Reduces unnecessary steps |
| Set token budgets per agent | max_tokens=1000 for summarizers, more for writers | Prevents runaway costs |

import asyncio

async def parallel_research(topics: list[str], researcher: SpecialistAgent):
    """Fan out research tasks in parallel."""
    tasks = [
        asyncio.to_thread(researcher.run, topic)
        for topic in topics
    ]
    return await asyncio.gather(*tasks)

Tip

Profile your pipeline before optimizing. I've seen teams spend days parallelizing agents when the real bottleneck was a single slow tool call. Measure first.
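The "hash inputs and cache agent outputs" strategy from the table can be sketched as a thin wrapper around an agent's run function. `AgentCache` is a hypothetical name, the cache lives in memory only, and the example uses a stub function instead of a real LLM call. Caching like this only makes sense when you're happy to reuse an earlier answer for identical input.

```python
import hashlib
from typing import Callable

class AgentCache:
    """Cache agent outputs keyed by a hash of (agent name, task, context)."""

    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def _key(agent_name: str, task: str, context: str) -> str:
        # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
        raw = "\x1f".join([agent_name, task, context])
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def run(self, agent_name: str, run_fn: Callable[[str, str], str],
            task: str, context: str = "") -> str:
        """Return the cached output if present; otherwise call run_fn and store it."""
        key = self._key(agent_name, task, context)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = run_fn(task, context)
        self._store[key] = result
        return result

calls = []

def fake_researcher(task: str, context: str) -> str:
    # Stand-in for an expensive LLM call; records how often it is invoked.
    calls.append(task)
    return f"notes on {task}"

cache = AgentCache()
first = cache.run("researcher", fake_researcher, "LLM energy use")
second = cache.run("researcher", fake_researcher, "LLM energy use")
```

Because `run_fn` takes `(task, context)`, a method with the same shape as `SpecialistAgent.run` could be passed in directly.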

Real-World Multi-Agent Frameworks

You don't have to build everything from scratch. Here's how the major frameworks approach multi-agent:

| Framework | Architecture Style | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- | --- |
| AutoGen (Microsoft) | Conversation-based, agents chat | Flexible conversation patterns, GroupChat | Can be verbose, hard to control flow | Research, exploratory tasks |
| CrewAI | Role-based, sequential/parallel | Simple API, role abstraction | Less flexible for complex graphs | Content pipelines, business workflows |
| LangGraph | Graph-based state machine | Precise flow control, conditional edges | Steeper learning curve | Complex workflows with branching |
| OpenAI Swarm | Lightweight handoffs | Minimal abstraction, easy to understand | Limited built-in coordination | Simple agent-to-agent delegation |
| AutoGen Studio | Visual builder | No-code agent orchestration | Less customizable | Prototyping, non-developers |

My recommendation: start with LangGraph if you need precise control, CrewAI if you want simplicity, and plain code (like the examples in this chapter) if you want to understand what's happening.

Anti-Patterns: What Goes Wrong

The Committee Anti-Pattern. You create 8 agents for a task that one agent could handle. Each agent adds latency, cost, and potential failure points. More agents does not mean better output.

The Telephone Game. Information degrades as it passes through too many agents. By the time the 5th agent sees the original request, it's been summarized and reinterpreted so many times that critical details are lost.

The Infinite Loop. Agent A asks Agent B for clarification. Agent B asks Agent A for more context. Neither has a termination condition. Always set maximum iteration limits.

Over-Specified Orchestration. The orchestrator's plan is so detailed that the specialist agents have no room to apply their expertise. If you're dictating every sentence, you don't need specialists - you need a typist.

import logging

logger = logging.getLogger(__name__)

# Always set a maximum number of steps
MAX_STEPS = 10

for i, step in enumerate(steps):  # `steps` is the plan from Orchestrator.plan()
    if i >= MAX_STEPS:
        logger.warning(f"Hit max steps ({MAX_STEPS}). Forcing completion.")
        break
    # ... execute step

When NOT to Use Multi-Agent

I want to end with the most important advice in this chapter: default to a single agent.

Use a single agent when:

  • The task is well-defined and doesn't require conflicting perspectives
  • Latency matters more than output quality
  • Your budget is tight (multi-agent systems multiply costs)
  • The task is simple enough that one good system prompt covers it
  • You're prototyping and don't yet know the workflow shape

Use multi-agent when:

  • A single agent consistently fails at part of the task
  • You need adversarial verification (security, compliance, fact-checking)
  • The workflow has natural pipeline stages with different requirements
  • You need to parallelize work across many subtasks
  • Quality is more important than speed or cost

"The best multi-agent system is the one you didn't build because a single agent was enough."

Multi-agent systems are a powerful tool, but they're a tool for specific problems. In the next chapters, we'll look at how to give these agents (single or multi) access to external knowledge through RAG, and how to keep them safe with guardrails.