Retrieval-Augmented Generation for Agents

Here's the fundamental problem with LLMs as agent brains: they only know what was in their training data, and that data has a cutoff date. Ask your agent about your company's refund policy, last quarter's metrics, or the contents of an internal design doc, and it will either hallucinate confidently or politely tell you it doesn't know.

Retrieval-Augmented Generation - RAG - solves this by giving your agent a way to look things up before answering. Instead of relying purely on parametric memory (the model's weights), the agent retrieves relevant documents from an external knowledge base and uses them as context. It's the difference between answering from memory and answering with your notes open in front of you.

This chapter covers everything you need to build a RAG-powered agent, from chunking strategies to production monitoring. We'll go deep on the parts that matter and skip the parts that don't.

RAG Fundamentals: A Quick Refresher

The core idea is straightforward:

  1. User asks a question
  2. Retrieve relevant documents from a knowledge base
  3. Augment the prompt with those documents
  4. Generate a response grounded in the retrieved information

What makes RAG different for agents (versus simple chatbots) is that the agent gets to decide when retrieval is necessary, what query to use, and how to combine information from multiple sources. We'll get to agentic RAG later - first, let's build the foundation.

The RAG Pipeline

Every RAG system has the same basic pipeline, whether you're using it for a chatbot or a fully autonomous agent.

Documents → Chunking → Embedding → Indexing → [stored in vector DB]
                                                        │
User Query → Query Embedding ──────────────────→ Retrieval → Reranking → Generation

Let's break each stage down.

Document Ingestion

Before you can retrieve anything, you need to get your documents into the system. This is the unglamorous part that takes 60% of your time.

from pathlib import Path
import fitz  # PyMuPDF for PDFs
from bs4 import BeautifulSoup
import tiktoken

class DocumentLoader:
    """Load documents from various sources."""

    def load_pdf(self, path: str) -> str:
        doc = fitz.open(path)
        return "\n\n".join(page.get_text() for page in doc)

    def load_html(self, path: str) -> str:
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        # Remove non-content elements: scripts, styles, navigation, footers
        for element in soup(["script", "style", "nav", "footer"]):
            element.decompose()
        return soup.get_text(separator="\n", strip=True)

    def load_markdown(self, path: str) -> str:
        return Path(path).read_text()

    def load(self, path: str) -> str:
        suffix = Path(path).suffix.lower()
        loaders = {
            ".pdf": self.load_pdf,
            ".html": self.load_html,
            ".htm": self.load_html,
            ".md": self.load_markdown,
            ".txt": self.load_markdown,  # same loader works
        }
        loader = loaders.get(suffix)
        if not loader:
            raise ValueError(f"Unsupported file type: {suffix}")
        return loader(path)
Warning

PDF extraction is where most RAG quality problems start. Tables get mangled, headers repeat on every page, and multi-column layouts produce gibberish. Always inspect your extracted text before indexing. Tools like unstructured.io or docling handle complex PDFs much better than simple text extraction.
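
One of the problems named above, headers that repeat on every page, can often be reduced with a simple frequency check before indexing. The following is a minimal sketch, not part of the loader above: `strip_repeated_lines` and its `min_fraction` parameter are hypothetical names, the sample pages are made up, and it assumes you already have the text of each page as a separate string.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines that recur on at least `min_fraction` of the pages."""
    counts: Counter[str] = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    # Require at least two occurrences so unique lines are never removed
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}

    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]


# Made-up pages that share a running header
pages = [
    "Acme Handbook\nRefunds are handled by the billing team.",
    "Acme Handbook\nEnterprise plans renew annually.",
    "Acme Handbook\nContact support for exceptions.",
]
cleaned = strip_repeated_lines(pages)
```

This only catches exact repeats. Headers that embed a page number change on every page and would need a looser match, so inspecting the output is still necessary.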

Chunking Strategies

You can't embed entire documents - they're too long, and relevant information gets diluted. You need to split documents into chunks that are small enough to be specific but large enough to be meaningful.

This is the most underrated part of the RAG pipeline. Bad chunking destroys retrieval quality no matter how good your embedding model is.

| Strategy | How It Works | Chunk Size | Best For | Weakness |
|---|---|---|---|---|
| Fixed-size | Split every N tokens with overlap | 256-512 tokens | Simple documents, logs | Cuts mid-sentence, ignores structure |
| Recursive character | Split by paragraphs → sentences → words | Varies | General purpose | Doesn't understand document semantics |
| Semantic | Group sentences by embedding similarity | Varies | Long, topically diverse docs | Slow, requires embedding calls during chunking |
| Document-aware | Split by headers, sections, or page structure | Varies | Structured docs (manuals, APIs) | Requires format-specific parsers |
| Sliding window | Overlapping fixed windows | 256-512 tokens | When context spans chunk boundaries | Redundant storage, retrieval deduplication needed |

My recommendation: start with recursive character splitting at ~500 tokens with 50-token overlap. It's good enough for 80% of use cases. Switch to semantic or document-aware chunking only when retrieval quality plateaus.

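import tiktoken
# In recent LangChain releases this class lives in the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Build the tokenizer once instead of looking it up on every length check
encoding = tiktoken.encoding_for_model("gpt-4o")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # counted in tokens because of length_function below
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=lambda text: len(encoding.encode(text)),
)

# document_text is the string returned by DocumentLoader.load()
chunks = splitter.split_text(document_text)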
Tip

Always store metadata with your chunks: source document, page number, section header, and position within the document. This metadata is invaluable for debugging retrieval issues and for providing citations in agent responses.
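
As a rough illustration of the tip above, here is a sketch of a chunker that keeps provenance next to each chunk. It is a simplification: it splits on whitespace-separated words rather than tokens, and `chunk_with_metadata` and the metadata field names are assumptions for this example, not a fixed schema.

```python
def chunk_with_metadata(text: str, source: str,
                        size: int = 50, overlap: int = 10) -> list[dict]:
    """Split text into overlapping word windows and record where each came from."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")

    words = text.split()
    step = size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + size]
        chunks.append({
            "text": " ".join(window),
            "metadata": {
                "source": source,          # which document the chunk belongs to
                "chunk_index": index,      # position among the document's chunks
                "start_word": start,       # offsets make it easy to find the
                "end_word": start + len(window),  # chunk in the original text
            },
        })
        if start + size >= len(words):
            break  # this window already reached the end of the document
    return chunks


chunks = chunk_with_metadata("word " * 120, source="manual.md")
```

With real documents you would typically add the page number and nearest section header as well, since those are what a citation needs.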

Embedding Models

Embeddings convert text into dense vectors that capture semantic meaning. Similar texts produce similar vectors, which is how retrieval works.
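
To make "similar vectors" concrete, the sketch below computes cosine similarity, the measure used for retrieval later in this chapter. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors standing in for embeddings of three short texts
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
gpu_drivers = [0.0, 0.1, 0.9]

close = cosine_similarity(refund_policy, money_back)   # related topics
far = cosine_similarity(refund_policy, gpu_drivers)    # unrelated topics
```

Here `close` comes out much higher than `far`, which is the property retrieval relies on: the query is embedded the same way and the chunks with the highest scores are returned.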

| Model | Dimensions | Max Tokens | Relative Quality | Speed | Cost |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | Excellent | Fast (API) | $0.13/M tokens |
| OpenAI text-embedding-3-small | 1536 | 8191 | Very Good | Fast (API) | $0.02/M tokens |
| Cohere embed-v3 | 1024 | 512 | Excellent | Fast (API) | $0.10/M tokens |
| Voyage AI voyage-3 | 1024 | 32000 | Excellent | Fast (API) | $0.06/M tokens |
| BGE-large-en-v1.5 | 1024 | 512 | Good | Self-hosted | Free |
| GTE-large | 1024 | 8192 | Very Good | Self-hosted | Free |
| all-MiniLM-L6-v2 | 384 | 256 | Decent | Very Fast | Free |

For most production agents, text-embedding-3-small is the sweet spot - good quality, low cost, and fast. Use text-embedding-3-large when you need maximum retrieval accuracy and cost isn't a primary concern.

from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of texts."""
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

Vector Databases

You need somewhere to store your embeddings and perform fast similarity search. Here's how the major options compare:

| Database | Type | Hosting | Filtering | Scalability | Best For |
|---|---|---|---|---|---|
| Pinecone | Managed | Cloud only | Good metadata filtering | Excellent | Production, minimal ops |
| Weaviate | Open-source/Cloud | Self-hosted or cloud | Rich filtering, BM25 hybrid | Very Good | Hybrid search needs |
| ChromaDB | Open-source | In-process or server | Basic filtering | Limited | Prototyping, small datasets |
| pgvector | PostgreSQL extension | Wherever Postgres runs | Full SQL | Good (with HNSW) | Teams already on Postgres |
| Qdrant | Open-source/Cloud | Self-hosted or cloud | Excellent filtering | Very Good | High-performance, filtering-heavy |
| Milvus | Open-source/Cloud | Self-hosted or cloud | Good filtering | Excellent | Very large scale |

Note

If you're already running PostgreSQL, start with pgvector. It's one fewer service to manage, and for datasets under 5 million vectors with HNSW indexing, performance is competitive with dedicated vector databases.

Building a RAG-Powered Agent from Scratch

Let's put it all together. This agent answers questions using a knowledge base of documents.

import openai
import numpy as np
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    embedding: list[float]
    metadata: dict  # source, page, section, etc.

class RAGAgent:
    def __init__(self, model: str = "gpt-4o"):
        self.client = openai.OpenAI()
        self.model = model
        self.chunks: list[Chunk] = []
        self.system_prompt = """You are a helpful assistant that answers
questions based on the provided context.

Rules:
1. Only answer based on the provided context
2. If the context doesn't contain the answer, say so clearly
3. Cite your sources using [Source: filename, page X] format
4. If you're uncertain, express your uncertainty"""

    def ingest(self, documents: list[dict]):
        """Ingest documents into the knowledge base."""
        from langchain.text_splitter import RecursiveCharacterTextSplitter

        # Without a length_function, chunk_size and chunk_overlap count
        # characters, not tokens
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=500, chunk_overlap=50
        )

        all_texts = []
        all_metadata = []

        for doc in documents:
            chunks = splitter.split_text(doc["content"])
            for i, chunk_text in enumerate(chunks):
                all_texts.append(chunk_text)
                all_metadata.append({
                    "source": doc["source"],
                    "chunk_index": i,
                    **doc.get("metadata", {})
                })

        # Batch embed all chunks
        embeddings = self._embed_batch(all_texts)

        for text, embedding, metadata in zip(
            all_texts, embeddings, all_metadata
        ):
            self.chunks.append(Chunk(
                text=text, embedding=embedding, metadata=metadata
            ))

        print(f"Ingested {len(self.chunks)} chunks from "
              f"{len(documents)} documents")

    def _embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Embed texts in batches of 100."""
        all_embeddings = []
        for i in range(0, len(texts), 100):
            batch = texts[i:i+100]
            response = self.client.embeddings.create(
                input=batch, model="text-embedding-3-small"
            )
            all_embeddings.extend(
                [item.embedding for item in response.data]
            )
        return all_embeddings

    def retrieve(self, query: str, top_k: int = 5) -> list[Chunk]:
        """Retrieve the most relevant chunks for a query."""
        query_embedding = self._embed_batch([query])[0]
        query_vec = np.array(query_embedding)

        scored_chunks = []
        for chunk in self.chunks:
            chunk_vec = np.array(chunk.embedding)
            # Cosine similarity
            similarity = np.dot(query_vec, chunk_vec) / (
                np.linalg.norm(query_vec) * np.linalg.norm(chunk_vec)
            )
            scored_chunks.append((float(similarity), chunk))

        scored_chunks.sort(key=lambda x: x[0], reverse=True)
        top = scored_chunks[:top_k]
        # Record the score for this query so callers can check confidence
        for similarity, chunk in top:
            chunk.metadata["similarity"] = similarity
        return [chunk for _, chunk in top]

    def answer(self, question: str) -> str:
        """Answer a question using RAG."""
        relevant_chunks = self.retrieve(question, top_k=5)

        context = "\n\n---\n\n".join(
            f"[Source: {c.metadata['source']}]\n{c.text}"
            for c in relevant_chunks
        )

        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\n"
                                        f"Question: {question}"}
        ]

        response = self.client.chat.completions.create(
            model=self.model, messages=messages
        )
        return response.choices[0].message.content

# Usage
agent = RAGAgent()
agent.ingest([
    {"source": "refund_policy.pdf", "content": loaded_pdf_text},
    {"source": "product_manual.md", "content": loaded_manual_text},
])
answer = agent.answer("What's the refund window for enterprise plans?")

This is a complete, working RAG agent. In production, you'd swap the in-memory vector storage for a proper vector database, but the architecture is identical.

Advanced RAG Techniques

Basic RAG gets you 70% of the way. These techniques get you the other 30%.

Hybrid Search

Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Combine them.

def hybrid_search(query: str, chunks: list[Chunk],
                  alpha: float = 0.7) -> list[Chunk]:
    """Combine vector similarity with BM25 keyword matching.

    Assumes vector_search() and bm25_search() are helpers defined elsewhere
    that return (score, chunk) pairs.
    """
    def normalize(results):
        # Min-max scale to [0, 1] so cosine and BM25 scores are comparable
        if not results:
            return {}
        scores = [score for score, _ in results]
        low, high = min(scores), max(scores)
        span = (high - low) or 1.0
        return {id(c): (score - low) / span for score, c in results}

    vector_scores = normalize(vector_search(query, chunks))
    bm25_scores = normalize(bm25_search(query, chunks))

    # Weighted fusion; a chunk missing from one result list scores 0 there
    combined = {}
    for chunk_id in vector_scores.keys() | bm25_scores.keys():
        v_score = vector_scores.get(chunk_id, 0.0)
        b_score = bm25_scores.get(chunk_id, 0.0)
        combined[chunk_id] = alpha * v_score + (1 - alpha) * b_score

    # Map ids back to chunks and return them best first
    by_id = {id(c): c for c in chunks}
    ranked = sorted(combined, key=combined.get, reverse=True)
    return [by_id[chunk_id] for chunk_id in ranked]
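
The hybrid_search function above calls a bm25_search helper that this chapter does not define. As a rough idea of what the keyword side computes, here is a self-contained sketch of Okapi BM25 scoring over plain strings. The tokenizer is deliberately naive, the IDF uses the common non-negative variant, and `bm25_scores` is a hypothetical name. In practice you would more likely use your database's built-in keyword search or an existing BM25 library.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document, in input order."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: how many documents contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher weight than common ones
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized via b
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up documents for illustration
docs = [
    "To cancel a subscription, open Billing and choose Cancel plan.",
    "Our GPU drivers are updated monthly.",
    "Subscription invoices are emailed on the first of each month.",
]
scores = bm25_scores("cancel subscription", docs)
```

The first document matches both query terms and scores highest, the third matches one, and the second scores zero. That exact-term behavior is what the keyword half of hybrid search contributes.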

Reranking

Retrieve broadly (top 20-50 results), then rerank with a cross-encoder model that's much more accurate but slower.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[Chunk], top_k: int = 5) -> list[Chunk]:
    pairs = [(query, chunk.text) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

Query Decomposition

Complex questions often need multiple retrievals. Break them down.

import json

def decompose_query(query: str, client: openai.OpenAI) -> list[str]:
    """Break a complex query into sub-queries for better retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Break this question into 2-4 simpler sub-questions
that would help retrieve relevant information:

Question: {query}

Return a JSON object of the form {{"questions": ["...", "..."]}}."""
        }],
        # JSON mode returns an object, so the prompt asks for a named key
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    questions = result.get("questions")
    # Fall back to the original query if the model returned an unexpected shape
    return questions if isinstance(questions, list) and questions else [query]
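
Once a question is decomposed, the results of the individual retrievals still have to be merged. One simple approach, sketched below, keeps each chunk once with its best score across sub-queries. The `retrieve` callable, the chunk ids, and the canned scores are all hypothetical stand-ins for your own retriever.

```python
from typing import Callable


def retrieve_for_subqueries(
    sub_queries: list[str],
    retrieve: Callable[[str], list[tuple[float, str]]],
    top_k: int = 5,
) -> list[tuple[float, str]]:
    """Run every sub-query and keep each chunk once, with its best score."""
    best: dict[str, float] = {}
    for sub_query in sub_queries:
        for score, chunk_id in retrieve(sub_query):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score

    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(score, chunk_id) for chunk_id, score in ranked[:top_k]]


# Hypothetical retriever: canned (score, chunk_id) results per sub-query
canned = {
    "premium plan features": [(0.82, "plans.md#2"), (0.40, "faq.md#7")],
    "premium plan pricing": [(0.91, "pricing.md#1"), (0.55, "plans.md#2")],
}
merged = retrieve_for_subqueries(list(canned), lambda q: canned[q])
```

Taking the maximum is only one option. Summing scores or fusing ranks would favor chunks that several sub-queries agree on.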

Agentic RAG: The Agent Decides

Here's where RAG gets interesting for agents. Instead of retrieving on every query, the agent decides when and how to retrieve. Not every question needs a knowledge base lookup - some can be answered from the conversation context or general knowledge.

# Give the agent retrieval as a tool
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search the internal knowledge base for information. "
                           "Use this when you need specific facts, policies, "
                           "product details, or documentation.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query - be specific and "
                                       "include key terms"
                    },
                    "filter_source": {
                        "type": "string",
                        "description": "Optional: filter by source document name"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

The key insight: the agent crafts its own retrieval queries. A user might ask "Is the premium plan worth it?" - a vague question. The agent reformulates this into specific retrievals: "premium plan features," "premium plan pricing," "premium vs basic comparison."

Tip

Give your agent a search_knowledge_base tool, not an automatic pre-retrieval step. This lets the agent ask follow-up questions, try different search strategies, and skip retrieval entirely when it's not needed. This is the difference between "RAG" and "Agentic RAG."
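
Defining the tool is half the work; the agent loop also has to execute the calls the model requests. The sketch below shows only that dispatch step, under the assumption that your loop hands it the tool name and the JSON argument string from the model's tool call. `dispatch_tool_call` and `fake_search` are hypothetical names, and the search function is a stub.

```python
import json
from typing import Callable, Optional


def dispatch_tool_call(
    name: str,
    arguments_json: str,
    search: Callable[[str, Optional[str]], list[dict]],
) -> str:
    """Run one tool call requested by the model and return a JSON string."""
    if name != "search_knowledge_base":
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        query = args["query"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Report the problem to the model instead of crashing the loop
        return json.dumps({"error": "arguments must be JSON with a 'query' field"})

    results = search(query, args.get("filter_source"))
    return json.dumps({"results": results})


def fake_search(query: str, filter_source: Optional[str] = None) -> list[dict]:
    """Stub standing in for a real knowledge-base search."""
    return [{"source": filter_source or "plans.md",
             "text": f"stub result for {query}"}]


reply = dispatch_tool_call(
    "search_knowledge_base", '{"query": "premium plan pricing"}', fake_search
)
```

In a Chat Completions loop, the returned string would be appended as a message with role "tool" and the matching tool_call_id, after which the model is called again and may either answer or issue another search.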

Evaluating RAG Quality

You can't improve what you don't measure. Here are the metrics that matter:

| Metric | What It Measures | How to Compute |
|---|---|---|
| Faithfulness | Does the answer only use info from retrieved docs? | LLM-as-judge: compare answer claims to source chunks |
| Relevance | Are the retrieved chunks actually relevant? | LLM-as-judge or human annotation of top-k results |
| Answer correctness | Is the answer factually correct? | Compare against ground truth Q&A pairs |
| Completeness | Does the answer cover all aspects of the question? | LLM-as-judge with rubric |
| Retrieval recall | Did the system find all relevant chunks? | Requires labeled relevant chunks per query |
| Latency | How long does retrieval + generation take? | Wall clock time, p50/p95/p99 |

def evaluate_faithfulness(answer: str, context_chunks: list[str],
                          client: openai.OpenAI) -> dict:
    """Check if the answer is grounded in the retrieved context."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Evaluate whether this answer is faithful to the
provided context. Identify any claims NOT supported by the context.

Context:
{chr(10).join(context_chunks)}

Answer:
{answer}

Respond with JSON: {{"score": 0.0-1.0, "unsupported_claims": [...]}}"""
        }],
        response_format={"type": "json_object"}
    )
    import json
    return json.loads(response.choices[0].message.content)
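
Two of the metrics in the table need no LLM at all. The sketch below computes retrieval recall at k from labeled chunk ids and latency percentiles from recorded timings. The function names and the sample ids are assumptions for illustration.

```python
import statistics


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 from wall-clock timings (needs at least two samples)."""
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# One labeled query: three chunks are relevant, one was found in the top 3
recall = recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=3)

# Made-up timings of 1..101 ms
percentiles = latency_percentiles([float(ms) for ms in range(1, 102)])
```

Averaging `recall_at_k` over a labeled query set gives a single number you can track as you change chunking or embedding settings.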

Common RAG Failures and Debugging

When your RAG agent gives bad answers, the problem is almost always in one of these places:

1. Bad chunks. The relevant information exists but was split across two chunks, so neither chunk alone has enough context. Fix: Increase chunk size or overlap.

2. Embedding mismatch. The query and the relevant chunk use different terminology. The user asks about "canceling a subscription" but the docs say "terminating a plan." Fix: Hybrid search (vector + keyword) or query expansion.

3. Retrieved but not used. The right chunk was retrieved but the LLM ignored it in favor of its parametric knowledge. Fix: Stronger system prompt instructions ("ONLY use the provided context") or put the most relevant chunk closest to the question in the prompt.

4. Not retrieved. The knowledge base simply doesn't contain the answer. Fix: Check your ingestion pipeline. Is the document there? Was it parsed correctly?

5. Too much noise. You're retrieving 20 chunks and most are irrelevant, drowning out the good ones. Fix: Reranking, or reduce top_k.

Warning

The single most common RAG failure I see in production: developers never look at their chunks. They ingest documents and assume the chunking worked. Always manually inspect 20-30 random chunks before going to production. You'll be surprised how often tables are garbled, code blocks are split mid-function, or section headers are separated from their content.
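
A small helper makes that manual inspection easier to do routinely. The sketch below samples chunks reproducibly and flags obvious candidates for a closer look. The heuristics and thresholds are assumptions chosen for illustration; they narrow down what to read but do not replace reading it.

```python
import random


def sample_chunks(chunks: list[str], n: int = 25, seed: int = 0) -> list[tuple[int, str]]:
    """Pick up to n chunks at random, returned as (index, text) in document order."""
    rng = random.Random(seed)  # fixed seed so the same review can be repeated
    indices = sorted(rng.sample(range(len(chunks)), k=min(n, len(chunks))))
    return [(i, chunks[i]) for i in indices]


def looks_suspicious(text: str) -> bool:
    """Cheap heuristics that flag a chunk for a human to read."""
    stripped = text.strip()
    if len(stripped) < 40:
        return True  # fragment, e.g. a header split from its content
    wordlike = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return wordlike / len(stripped) < 0.7  # mostly symbols, e.g. a mangled table


all_chunks = [f"chunk number {i}" for i in range(100)]
sample = sample_chunks(all_chunks, n=25)
flagged = [index for index, text in sample if looks_suspicious(text)]
```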

Production RAG: Beyond the Prototype

Caching

Cache at two levels: embedding cache (don't re-embed the same query twice) and result cache (same query, same answer).

import hashlib
import time

class CachedRAG:
    def __init__(self, agent: RAGAgent):
        self.agent = agent
        # In-memory result cache: normalized question hash -> answer + timestamp
        self._query_cache = {}

    def answer(self, question: str, cache_ttl: int = 3600) -> str:
        cache_key = hashlib.sha256(question.strip().lower().encode()).hexdigest()

        if cache_key in self._query_cache:
            cached = self._query_cache[cache_key]
            # Serve the cached answer only while it is younger than cache_ttl seconds
            if time.time() - cached["timestamp"] < cache_ttl:
                return cached["answer"]

        answer = self.agent.answer(question)
        self._query_cache[cache_key] = {
            "answer": answer, "timestamp": time.time()
        }
        return answer
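
The class above covers the result cache. The embedding cache mentioned at the start of this section can be sketched the same way: key each text by a hash that includes the model name, and only send unseen texts to the embedding function. `EmbeddingCache` is a hypothetical name, and `fake_embed` is a stub standing in for a real embedding call.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings by a hash of (model, text)."""

    def __init__(self, embed_fn: Callable[[list[str]], list[list[float]]],
                 model: str):
        self.embed_fn = embed_fn
        self.model = model
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of texts actually sent to embed_fn

    def _key(self, text: str) -> str:
        # Include the model so switching models never returns stale vectors
        return hashlib.sha256(f"{self.model}\x00{text}".encode()).hexdigest()

    def embed(self, texts: list[str]) -> list[list[float]]:
        keys = [self._key(t) for t in texts]
        # Collect unseen texts once each, even if they repeat within the batch
        missing: dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._store:
                missing[key] = text
        if missing:
            vectors = self.embed_fn(list(missing.values()))
            self.misses += len(missing)
            self._store.update(zip(missing.keys(), vectors))
        return [self._store[k] for k in keys]


def fake_embed(texts: list[str]) -> list[list[float]]:
    """Stub embedding function: one number per text, its length."""
    return [[float(len(t))] for t in texts]


cache = EmbeddingCache(fake_embed, model="text-embedding-3-small")
first = cache.embed(["refund window", "refund window", "pricing"])
second = cache.embed(["pricing"])  # served from the cache
```

A production version would bound the cache size or move it to a shared store, which this sketch leaves out.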

Index Maintenance

Your knowledge base isn't static. Documents get updated, new ones are added, old ones are removed. You need a strategy:

  • Incremental updates: Track document hashes. Re-embed only changed documents.
  • Scheduled re-indexing: Full rebuild nightly or weekly, depending on change frequency.
  • Version tracking: Keep the previous index version so you can roll back if a new ingestion introduces problems.
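
The incremental-update strategy comes down to comparing content hashes. Here is a minimal sketch, assuming you persist a mapping of document name to hash alongside the index. `plan_index_update` and the document names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_update(current_docs: dict[str, str],
                      indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current document contents with stored hashes and plan the work."""
    current_hashes = {name: content_hash(body) for name, body in current_docs.items()}
    return {
        # New documents: chunk, embed, and insert
        "add": sorted(n for n in current_hashes if n not in indexed_hashes),
        # Changed documents: delete their old chunks, then re-embed
        "reembed": sorted(n for n, h in current_hashes.items()
                          if n in indexed_hashes and indexed_hashes[n] != h),
        # Documents that disappeared: delete their chunks
        "delete": sorted(n for n in indexed_hashes if n not in current_hashes),
    }


# Made-up state: one changed, one unchanged, one new, one removed document
indexed = {
    "refund_policy.md": content_hash("version 1"),
    "pricing.md": content_hash("unchanged"),
    "old_faq.md": content_hash("retired"),
}
current = {
    "refund_policy.md": "version 2",
    "pricing.md": "unchanged",
    "onboarding.md": "brand new",
}
plan = plan_index_update(current, indexed)
```

Storing the source name in each chunk's metadata is what makes the delete and re-embed steps possible, since you need to find every chunk that came from a given document.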

Monitoring

Log every retrieval. You want to see:

  • What queries are hitting the knowledge base
  • What's being retrieved (and what's not)
  • Similarity scores - are they trending lower? Your index might be stale.
  • Latency breakdowns: embedding time, search time, generation time
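
As one way to capture the items listed above, the sketch below wraps a retrieval call, times it, and writes one structured log line per query. The `retrieve` callable, the logger name, and the record fields are assumptions. The 0.3 low-score threshold mirrors the one used in the case study later in this chapter and should be tuned for your embedding model.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("rag.retrieval")


def retrieval_record(query: str, results: list[tuple[float, str]],
                     elapsed_ms: float, low_score: float = 0.3) -> dict:
    """Summarize one retrieval as a flat, JSON-serializable record."""
    scores = [score for score, _ in results]
    return {
        "query": query,
        "num_results": len(results),
        "top_score": max(scores, default=None),
        "mean_score": sum(scores) / len(scores) if scores else None,
        "sources": [source for _, source in results],
        "search_ms": round(elapsed_ms, 2),
        # No results, or only weak matches, is worth alerting on
        "low_confidence": not scores or max(scores) < low_score,
    }


def logged_retrieve(query: str,
                    retrieve: Callable[[str], list[tuple[float, str]]]
                    ) -> list[tuple[float, str]]:
    """Call the retriever, then log what came back and how long it took."""
    start = time.perf_counter()
    results = retrieve(query)
    elapsed_ms = (time.perf_counter() - start) * 1000

    record = retrieval_record(query, results, elapsed_ms)
    level = logging.WARNING if record["low_confidence"] else logging.INFO
    logger.log(level, json.dumps(record))
    return results


# Stub retriever returning (score, source) pairs
results = logged_retrieve(
    "refund window",
    lambda q: [(0.8, "refund_policy.pdf"), (0.2, "faq.md")],
)
```

Tracking `top_score` and `mean_score` over time is what lets you notice the downward trend that suggests a stale index. Embedding and generation time can be recorded the same way around their own calls.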

Case Study: Customer Support Agent with RAG

Let's bring it together with a real-world scenario. You're building a support agent for a SaaS product with 200 pages of documentation.

class SupportAgent:
    def __init__(self):
        self.rag = RAGAgent(model="gpt-4o")
        self.rag.system_prompt = """You are a customer support agent for
Acme SaaS. Answer questions using ONLY the provided documentation context.

Guidelines:
- Be helpful and empathetic
- If you don't know, say "I don't have information about that in our
  documentation. Let me connect you with a human agent."
- Always cite the relevant doc section
- For billing questions, provide the info but suggest contacting
  billing@acme.com for account-specific changes
- Never make up features or pricing"""

        # Ingest all documentation
        self._load_docs()

    def _load_docs(self):
        loader = DocumentLoader()
        docs = []
        doc_dir = Path("./knowledge_base")
        for file_path in doc_dir.glob("**/*"):
            if file_path.suffix in [".md", ".pdf", ".html"]:
                docs.append({
                    "source": file_path.name,
                    "content": loader.load(str(file_path)),
                    "metadata": {
                        "category": file_path.parent.name,
                        "last_updated": file_path.stat().st_mtime
                    }
                })
        self.rag.ingest(docs)

    def handle_query(self, user_message: str,
                     conversation_history: list | None = None) -> dict:
        """Handle a support query with RAG.

        conversation_history is not used in this example.
        """
        # Retrieve relevant docs
        chunks = self.rag.retrieve(user_message, top_k=5)

        # Check retrieval confidence
        if not chunks or all(
            c.metadata.get("similarity", 0) < 0.3 for c in chunks
        ):
            return {
                "answer": "I couldn't find relevant information in our "
                          "documentation. Let me connect you with a "
                          "human agent who can help.",
                "escalate": True,
                "sources": []
            }

        answer = self.rag.answer(user_message)
        return {
            "answer": answer,
            "escalate": False,
            "sources": [c.metadata["source"] for c in chunks]
        }

This agent handles the happy path (answer from docs), the failure path (escalate to human), and provides source citations for transparency. It's not fancy, but it works - and it's what 90% of production RAG agents look like under the hood.

The key takeaway from this chapter: RAG quality is determined by your ingestion pipeline, not your LLM. Spend 80% of your time on chunking, embedding selection, and retrieval tuning. The generation step is the easy part.