Retrieval-Augmented Generation for Agents
Here's the fundamental problem with LLMs as agent brains: they only know what was in their training data, and that data has a cutoff date. Ask your agent about your company's refund policy, last quarter's metrics, or the contents of an internal design doc, and it will either hallucinate confidently or politely tell you it doesn't know.
Retrieval-Augmented Generation - RAG - solves this by giving your agent a way to look things up before answering. Instead of relying purely on parametric memory (the model's weights), the agent retrieves relevant documents from an external knowledge base and uses them as context. It's the difference between answering from memory and answering with your notes open in front of you.
This chapter covers everything you need to build a RAG-powered agent, from chunking strategies to production monitoring. We'll go deep on the parts that matter and skip the parts that don't.
RAG Fundamentals: A Quick Refresher
The core idea is straightforward:
- User asks a question
- Retrieve relevant documents from a knowledge base
- Augment the prompt with those documents
- Generate a response grounded in the retrieved information
What makes RAG different for agents (versus simple chatbots) is that the agent gets to decide when retrieval is necessary, what query to use, and how to combine information from multiple sources. We'll get to agentic RAG later - first, let's build the foundation.
The RAG Pipeline
Every RAG system has the same basic pipeline, whether you're using it for a chatbot or a fully autonomous agent.
Documents → Chunking → Embedding → Indexing → [stored in vector DB]
                                                     │
User Query → Query Embedding ──────────────────→ Retrieval → Reranking → Generation
Let's break each stage down.
Document Ingestion
Before you can retrieve anything, you need to get your documents into the system. This is the unglamorous part, and it can easily eat more than half of your project time.
from pathlib import Path

import fitz  # PyMuPDF for PDFs
from bs4 import BeautifulSoup
import tiktoken  # used later for token counting during chunking


class DocumentLoader:
    """Load documents from various sources."""

    def load_pdf(self, path: str) -> str:
        doc = fitz.open(path)
        return "\n\n".join(page.get_text() for page in doc)

    def load_html(self, path: str) -> str:
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        # Remove non-content elements: scripts, styles, navigation, footers
        for element in soup(["script", "style", "nav", "footer"]):
            element.decompose()
        return soup.get_text(separator="\n", strip=True)

    def load_markdown(self, path: str) -> str:
        return Path(path).read_text(encoding="utf-8")

    def load(self, path: str) -> str:
        # Dispatch on the file extension
        suffix = Path(path).suffix.lower()
        loaders = {
            ".pdf": self.load_pdf,
            ".html": self.load_html,
            ".htm": self.load_html,
            ".md": self.load_markdown,
            ".txt": self.load_markdown,  # same loader works
        }
        loader = loaders.get(suffix)
        if not loader:
            raise ValueError(f"Unsupported file type: {suffix}")
        return loader(path)
PDF extraction is where most RAG quality problems start. Tables get mangled, headers repeat on every page, and multi-column layouts produce gibberish. Always inspect your extracted text before indexing. Tools like unstructured.io or docling handle complex PDFs much better than simple text extraction.
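One cheap cleanup for the repeated-header problem is to drop lines that show up on most pages. The sketch below assumes you keep the per-page text as a list of strings before joining it; `strip_repeated_lines` and its 0.6 threshold are illustrative choices, not part of PyMuPDF or any other library.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Remove lines that recur on at least min_fraction of the pages."""
    if len(pages) < 2:
        return pages
    # Count each distinct non-empty line once per page
    line_pages = Counter()
    for page in pages:
        for line in {ln.strip() for ln in page.splitlines() if ln.strip()}:
            line_pages[line] += 1
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in line_pages.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Acme Handbook\nRefunds are issued within 30 days.",
    "Acme Handbook\nEnterprise plans renew annually.",
    "Acme Handbook\nContact support for exceptions.",
]
cleaned_pages = strip_repeated_lines(pages)
```

This only catches lines that are identical across pages. Footers that embed a page number differ on every page and would need a pattern-based rule instead.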
Chunking Strategies
You can't embed entire documents - they're too long, and relevant information gets diluted. You need to split documents into chunks that are small enough to be specific but large enough to be meaningful.
This is the most underrated part of the RAG pipeline. Bad chunking destroys retrieval quality no matter how good your embedding model is.
| Strategy | How It Works | Chunk Size | Best For | Weakness |
|---|---|---|---|---|
| Fixed-size | Split every N tokens with overlap | 256-512 tokens | Simple documents, logs | Cuts mid-sentence, ignores structure |
| Recursive character | Split by paragraphs → sentences → words | Varies | General purpose | Doesn't understand document semantics |
| Semantic | Group sentences by embedding similarity | Varies | Long, topically diverse docs | Slow, requires embedding calls during chunking |
| Document-aware | Split by headers, sections, or page structure | Varies | Structured docs (manuals, APIs) | Requires format-specific parsers |
| Sliding window | Overlapping fixed windows | 256-512 tokens | When context spans chunk boundaries | Redundant storage, retrieval deduplication needed |
My recommendation: start with recursive character splitting at ~500 tokens with 50-token overlap. It's good enough for 80% of use cases. Switch to semantic or document-aware chunking only when retrieval quality plateaus.
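import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Build the tokenizer once rather than looking it up on every length_function call
encoding = tiktoken.encoding_for_model("gpt-4o")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # counted in tokens, because of length_function below
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=lambda text: len(encoding.encode(text)),
)

# document_text is the string returned by DocumentLoader.load()
chunks = splitter.split_text(document_text)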
Always store metadata with your chunks: source document, page number, section header, and position within the document. This metadata is invaluable for debugging retrieval issues and for providing citations in agent responses.
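As a rough illustration of chunks that carry their own provenance, here is a minimal document-aware splitter for Markdown. `TextChunk` and `chunk_markdown_by_header` are hypothetical names made up for this sketch, and it naively treats every line starting with `#` as a heading, so comment lines inside code blocks would be misread as section breaks.

```python
from dataclasses import dataclass, field


@dataclass
class TextChunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_markdown_by_header(text: str, source: str) -> list[TextChunk]:
    """Split Markdown on '#' headings and attach provenance metadata."""
    # Each section is (header, body_lines); text before the first heading
    # lands in a placeholder section
    sections: list[tuple[str, list[str]]] = [("(no header)", [])]
    for line in text.splitlines():
        if line.startswith("#"):
            sections.append((line.lstrip("#").strip(), []))
        else:
            sections[-1][1].append(line)

    chunks = []
    for header, lines in sections:
        body = "\n".join(lines).strip()
        if not body:
            continue  # skip sections with no content
        chunks.append(TextChunk(
            text=body,
            metadata={"source": source, "section": header, "position": len(chunks)},
        ))
    return chunks


doc = "# Refunds\nRefunds take 30 days.\n\n# Billing\nInvoices are monthly."
chunks_with_meta = chunk_markdown_by_header(doc, source="policy.md")
```

With the section header stored next to the text, a bad retrieval can be traced back to the exact part of the source document, and the same field can feed citations.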
Embedding Models
Embeddings convert text into dense vectors that capture semantic meaning. Similar texts produce similar vectors, which is how retrieval works.
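"Similar vectors" usually means high cosine similarity. The toy example below uses hand-written 3-dimensional vectors purely for illustration; real embedding models produce hundreds or thousands of dimensions, and the numbers here are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented vectors standing in for embeddings of a query and two chunks
refund_query = [0.9, 0.1, 0.0]
refund_chunk = [0.8, 0.2, 0.1]
pricing_chunk = [0.1, 0.2, 0.9]

sim_refund = cosine_similarity(refund_query, refund_chunk)    # about 0.98
sim_pricing = cosine_similarity(refund_query, pricing_chunk)  # about 0.13
```

Retrieval is then just a matter of ranking chunks by this score against the query vector.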
| Model | Dimensions | Max Tokens | Relative Quality | Speed | Cost |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | Excellent | Fast (API) | $0.13/M tokens |
| OpenAI text-embedding-3-small | 1536 | 8191 | Very Good | Fast (API) | $0.02/M tokens |
| Cohere embed-v3 | 1024 | 512 | Excellent | Fast (API) | $0.10/M tokens |
| Voyage AI voyage-3 | 1024 | 32000 | Excellent | Fast (API) | $0.06/M tokens |
| BGE-large-en-v1.5 | 1024 | 512 | Good | Self-hosted | Free |
| GTE-large-en-v1.5 | 1024 | 8192 | Very Good | Self-hosted | Free |
| all-MiniLM-L6-v2 | 384 | 256 | Decent | Very Fast | Free |
For most production agents, text-embedding-3-small is the sweet spot - good quality, low cost, and fast. Use text-embedding-3-large when you need maximum retrieval accuracy and cost isn't a primary concern.
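To put the price gap in perspective, here is a back-of-the-envelope estimate using the per-token prices from the table. The corpus size is a hypothetical figure chosen for round numbers, and prices change, so treat the output as an order-of-magnitude guide.

```python
# Prices in USD per million tokens, taken from the table above
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost_usd(total_tokens: int, model: str) -> float:
    """One-off cost of embedding total_tokens with the given model."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical corpus: 2,000 chunks of roughly 500 tokens each
corpus_tokens = 2_000 * 500
small_cost = embedding_cost_usd(corpus_tokens, "text-embedding-3-small")
large_cost = embedding_cost_usd(corpus_tokens, "text-embedding-3-large")
```

For a corpus this small, both models cost cents to ingest. The difference starts to matter with frequent re-indexing, high query volume, and the extra storage that 3072-dimensional vectors need.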
from openai import OpenAI

client = OpenAI()


def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of texts."""
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
Vector Databases
You need somewhere to store your embeddings and perform fast similarity search. Here's how the major options compare:
| Database | Type | Hosting | Filtering | Scalability | Best For |
|---|---|---|---|---|---|
| Pinecone | Managed | Cloud only | Good metadata filtering | Excellent | Production, minimal ops |
| Weaviate | Open-source/Cloud | Self-hosted or cloud | Rich filtering, BM25 hybrid | Very Good | Hybrid search needs |
| ChromaDB | Open-source | In-process or server | Basic filtering | Limited | Prototyping, small datasets |
| pgvector | PostgreSQL extension | Wherever Postgres runs | Full SQL | Good (with HNSW) | Teams already on Postgres |
| Qdrant | Open-source/Cloud | Self-hosted or cloud | Excellent filtering | Very Good | High-performance, filtering-heavy |
| Milvus | Open-source/Cloud | Self-hosted or cloud | Good filtering | Excellent | Very large scale |
If you're already running PostgreSQL, start with pgvector. It's one fewer service to manage, and for datasets under 5 million vectors with HNSW indexing, performance is competitive with dedicated vector databases.
Building a RAG-Powered Agent from Scratch
Let's put it all together. This agent answers questions using a knowledge base of documents.
import openai
import numpy as np
from dataclasses import dataclass, replace


@dataclass
class Chunk:
    text: str
    embedding: list[float]
    metadata: dict  # source, page, section, etc.


class RAGAgent:
    def __init__(self, model: str = "gpt-4o"):
        self.client = openai.OpenAI()
        self.model = model
        self.chunks: list[Chunk] = []
        self.system_prompt = """You are a helpful assistant that answers
questions based on the provided context.

Rules:
1. Only answer based on the provided context
2. If the context doesn't contain the answer, say so clearly
3. Cite your sources using [Source: filename, page X] format
4. If you're uncertain, express your uncertainty"""

    def ingest(self, documents: list[dict]):
        """Ingest documents into the knowledge base."""
        from langchain.text_splitter import RecursiveCharacterTextSplitter

        # No length_function is passed, so these sizes are in characters, not tokens
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=500, chunk_overlap=50
        )

        all_texts = []
        all_metadata = []
        for doc in documents:
            chunks = splitter.split_text(doc["content"])
            for i, chunk_text in enumerate(chunks):
                all_texts.append(chunk_text)
                all_metadata.append({
                    "source": doc["source"],
                    "chunk_index": i,
                    **doc.get("metadata", {})
                })

        # Batch embed all chunks
        embeddings = self._embed_batch(all_texts)
        for text, embedding, metadata in zip(
            all_texts, embeddings, all_metadata
        ):
            self.chunks.append(Chunk(
                text=text, embedding=embedding, metadata=metadata
            ))

        print(f"Ingested {len(self.chunks)} chunks from "
              f"{len(documents)} documents")

    def _embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Embed texts in batches of 100."""
        all_embeddings = []
        for i in range(0, len(texts), 100):
            batch = texts[i:i+100]
            response = self.client.embeddings.create(
                input=batch, model="text-embedding-3-small"
            )
            all_embeddings.extend(
                [item.embedding for item in response.data]
            )
        return all_embeddings

    def retrieve(self, query: str, top_k: int = 5) -> list[Chunk]:
        """Retrieve the most relevant chunks for a query."""
        query_embedding = self._embed_batch([query])[0]
        query_vec = np.array(query_embedding)

        scored_chunks = []
        for chunk in self.chunks:
            chunk_vec = np.array(chunk.embedding)
            # Cosine similarity
            similarity = np.dot(query_vec, chunk_vec) / (
                np.linalg.norm(query_vec) * np.linalg.norm(chunk_vec)
            )
            scored_chunks.append((float(similarity), chunk))

        scored_chunks.sort(key=lambda x: x[0], reverse=True)
        # Return copies that carry their score in metadata["similarity"],
        # so callers can apply a confidence threshold
        return [
            replace(chunk, metadata={**chunk.metadata, "similarity": score})
            for score, chunk in scored_chunks[:top_k]
        ]

    def answer(self, question: str) -> str:
        """Answer a question using RAG."""
        relevant_chunks = self.retrieve(question, top_k=5)

        context = "\n\n---\n\n".join(
            f"[Source: {c.metadata['source']}]\n{c.text}"
            for c in relevant_chunks
        )

        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\n"
                                        f"Question: {question}"}
        ]
        response = self.client.chat.completions.create(
            model=self.model, messages=messages
        )
        return response.choices[0].message.content


# Usage (loaded_pdf_text and loaded_manual_text are strings from DocumentLoader.load())
agent = RAGAgent()
agent.ingest([
    {"source": "refund_policy.pdf", "content": loaded_pdf_text},
    {"source": "product_manual.md", "content": loaded_manual_text},
])
answer = agent.answer("What's the refund window for enterprise plans?")
This is a complete, working RAG agent. In production, you'd swap the in-memory vector storage for a proper vector database, but the architecture is identical.
Advanced RAG Techniques
Basic RAG gets you 70% of the way. These techniques get you the other 30%.
Hybrid Search
Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Combine them.
def hybrid_search(query: str, chunks: list[Chunk],
                  alpha: float = 0.7) -> list[Chunk]:
    """Combine vector similarity with BM25 keyword matching.

    vector_search and bm25_search are assumed helpers that each return
    a list of (score, chunk) pairs over the same chunk objects.
    """
    def normalize(results):
        # Min-max scale to [0, 1]; raw BM25 scores are unbounded, cosine is not
        if not results:
            return {}
        scores = [score for score, _ in results]
        low, high = min(scores), max(scores)
        span = (high - low) or 1.0
        return {id(c): (score - low) / span for score, c in results}

    vector_scores = normalize(vector_search(query, chunks))
    bm25_scores = normalize(bm25_search(query, chunks))

    # Combine with weighted fusion
    combined = {}
    for chunk_id in vector_scores.keys() | bm25_scores.keys():
        v_score = vector_scores.get(chunk_id, 0.0)
        b_score = bm25_scores.get(chunk_id, 0.0)
        combined[chunk_id] = alpha * v_score + (1 - alpha) * b_score

    # Sort by combined score and map ids back to chunks
    chunk_by_id = {id(c): c for c in chunks}
    ranked_ids = sorted(combined, key=combined.get, reverse=True)
    return [chunk_by_id[chunk_id] for chunk_id in ranked_ids]
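The keyword half of hybrid search is usually BM25. In practice you would lean on your search engine or a library for it, but a minimal pure-Python version shows what the score rewards: rare query terms that appear often in a short document. `tokenize` and `bm25_scores` are names invented for this sketch, the tokenizer is deliberately crude, and `k1=1.5`, `b=0.75` are common default values rather than tuned ones.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, texts: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per text, in the same order as texts."""
    docs = [tokenize(t) for t in texts]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: in how many texts each term appears
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


texts = [
    "Error code E4012 means the payment gateway timed out.",
    "Payments can fail for many reasons, including network problems.",
    "Our refund policy covers enterprise plans.",
]
keyword_scores = bm25_scores("E4012", texts)
```

An exact identifier like `E4012` is the kind of query where embeddings tend to struggle and keyword scoring picks out the right chunk immediately: only the first text gets a non-zero score here.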
Reranking
Retrieve broadly (top 20-50 results), then rerank with a cross-encoder model that's much more accurate but slower.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, chunks: list[Chunk], top_k: int = 5) -> list[Chunk]:
    # Score every (query, chunk) pair jointly, then keep the best top_k
    pairs = [(query, chunk.text) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
Query Decomposition
Complex questions often need multiple retrievals. Break them down.
import json

import openai


def decompose_query(query: str, client: openai.OpenAI) -> list[str]:
    """Break a complex query into sub-queries for better retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Break this question into 2-4 simpler sub-questions
that would help retrieve relevant information:

Question: {query}

Return a JSON object of the form {{"questions": ["...", "..."]}}."""
        }],
        # JSON mode returns an object, so the prompt asks for a "questions" key
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    # Fall back to the original query if the expected key is missing
    return result.get("questions", result.get("sub_questions", [query]))
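Once you have sub-queries, you still need to run them and merge the results. One simple approach, sketched below, keeps each chunk once with its best score across all sub-queries. The `(score, chunk_id, text)` hit format, the `retrieve` callback, and the canned results are assumptions made for the example; adapt them to whatever your retriever returns.

```python
from typing import Callable

# (score, chunk_id, text) triples stand in for a retriever's results
Hit = tuple[float, str, str]


def retrieve_for_subqueries(sub_queries: list[str],
                            retrieve: Callable[[str], list[Hit]],
                            top_k: int = 5) -> list[Hit]:
    """Run every sub-query, then keep each chunk once with its best score."""
    best: dict[str, Hit] = {}
    for sub_query in sub_queries:
        for hit in retrieve(sub_query):
            score, chunk_id, _ = hit
            if chunk_id not in best or score > best[chunk_id][0]:
                best[chunk_id] = hit
    return sorted(best.values(), key=lambda h: h[0], reverse=True)[:top_k]


# Stub retriever with canned results, for illustration only
canned = {
    "premium plan features": [
        (0.82, "c1", "Premium includes SSO."),
        (0.40, "c2", "Plans overview."),
    ],
    "premium plan pricing": [
        (0.77, "c3", "Premium is billed per seat."),
        (0.55, "c2", "Plans overview."),
    ],
}
merged = retrieve_for_subqueries(list(canned), lambda q: canned[q], top_k=3)
```

Deduplicating before generation matters because overlapping sub-queries often return the same chunks, and repeated context wastes prompt tokens.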
Agentic RAG: The Agent Decides
Here's where RAG gets interesting for agents. Instead of retrieving on every query, the agent decides when and how to retrieve. Not every question needs a knowledge base lookup - some can be answered from the conversation context or general knowledge.
# Give the agent retrieval as a tool
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search the internal knowledge base for information. "
                           "Use this when you need specific facts, policies, "
                           "product details, or documentation.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query - be specific and "
                                       "include key terms"
                    },
                    "filter_source": {
                        "type": "string",
                        "description": "Optional: filter by source document name"
                    }
                },
                "required": ["query"]
            }
        }
    }
]
The key insight: the agent crafts its own retrieval queries. A user might ask "Is the premium plan worth it?" - a vague question. The agent reformulates this into specific retrievals: "premium plan features," "premium plan pricing," "premium vs basic comparison."
Give your agent a search_knowledge_base tool, not an automatic pre-retrieval step. This lets the agent ask follow-up questions, try different search strategies, and skip retrieval entirely when it's not needed. This is the difference between "RAG" and "Agentic RAG."
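On the application side, a tool definition needs something that executes the call when the model requests it. With the OpenAI chat completions API, a tool call carries a function name and a JSON string of arguments. The sketch below covers only that dispatch step; `dispatch_tool_call`, the registry, and the stand-in `search_knowledge_base` body are hypothetical, and a real version would call your retriever.

```python
import json
from typing import Callable, Optional


def dispatch_tool_call(name: str, arguments_json: str,
                       registry: dict[str, Callable[..., str]]) -> str:
    """Run the function the model asked for and return its result as a string."""
    if name not in registry:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json)
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments were not valid JSON"})
    return registry[name](**arguments)


def search_knowledge_base(query: str, filter_source: Optional[str] = None) -> str:
    # Stand-in for a real retriever: reports what it would search for
    scope = filter_source or "all sources"
    return json.dumps({"query": query, "scope": scope, "results": []})


registry = {"search_knowledge_base": search_knowledge_base}
tool_output = dispatch_tool_call(
    "search_knowledge_base",
    '{"query": "premium plan pricing"}',
    registry,
)
```

Returning errors as strings instead of raising lets the agent see what went wrong and retry with a corrected call.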
Evaluating RAG Quality
You can't improve what you don't measure. Here are the metrics that matter:
| Metric | What It Measures | How to Compute |
|---|---|---|
| Faithfulness | Does the answer only use info from retrieved docs? | LLM-as-judge: compare answer claims to source chunks |
| Relevance | Are the retrieved chunks actually relevant? | LLM-as-judge or human annotation of top-k results |
| Answer correctness | Is the answer factually correct? | Compare against ground truth Q&A pairs |
| Completeness | Does the answer cover all aspects of the question? | LLM-as-judge with rubric |
| Retrieval recall | Did the system find all relevant chunks? | Requires labeled relevant chunks per query |
| Latency | How long does retrieval + generation take? | Wall clock time, p50/p95/p99 |
import json

import openai


def evaluate_faithfulness(answer: str, context_chunks: list[str],
                          client: openai.OpenAI) -> dict:
    """Check if the answer is grounded in the retrieved context."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Evaluate whether this answer is faithful to the
provided context. Identify any claims NOT supported by the context.

Context:
{chr(10).join(context_chunks)}

Answer:
{answer}

Respond with JSON: {{"score": 0.0-1.0, "unsupported_claims": [...]}}"""
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
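Not every metric needs an LLM judge. If you have labeled which chunks are relevant for a set of test queries, retrieval recall and precision at k are simple set arithmetic. The function names and the chunk ids below are made up for illustration.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top k results that are labeled relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant_ids) / len(top)


# Hypothetical labels for one test query
retrieved = ["c7", "c2", "c9", "c4", "c1"]  # ranked retriever output
relevant = {"c2", "c4", "c8"}               # human-labeled relevant chunks

recall_5 = recall_at_k(retrieved, relevant, k=5)        # 2 of 3 relevant chunks found
precision_5 = precision_at_k(retrieved, relevant, k=5)  # 2 of 5 results relevant
```

Averaging these over a few dozen labeled queries gives a baseline to compare against when you change chunking, embeddings, or `top_k`.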
Common RAG Failures and Debugging
When your RAG agent gives bad answers, the problem is almost always in one of these places:
1. Bad chunks. The relevant information exists but was split across two chunks, so neither chunk alone has enough context. Fix: Increase chunk size or overlap.
2. Embedding mismatch. The query and the relevant chunk use different terminology. The user asks about "canceling a subscription" but the docs say "terminating a plan." Fix: Hybrid search (vector + keyword) or query expansion.
3. Retrieved but not used. The right chunk was retrieved but the LLM ignored it in favor of its parametric knowledge. Fix: Stronger system prompt instructions ("ONLY use the provided context") or put the most relevant chunk closest to the question in the prompt.
4. Not retrieved. The knowledge base simply doesn't contain the answer. Fix: Check your ingestion pipeline. Is the document there? Was it parsed correctly?
5. Too much noise. You're retrieving 20 chunks and most are irrelevant, drowning out the good ones. Fix: Reranking, or reduce top_k.
The single most common RAG failure I see in production: developers never look at their chunks. They ingest documents and assume the chunking worked. Always manually inspect 20-30 random chunks before going to production. You'll be surprised how often tables are garbled, code blocks are split mid-function, or section headers are separated from their content.
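Manual inspection goes faster with a small helper that pulls a random sample and flags the obvious suspects. The heuristics below (very short chunks, chunks that are mostly symbols, chunks starting with a lowercase letter) and their thresholds are rough assumptions for this sketch. They will produce false positives and are no substitute for reading the chunks.

```python
import random


def flag_suspicious_chunks(chunks: list[str], min_chars: int = 50,
                           max_symbol_ratio: float = 0.4) -> list[tuple[int, str]]:
    """Return (index, reason) pairs for chunks worth a manual look."""
    flagged = []
    for i, chunk in enumerate(chunks):
        text = chunk.strip()
        if len(text) < min_chars:
            flagged.append((i, "very short"))
            continue
        # Count characters that are neither alphanumeric nor whitespace
        symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
        if symbols / len(text) > max_symbol_ratio:
            flagged.append((i, "mostly symbols"))  # often a mangled table
        elif text[0].islower():
            flagged.append((i, "starts mid-sentence"))
    return flagged


def sample_for_review(chunks: list[str], n: int = 25, seed: int = 0) -> list[str]:
    """Pick up to n random chunks to read by hand; seeded for repeatability."""
    rng = random.Random(seed)
    return rng.sample(chunks, min(n, len(chunks)))


example_chunks = [
    "Refunds are available within 30 days of purchase for all annual plans.",
    "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|",
    "and the remaining balance is credited to the account within five business days.",
    "See above.",
]
flags = flag_suspicious_chunks(example_chunks)
```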
Production RAG: Beyond the Prototype
Caching
Cache at two levels: embedding cache (don't re-embed the same query twice) and result cache (same query, same answer).
import hashlib
import time


class CachedRAG:
    def __init__(self, agent: RAGAgent):
        self.agent = agent
        self._query_cache: dict[str, dict] = {}

    def answer(self, question: str, cache_ttl: int = 3600) -> str:
        # Normalize the question so trivial variations share a cache entry
        cache_key = hashlib.sha256(question.strip().lower().encode()).hexdigest()

        if cache_key in self._query_cache:
            cached = self._query_cache[cache_key]
            # Serve the cached answer only while it is younger than cache_ttl seconds
            if time.time() - cached["timestamp"] < cache_ttl:
                return cached["answer"]

        answer = self.agent.answer(question)
        self._query_cache[cache_key] = {
            "answer": answer, "timestamp": time.time()
        }
        return answer
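The other level, the embedding cache, can be a thin wrapper around whatever function calls the embedding API. This is a minimal in-memory sketch: `EmbeddingCache` is a hypothetical class, and `fake_embed` stands in for a real API call so the example runs offline. A production version would more likely sit on Redis or a database table.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings by (model, text) so repeated texts skip the API call."""

    def __init__(self, embed_fn: Callable[[list[str]], list[list[float]]], model: str):
        self.embed_fn = embed_fn
        self.model = model
        self._cache: dict[str, list[float]] = {}

    def _key(self, text: str) -> str:
        # Include the model name: vectors from different models are not interchangeable
        return hashlib.sha256(f"{self.model}\n{text}".encode()).hexdigest()

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Send only unseen texts, and each distinct text once
        missing = list(dict.fromkeys(
            t for t in texts if self._key(t) not in self._cache
        ))
        if missing:
            for text, vector in zip(missing, self.embed_fn(missing)):
                self._cache[self._key(text)] = vector
        return [self._cache[self._key(t)] for t in texts]


calls = []


def fake_embed(texts: list[str]) -> list[list[float]]:
    # Stand-in for an embedding API: records the call, returns 1-d "vectors"
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


cache = EmbeddingCache(fake_embed, model="text-embedding-3-small")
first = cache.embed(["refund window", "refund window", "pricing"])
second = cache.embed(["pricing", "sso setup"])
```

After both calls, the underlying function has only been asked for three distinct texts, in two batches.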
Index Maintenance
Your knowledge base isn't static. Documents get updated, new ones are added, old ones are removed. You need a strategy:
- Incremental updates: Track document hashes. Re-embed only changed documents.
- Scheduled re-indexing: Full rebuild nightly or weekly, depending on change frequency.
- Version tracking: Keep the previous index version so you can roll back if a new ingestion introduces problems.
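The incremental-update strategy boils down to comparing content hashes. A minimal sketch, assuming you store one hash per source document alongside the index; `plan_index_update` and the file names are illustrative:

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_update(indexed_hashes: dict[str, str],
                      current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with current content and decide what to re-embed."""
    plan = {"add": [], "update": [], "delete": [], "unchanged": []}
    for source, text in current_docs.items():
        digest = content_hash(text)
        if source not in indexed_hashes:
            plan["add"].append(source)
        elif indexed_hashes[source] != digest:
            plan["update"].append(source)
        else:
            plan["unchanged"].append(source)
    # Anything indexed but no longer present should be removed from the index
    plan["delete"] = [s for s in indexed_hashes if s not in current_docs]
    return plan


indexed = {
    "refund_policy.md": content_hash("Refunds within 30 days."),
    "old_faq.md": content_hash("Deprecated FAQ."),
    "pricing.md": content_hash("Pricing v1"),
}
current = {
    "refund_policy.md": "Refunds within 30 days.",
    "pricing.md": "Pricing v2",
    "sso_guide.md": "How to set up SSO.",
}
update_plan = plan_index_update(indexed, current)
```

Only the `add` and `update` entries need new embeddings. For an updated document, delete its old chunks before inserting the new ones, or stale chunks will keep showing up in results.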
Monitoring
Log every retrieval. You want to see:
- What queries are hitting the knowledge base
- What's being retrieved (and what's not)
- Similarity scores - are they trending lower? Your index might be stale.
- Latency breakdowns: embedding time, search time, generation time
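One way to capture all of this is a structured log record per query. The `RetrievalTrace` class below is a hypothetical sketch using only the standard library; the stage names and the example scores are placeholders, and in a real system you would ship the JSON line to your logging or tracing backend.

```python
import json
import time
from contextlib import contextmanager


class RetrievalTrace:
    """Collect per-stage timings and retrieval details for one query."""

    def __init__(self, query: str):
        self.record = {"query": query, "timings_ms": {}, "results": []}

    @contextmanager
    def stage(self, name: str):
        # Time whatever runs inside the with-block and store it under name
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.record["timings_ms"][name] = round(elapsed_ms, 2)

    def add_results(self, results: list[tuple[str, float]]):
        # results: (source, similarity) pairs for the retrieved chunks
        self.record["results"] = [
            {"source": source, "similarity": round(score, 4)}
            for source, score in results
        ]

    def to_json(self) -> str:
        return json.dumps(self.record)


trace = RetrievalTrace("What's the refund window for enterprise plans?")
with trace.stage("embedding"):
    pass  # embed the query here
with trace.stage("search"):
    # Placeholder scores; a real run would log the retriever's output
    trace.add_results([("refund_policy.pdf", 0.71), ("product_manual.md", 0.33)])
with trace.stage("generation"):
    pass  # call the LLM here
log_line = trace.to_json()
```

Logging similarity scores per query is what makes the "trending lower" check possible: you can plot the top score over time and alert when it drifts.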
Case Study: Customer Support Agent with RAG
Let's bring it together with a real-world scenario. You're building a support agent for a SaaS product with 200 pages of documentation.
from pathlib import Path
from typing import Optional


class SupportAgent:
    def __init__(self):
        self.rag = RAGAgent(model="gpt-4o")
        self.rag.system_prompt = """You are a customer support agent for
Acme SaaS. Answer questions using ONLY the provided documentation context.

Guidelines:
- Be helpful and empathetic
- If you don't know, say "I don't have information about that in our
  documentation. Let me connect you with a human agent."
- Always cite the relevant doc section
- For billing questions, provide the info but suggest contacting
  billing@acme.com for account-specific changes
- Never make up features or pricing"""

        # Ingest all documentation
        self._load_docs()

    def _load_docs(self):
        loader = DocumentLoader()
        docs = []
        doc_dir = Path("./knowledge_base")
        for file_path in doc_dir.glob("**/*"):
            # Compare lowercase suffixes so ".PDF" and ".pdf" are both picked up
            if file_path.suffix.lower() in [".md", ".pdf", ".html"]:
                docs.append({
                    "source": file_path.name,
                    "content": loader.load(str(file_path)),
                    "metadata": {
                        "category": file_path.parent.name,
                        "last_updated": file_path.stat().st_mtime
                    }
                })
        self.rag.ingest(docs)

    def handle_query(self, user_message: str,
                     conversation_history: Optional[list] = None) -> dict:
        """Handle a support query with RAG."""
        # conversation_history is accepted for future use; this version ignores it

        # Retrieve relevant docs
        chunks = self.rag.retrieve(user_message, top_k=5)

        # Check retrieval confidence. This reads metadata["similarity"], so
        # retrieve() has to record each chunk's score there; a chunk without
        # one defaults to 0 and counts as low-confidence.
        if not chunks or all(
            c.metadata.get("similarity", 0) < 0.3 for c in chunks
        ):
            return {
                "answer": "I couldn't find relevant information in our "
                          "documentation. Let me connect you with a "
                          "human agent who can help.",
                "escalate": True,
                "sources": []
            }

        # Note: answer() runs its own retrieval for the same query
        answer = self.rag.answer(user_message)
        return {
            "answer": answer,
            "escalate": False,
            "sources": [c.metadata["source"] for c in chunks]
        }
This agent handles the happy path (answer from docs), the failure path (escalate to human), and provides source citations for transparency. It's not fancy, but it works - and it's what 90% of production RAG agents look like under the hood.
The key takeaway from this chapter: RAG quality is determined by your ingestion pipeline, not your LLM. Spend 80% of your time on chunking, embedding selection, and retrieval tuning. The generation step is the easy part.