Memory and Context Management
Here's the dirty secret about LLMs: they have no memory. Every single API call starts from a blank slate. The model doesn't remember your last conversation, your preferences, or even what it said three messages ago - unless you explicitly feed that history back in. This is fundamentally different from how humans think, and it creates a genuine engineering challenge when building agents that need to maintain continuity over time.
If tools are the agent's hands and feet (Chapter 4), memory is its hippocampus. Without it, every interaction is a first date - polite but superficial. With well-designed memory, your agent builds relationships, accumulates knowledge, and gets better at its job over time.
In this chapter, we'll build a complete memory system from the ground up, covering everything from managing the context window to building persistent memory that survives across sessions.
The Memory Problem
Let's be concrete about what we're dealing with. When you call an LLM API, you send a list of messages. The model processes them all at once and generates a response. Then it forgets everything. The next call is completely independent.
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Call 1: The model knows about this exchange
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "My name is Deepak."},
    ],
)
# response: "Nice to meet you, Deepak!"

# Call 2: The model has no idea who Deepak is
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What's my name?"},
    ],
)
# response: "I don't know your name. Could you tell me?"
To make the second call work, you'd need to include the entire history:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "My name is Deepak."},
        {"role": "assistant", "content": "Nice to meet you, Deepak!"},
        {"role": "user", "content": "What's my name?"},
    ],
)
# response: "Your name is Deepak!"
This is fine for short conversations. But what happens when you have 200 messages? Or 2,000? Or when the agent has been running for six months and has accumulated gigabytes of interaction history? You hit the context window limit, and you hit it hard.
Types of Memory
To solve this, we borrow concepts from cognitive science and database design. Agent memory breaks down into three distinct tiers, each serving a different purpose.
Working Memory (Context Window)
This is what the model can see right now - the messages in the current API call. It's fast, accurate, and severely limited. Think of it as your desk: everything on it is immediately accessible, but you can only fit so much.
| Model | Context Window | Approximate Word Equivalent |
|---|---|---|
| GPT-4o | 128K tokens | ~96K words |
| Claude (Sonnet/Opus) | 200K tokens | ~150K words |
| Gemini 1.5 Pro | 2M tokens | ~1.5M words |
| Llama 3.1 (70B) | 128K tokens | ~96K words |
These numbers look generous until you realize that tool definitions, system prompts, and conversation history all eat into the budget. In practice, you often have 60-70% of the window available for actual conversation content.
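To see how quickly fixed overhead eats into the window, here is a minimal sketch of the arithmetic. The function name and the overhead figures (system prompt, tool definitions, response reserve) are illustrative assumptions, not measurements from any particular agent.

```python
def conversation_budget(
    context_window: int,
    system_prompt_tokens: int,
    tool_definition_tokens: int,
    response_reserve_tokens: int,
) -> int:
    """Tokens left for conversation history after fixed overhead."""
    overhead = system_prompt_tokens + tool_definition_tokens + response_reserve_tokens
    return max(context_window - overhead, 0)


# Hypothetical overhead for a tool-heavy agent on a 128K-token model
budget = conversation_budget(
    context_window=128_000,
    system_prompt_tokens=4_000,
    tool_definition_tokens=30_000,
    response_reserve_tokens=8_000,
)
share = budget / 128_000
print(f"{budget} tokens for history ({share:.0%} of the window)")
```

With these assumed numbers, 86,000 tokens remain for history, roughly two thirds of the window.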
Short-Term Memory (Conversation)
This is the full conversation history, including parts that have been pruned from the context window. It persists for the duration of a session or task but not beyond. Think of it as your notebook - you write things down as you go, but you put the notebook away when the project ends.
Long-Term Memory (Persistent)
This is information that survives across sessions, days, and months. User preferences, learned facts, past interactions, accumulated knowledge. Think of it as your filing cabinet or your personal knowledge base - organized, searchable, and always available.
Most agent tutorials stop at short-term memory. That's like building a colleague who forgets everything every time they leave the room. Long-term memory is what makes an agent genuinely useful over time.
Working Memory Management
The context window is your most precious resource. Every token matters. Here are the strategies you need for managing it effectively.
Message Pruning
The simplest approach: when the conversation gets too long, remove old messages. But be smarter than just chopping from the front.
from dataclasses import dataclass
from typing import Optional

import tiktoken


@dataclass
class Message:
    role: str
    content: str
    timestamp: float
    importance: float = 0.5  # 0-1 scale
    token_count: Optional[int] = None

    def count_tokens(self, model: str = "gpt-4o") -> int:
        # Cache the count so each message is encoded only once
        if self.token_count is None:
            enc = tiktoken.encoding_for_model(model)
            self.token_count = len(enc.encode(self.content))
        return self.token_count


class ContextWindowManager:
    """Manages messages to fit within the context window."""

    def __init__(self, max_tokens: int = 120000, reserved_tokens: int = 20000):
        self.max_tokens = max_tokens
        self.reserved_tokens = reserved_tokens  # For system prompt + tools + response
        self.available_tokens = max_tokens - reserved_tokens

    def fit_messages(self, messages: list[Message]) -> list[Message]:
        """Select messages that fit in the context window."""
        total_tokens = sum(m.count_tokens() for m in messages)
        if total_tokens <= self.available_tokens:
            return messages  # Everything fits

        # Always keep: system message (first) and recent messages (last N).
        # Split off the system message first so it is never counted twice.
        system_msg = messages[0] if messages[0].role == "system" else None
        rest = messages[1:] if system_msg else messages
        recent_count = min(10, len(rest))
        recent = rest[-recent_count:]
        middle = rest[:-recent_count]

        # Score middle messages by recency and importance
        scored = []
        for i, msg in enumerate(middle):
            recency_score = i / len(middle)  # Higher = more recent
            scored.append((0.4 * recency_score + 0.6 * msg.importance, msg))

        # Sort by priority (keep highest)
        scored.sort(key=lambda pair: pair[0], reverse=True)

        # Greedily add messages until budget is exhausted
        budget = self.available_tokens
        kept = []
        if system_msg:
            budget -= system_msg.count_tokens()
            kept.append(system_msg)
        for msg in recent:
            budget -= msg.count_tokens()
        for _, msg in scored:
            cost = msg.count_tokens()
            if cost <= budget:
                kept.append(msg)
                budget -= cost

        # Reconstruct in chronological order
        kept.extend(recent)
        kept.sort(key=lambda m: m.timestamp)
        return kept
Summarization
Instead of just dropping messages, summarize them. This preserves information density while reducing token count.
async def summarize_conversation(
    client, messages: list[Message], max_summary_tokens: int = 500
) -> str:
    """Summarize a block of conversation into a concise paragraph."""
    conversation_text = "\n".join(
        f"{m.role}: {m.content}" for m in messages
    )
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # Use a cheaper model for summarization
        messages=[
            {
                "role": "system",
                "content": "Summarize the following conversation concisely. Preserve key facts, decisions, and action items. Be specific - include names, numbers, and dates."
            },
            {"role": "user", "content": conversation_text}
        ],
        max_tokens=max_summary_tokens
    )
    return response.choices[0].message.content


class SummarizingContextManager(ContextWindowManager):
    """Context manager that summarizes old messages instead of dropping them."""

    def __init__(self, client, **kwargs):
        super().__init__(**kwargs)
        self.client = client
        self.summary_buffer: Optional[str] = None

    async def fit_messages_with_summary(self, messages: list[Message]) -> list[Message]:
        total_tokens = sum(m.count_tokens() for m in messages)
        if total_tokens <= self.available_tokens:
            return messages

        # Split: summarize the first half, keep the second half verbatim
        split_point = len(messages) // 2
        to_summarize = messages[:split_point]
        to_keep = messages[split_point:]

        summary = await summarize_conversation(self.client, to_summarize)
        self.summary_buffer = summary
        summary_message = Message(
            role="system",
            content=f"Summary of earlier conversation:\n{summary}",
            timestamp=to_summarize[0].timestamp,
            importance=0.9
        )
        return [summary_message] + to_keep
Use a cheaper, faster model for summarization (like GPT-4o-mini or Claude Haiku). You're doing an internal operation, not user-facing generation. Save your expensive model tokens for the actual reasoning.
Short-Term Memory: Conversation Patterns
Beyond simple message lists, there are several architectural patterns for managing conversation-level memory.
Sliding Window
Keep the last N messages. Simple, predictable, but loses early context:
class SlidingWindowMemory:
    def __init__(self, window_size: int = 20):
        self.window_size = window_size
        self.messages: list[Message] = []

    def add(self, message: Message):
        self.messages.append(message)

    def get_context(self) -> list[Message]:
        return self.messages[-self.window_size:]
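The "loses early context" caveat is easy to see in a small run. The sketch below is a variant, not the class above: it uses plain dicts instead of `Message`, and the pinned system message is an assumed tweak (the name `PinnedSlidingWindow` is hypothetical) so that the window only slides over conversation turns.

```python
from collections import deque


class PinnedSlidingWindow:
    """Sliding window that always keeps the system message (illustrative)."""

    def __init__(self, system_prompt: str, window_size: int = 4):
        self.system = {"role": "system", "content": system_prompt}
        # deque(maxlen=...) silently discards the oldest item when full
        self.recent: deque = deque(maxlen=window_size)

    def add(self, role: str, content: str) -> None:
        self.recent.append({"role": role, "content": content})

    def get_context(self) -> list:
        return [self.system, *self.recent]


memory = PinnedSlidingWindow("You are a helpful assistant.", window_size=4)
memory.add("user", "My name is Deepak.")
memory.add("assistant", "Nice to meet you, Deepak!")
for i in range(3):
    memory.add("user", f"Question {i}")

context = memory.get_context()
```

After five turns with a window of four, the message that introduced the user's name has already fallen out of the context, which is exactly the failure the summary and entity patterns are meant to address.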
Summary Chain
Maintain a running summary that gets updated as the conversation progresses:
class SummaryChainMemory:
    def __init__(self, client, recent_window: int = 10):
        self.client = client
        self.recent_window = recent_window
        self.messages: list[Message] = []
        self.running_summary: str = ""

    async def add(self, message: Message):
        self.messages.append(message)
        # When we accumulate enough messages beyond the window, summarize
        if len(self.messages) > self.recent_window * 2:
            old = self.messages[:self.recent_window]
            to_summarize = old
            if self.running_summary:
                # Carry the previous summary forward so earlier context is not lost
                previous = Message(
                    role="system",
                    content=f"Previous summary: {self.running_summary}",
                    timestamp=old[0].timestamp,
                )
                to_summarize = [previous] + old
            self.running_summary = await summarize_conversation(
                self.client, to_summarize, max_summary_tokens=300
            )
            self.messages = self.messages[self.recent_window:]

    def get_context(self) -> list[dict]:
        context = []
        if self.running_summary:
            context.append({
                "role": "system",
                "content": f"Conversation history summary: {self.running_summary}"
            })
        context.extend([{"role": m.role, "content": m.content} for m in self.messages])
        return context
Entity Memory
Track specific entities (people, projects, concepts) mentioned in conversation and maintain structured records about them:
import json


class EntityMemory:
    def __init__(self):
        self.entities: dict[str, dict] = {}

    async def extract_and_update(self, client, message: str):
        """Use the LLM to extract entity information from a message."""
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Extract entities and facts from this message. Return JSON:
{{"entities": [{{"name": "...", "type": "person|project|company|concept", "facts": ["..."]}}]}}
Message: {message}"""
            }],
            response_format={"type": "json_object"}
        )
        extracted = json.loads(response.choices[0].message.content)
        for entity in extracted.get("entities", []):
            name = entity["name"]
            if name not in self.entities:
                self.entities[name] = {"type": entity["type"], "facts": []}
            self.entities[name]["facts"].extend(entity["facts"])

    def get_context_string(self) -> str:
        if not self.entities:
            return ""
        lines = ["Known entities:"]
        for name, info in self.entities.items():
            facts = "; ".join(info["facts"][-5:])  # Keep last 5 facts
            lines.append(f"- {name} ({info['type']}): {facts}")
        return "\n".join(lines)
Long-Term Memory: Persistent Knowledge
This is where things get serious. Long-term memory requires a storage backend, a retrieval mechanism, and a strategy for what to remember.
Memory Architecture Comparison
| Architecture | Persistence | Query Speed | Capacity | Best For |
|---|---|---|---|---|
| Vector store (Pinecone, Chroma) | Disk/Cloud | ~50-200ms | Millions of entries | Semantic similarity search |
| Knowledge graph (Neo4j) | Disk | ~10-100ms | Millions of relationships | Relationship-heavy data |
| Key-value store (Redis) | Memory/Disk | ~1-5ms | Millions of keys | Fast exact lookups |
| SQL database | Disk | ~5-50ms | Billions of rows | Structured, queryable data |
| Full-text search (Elasticsearch) | Disk | ~10-100ms | Billions of documents | Keyword + fuzzy search |
For most agent use cases, I recommend a hybrid approach: vector store for semantic retrieval combined with a structured database for exact facts.
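A minimal sketch of that hybrid shape, using only the standard library. Everything here is a stand-in chosen for illustration: a dict plays the structured database, a list plays the vector store, and Jaccard word overlap substitutes for embedding similarity. The class and method names are hypothetical.

```python
from typing import Optional


class HybridMemory:
    """Exact facts in a dict, free-form notes searched by similarity."""

    def __init__(self):
        self.facts: dict = {}   # stand-in for a SQL or key-value store
        self.notes: list = []   # stand-in for a vector store

    def set_fact(self, key: str, value: str) -> None:
        self.facts[key] = value

    def add_note(self, text: str) -> None:
        self.notes.append(text)

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        # Jaccard word overlap: a toy substitute for embedding similarity
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    def lookup(self, query: str, fact_key: Optional[str] = None, top_k: int = 2) -> dict:
        """Return the exact fact (if requested and present) plus similar notes."""
        ranked = sorted(
            self.notes, key=lambda n: self._similarity(query, n), reverse=True
        )
        return {
            "fact": self.facts.get(fact_key) if fact_key else None,
            "related": ranked[:top_k],
        }


memory = HybridMemory()
memory.set_fact("preferred_language", "Python")
memory.add_note("User asked about deployment and chose Docker")
memory.add_note("User's team follows agile methodology")
result = memory.lookup(
    "how should we handle deployment", fact_key="preferred_language", top_k=1
)
```

The exact lookup never depends on similarity scores, so a fact such as a stated preference cannot be displaced by a loosely related note; the similarity search covers everything that has no fixed key.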
Building a Memory System
Here's a complete implementation using embeddings and a vector store:
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

import numpy as np


@dataclass
class MemoryEntry:
    """A single memory record."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    content: str = ""
    memory_type: str = "episodic"  # episodic, semantic, procedural
    importance: float = 0.5
    access_count: int = 0
    created_at: datetime = field(default_factory=datetime.now)
    last_accessed: datetime = field(default_factory=datetime.now)
    embedding: Optional[list[float]] = None
    metadata: dict = field(default_factory=dict)


class LongTermMemory:
    """Persistent memory system with embedding-based retrieval."""

    def __init__(self, embedding_client, dimension: int = 1536):
        self.embedding_client = embedding_client
        self.dimension = dimension  # Expected embedding size for the chosen model
        self.memories: dict[str, MemoryEntry] = {}
        self.embeddings: dict[str, np.ndarray] = {}

    async def _embed(self, text: str) -> np.ndarray:
        """Generate embedding for text."""
        response = await self.embedding_client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return np.array(response.data[0].embedding)

    async def store(
        self, content: str, memory_type: str = "episodic",
        importance: float = 0.5, metadata: Optional[dict] = None
    ) -> str:
        """Store a new memory."""
        embedding = await self._embed(content)
        entry = MemoryEntry(
            content=content,
            memory_type=memory_type,
            importance=importance,
            embedding=embedding.tolist(),
            metadata=metadata or {}
        )
        self.memories[entry.id] = entry
        self.embeddings[entry.id] = embedding
        return entry.id

    async def retrieve(
        self, query: str, top_k: int = 5,
        memory_type: Optional[str] = None,
        min_importance: float = 0.0
    ) -> list[MemoryEntry]:
        """Retrieve relevant memories using semantic similarity."""
        query_embedding = await self._embed(query)
        scores = []
        for mem_id, mem_embedding in self.embeddings.items():
            memory = self.memories[mem_id]
            # Filter by type and importance
            if memory_type and memory.memory_type != memory_type:
                continue
            if memory.importance < min_importance:
                continue

            # Cosine similarity
            similarity = np.dot(query_embedding, mem_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(mem_embedding)
            )

            # Boost by recency (exponential decay)
            hours_ago = (datetime.now() - memory.last_accessed).total_seconds() / 3600
            recency_boost = np.exp(-0.01 * hours_ago)

            # Combined score
            final_score = 0.7 * similarity + 0.15 * recency_boost + 0.15 * memory.importance
            scores.append((mem_id, final_score))

        # Sort by score and return top-k
        scores.sort(key=lambda x: x[1], reverse=True)
        results = []
        for mem_id, score in scores[:top_k]:
            memory = self.memories[mem_id]
            memory.access_count += 1
            memory.last_accessed = datetime.now()
            results.append(memory)
        return results

    async def forget(self, min_importance: float = 0.1, max_age_days: int = 90):
        """Remove old, unimportant memories - like natural forgetting."""
        cutoff = datetime.now() - timedelta(days=max_age_days)
        to_remove = [
            mem_id for mem_id, memory in self.memories.items()
            if memory.importance < min_importance
            and memory.last_accessed < cutoff
            and memory.access_count < 3
        ]
        for mem_id in to_remove:
            del self.memories[mem_id]
            del self.embeddings[mem_id]
        return len(to_remove)
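It helps to plug numbers into the combined score used in `retrieve` to see what the weights imply. The sketch below reproduces the same formula with the standard library; the two example memories and their similarity values are made up for illustration.

```python
import math


def combined_score(similarity: float, hours_since_access: float, importance: float) -> float:
    """Same weighting as retrieve(): 70% similarity, 15% recency, 15% importance."""
    recency_boost = math.exp(-0.01 * hours_since_access)
    return 0.7 * similarity + 0.15 * recency_boost + 0.15 * importance


# Hours until the recency boost halves: ln(2) / 0.01
half_life_hours = math.log(2) / 0.01

# A memory touched an hour ago but only loosely related to the query
fresh_but_vague = combined_score(similarity=0.60, hours_since_access=1, importance=0.5)
# A memory untouched for 30 days but closely related to the query
old_but_precise = combined_score(similarity=0.85, hours_since_access=720, importance=0.5)
```

The decay constant of 0.01 gives the recency boost a half-life of about 69 hours. Because recency contributes at most 0.15 to the total, the month-old but closely matching memory still outranks the fresh, loosely matching one under these weights.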
Retrieval Strategies
Not all retrieval is simple similarity search. Depending on the context, you might want different strategies:
Recency-Weighted Retrieval
Give preference to recent memories. Useful for conversational agents where recent context matters most.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


# Method of LongTermMemory
async def retrieve_recent_biased(self, query: str, top_k: int = 5) -> list[MemoryEntry]:
    """Retrieve with strong recency bias."""
    query_embedding = await self._embed(query)
    scores = []
    for mem_id, mem_embedding in self.embeddings.items():
        similarity = cosine_similarity(query_embedding, mem_embedding)
        hours_ago = (datetime.now() - self.memories[mem_id].last_accessed).total_seconds() / 3600
        recency = 1.0 / (1.0 + 0.1 * hours_ago)  # Strong decay
        score = 0.4 * similarity + 0.6 * recency
        scores.append((mem_id, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return [self.memories[mid] for mid, _ in scores[:top_k]]
Importance-Based Retrieval
Prioritize memories that the agent or user explicitly marked as important.
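One way this could look, as a sketch rather than a prescribed method: filter out memories that are unrelated to the query, then rank the remainder with importance as the dominant term. The `ScoredMemory` type, the 0.6 weight, and the 0.3 similarity floor are all assumptions for illustration, and similarity is taken as already computed.

```python
from dataclasses import dataclass


@dataclass
class ScoredMemory:
    content: str
    similarity: float   # assumed precomputed against the query
    importance: float   # 0-1, set by the agent or the user


def retrieve_by_importance(
    candidates: list,
    top_k: int = 3,
    min_similarity: float = 0.3,
    importance_weight: float = 0.6,
) -> list:
    """Rank relevant memories with importance as the dominant signal."""
    # Drop memories that are unrelated to the query, however important
    relevant = [m for m in candidates if m.similarity >= min_similarity]
    relevant.sort(
        key=lambda m: importance_weight * m.importance
        + (1 - importance_weight) * m.similarity,
        reverse=True,
    )
    return relevant[:top_k]


candidates = [
    ScoredMemory("User prefers Python over JavaScript", similarity=0.45, importance=1.0),
    ScoredMemory("User asked about a JavaScript bundler last week", similarity=0.80, importance=0.3),
    ScoredMemory("User's deployment workflow", similarity=0.05, importance=0.9),
]
top = retrieve_by_importance(candidates, top_k=2)
```

The similarity floor matters: without it, a highly important but irrelevant memory (the deployment workflow here) would crowd out memories that actually bear on the query.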
Maximum Marginal Relevance (MMR)
Avoid retrieving five memories that all say the same thing. MMR balances relevance with diversity:
# Method of LongTermMemory. cosine_similarity(a, b) is assumed to return
# the cosine similarity of two embedding vectors.
async def retrieve_mmr(
    self, query: str, top_k: int = 5, diversity: float = 0.3
) -> list[MemoryEntry]:
    """Retrieve diverse, relevant memories using MMR."""
    query_embedding = await self._embed(query)

    # Get initial candidates (more than we need)
    candidates = []
    for mem_id, mem_embedding in self.embeddings.items():
        sim = cosine_similarity(query_embedding, mem_embedding)
        candidates.append((mem_id, sim, mem_embedding))
    candidates.sort(key=lambda x: x[1], reverse=True)
    candidates = candidates[:top_k * 3]

    selected = []
    selected_ids = set()
    selected_embeddings = []
    for _ in range(top_k):
        best_score = float("-inf")  # MMR scores can be negative
        best_idx = -1
        for i, (mem_id, relevance, embedding) in enumerate(candidates):
            if mem_id in selected_ids:
                continue
            # Max similarity to already-selected items
            if selected_embeddings:
                max_sim_to_selected = max(
                    cosine_similarity(embedding, se) for se in selected_embeddings
                )
            else:
                max_sim_to_selected = 0
            # MMR score: relevance minus redundancy
            mmr_score = (1 - diversity) * relevance - diversity * max_sim_to_selected
            if mmr_score > best_score:
                best_score = mmr_score
                best_idx = i
        if best_idx < 0:
            break  # Fewer candidates than top_k
        selected.append(candidates[best_idx])
        selected_ids.add(candidates[best_idx][0])
        selected_embeddings.append(candidates[best_idx][2])
    return [self.memories[mem_id] for mem_id, _, _ in selected]
The Memory Hierarchy Pattern
In practice, you'll use all three memory types together. Here's the pattern:
User query arrives
|
v
[1] Check working memory (current context)
- Is the answer already in the conversation?
|
v
[2] Check short-term memory (session history)
- Was this discussed earlier in the session?
|
v
[3] Check long-term memory (persistent store)
- Do we have stored knowledge about this?
|
v
[4] Use tools to fetch new information
- Query databases, search the web, etc.
|
v
[5] Store important new facts in long-term memory
class HierarchicalMemory:
    """Unified memory system combining all three tiers."""

    def __init__(self, client, embedding_client):
        self.working = ContextWindowManager(max_tokens=120000)
        self.short_term = SummaryChainMemory(client, recent_window=15)
        self.long_term = LongTermMemory(embedding_client)

    @staticmethod
    def _keyword_match(query: str, messages: list[Message]) -> list[Message]:
        # Placeholder search: naive word overlap. Swap in something better as needed.
        terms = set(query.lower().split())
        return [m for m in messages if terms & set(m.content.lower().split())]

    def _search_working(self, query: str) -> list[Message]:
        # Working memory is whatever currently fits in the context window
        visible = self.working.fit_messages(self.short_term.messages)
        return self._keyword_match(query, visible)

    def _search_short_term(self, query: str) -> list[Message]:
        return self._keyword_match(query, self.short_term.messages)

    async def recall(self, query: str) -> dict:
        """Search all memory tiers for relevant information."""
        results = {
            "working_memory": self._search_working(query),
            "short_term": self._search_short_term(query),
            "long_term": await self.long_term.retrieve(query, top_k=5),
        }
        return results

    async def consolidate(self, message: Message):
        """Process a new message across all memory tiers."""
        # Always add to short-term
        await self.short_term.add(message)
        # Selectively store in long-term if important
        if message.importance > 0.7:
            await self.long_term.store(
                content=message.content,
                memory_type="episodic",
                importance=message.importance,
                metadata={"role": message.role, "timestamp": message.timestamp}
            )

    def build_context(self, recent_messages: list[Message], relevant_memories: list[MemoryEntry]) -> list[dict]:
        """Build the final context for the LLM call."""
        context = []
        # Add long-term memory context
        if relevant_memories:
            memory_text = "\n".join(
                f"- [{m.memory_type}] {m.content}" for m in relevant_memories
            )
            context.append({
                "role": "system",
                "content": f"Relevant information from memory:\n{memory_text}"
            })
        # Add conversation summary if available
        short_term_context = self.short_term.get_context()
        context.extend(short_term_context)
        return context
Memory-Augmented Agent: Full Implementation
Let's put it all together into a working agent that remembers across sessions:
import json
import time


class MemoryAugmentedAgent:
    """An agent that maintains memory across conversations."""

    def __init__(self, llm_client, embedding_client, user_id: str):
        self.llm = llm_client
        self.user_id = user_id
        self.memory = HierarchicalMemory(llm_client, embedding_client)
        self.system_prompt = (
            "You are a helpful assistant with access to memory "
            "of past conversations. Use the provided memory context to personalize "
            "your responses and maintain continuity across sessions."
        )

    async def chat(self, user_message: str) -> str:
        # Step 1: Recall relevant memories
        memories = await self.memory.recall(user_message)
        relevant = memories["long_term"]

        # Step 2: Build context with memory
        context = self.memory.build_context(
            recent_messages=self.memory.short_term.messages,
            relevant_memories=relevant
        )

        # Step 3: Add system prompt and user message
        messages = [{"role": "system", "content": self.system_prompt}]
        messages.extend(context)
        messages.append({"role": "user", "content": user_message})

        # Step 4: Generate response
        response = await self.llm.chat.completions.create(
            model="gpt-4o",
            messages=messages
        )
        assistant_message = response.choices[0].message.content

        # Step 5: Store both messages in memory
        user_msg = Message(
            role="user", content=user_message,
            timestamp=time.time(), importance=0.5
        )
        assistant_msg = Message(
            role="assistant", content=assistant_message,
            timestamp=time.time(), importance=0.5
        )
        await self.memory.consolidate(user_msg)
        await self.memory.consolidate(assistant_msg)

        # Step 6: Extract and store important facts
        await self._extract_important_facts(user_message, assistant_message)
        return assistant_message

    async def _extract_important_facts(self, user_msg: str, assistant_msg: str):
        """Use LLM to identify facts worth remembering long-term."""
        response = await self.llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Analyze this exchange and extract facts worth remembering
long-term about the user (preferences, personal details, goals,
important decisions). Return JSON:
{{"facts": [{{"content": "...", "importance": 0.0-1.0, "type": "preference|personal|goal|decision"}}]}}
If nothing is worth remembering, return {{"facts": []}}
User: {user_msg}
Assistant: {assistant_msg}"""
            }],
            response_format={"type": "json_object"}
        )
        extracted = json.loads(response.choices[0].message.content)
        for fact in extracted.get("facts", []):
            # The model may omit fields, so read them defensively
            if fact.get("importance", 0) >= 0.6:
                await self.memory.long_term.store(
                    content=fact["content"],
                    memory_type="semantic",
                    importance=fact["importance"],
                    metadata={"user_id": self.user_id, "fact_type": fact.get("type", "unknown")}
                )
Run the fact extraction asynchronously - fire it off and don't wait for it. The user doesn't need to wait for memory consolidation before getting their response. This keeps the agent responsive while still building its knowledge base.
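A minimal sketch of that fire-and-forget pattern with `asyncio.create_task`. The extraction function here is a stand-in that sleeps instead of calling an LLM, and the module-level `background_tasks` set is an assumed bookkeeping detail: asyncio holds only weak references to tasks, so keeping your own reference prevents a pending task from being garbage-collected.

```python
import asyncio

background_tasks: set = set()
stored_facts: list = []


async def extract_important_facts(user_msg: str, assistant_msg: str) -> None:
    """Stand-in for the slow LLM-based extraction step."""
    await asyncio.sleep(0.05)  # simulate network latency
    stored_facts.append(f"fact from: {user_msg}")


async def chat(user_message: str) -> str:
    reply = f"echo: {user_message}"  # placeholder for the real LLM call
    # Schedule extraction without awaiting it; keep a reference so the
    # task is not garbage-collected before it finishes.
    task = asyncio.create_task(extract_important_facts(user_message, reply))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return reply


async def main() -> tuple:
    reply = await chat("I prefer Python over JavaScript")
    facts_at_reply = len(stored_facts)       # extraction has not finished yet
    await asyncio.gather(*background_tasks)  # e.g. on shutdown, drain pending work
    return reply, facts_at_reply, len(stored_facts)


reply, facts_at_reply, facts_after_drain = asyncio.run(main())
```

Since nothing awaits the task during normal operation, exceptions raised inside it will go unnoticed unless you log them, for example from the done callback.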
Memory and Privacy
Memory creates powerful capabilities, but it also creates serious obligations. You're storing personal information, potentially sensitive data, and behavioral patterns. Here's what you need to think about:
What to Store
- User-stated preferences ("I prefer Python over JavaScript")
- Explicit facts shared by the user ("My team has 5 engineers")
- Task context that will be needed later
- Decisions and their reasoning
What NOT to Store
- Sensitive personal data (SSNs, passwords, financial details) unless absolutely necessary and properly encrypted
- Health information (HIPAA implications)
- Anything the user asks you to forget
- Raw conversation logs beyond the retention period
Compliance Considerations
| Regulation | Requirement | Implementation |
|---|---|---|
| GDPR (EU) | Right to be forgotten | Implement delete_user_memories(user_id) |
| GDPR | Data minimization | Only store what's necessary |
| CCPA (California) | Right to know | Implement export_user_memories(user_id) |
| HIPAA (Health) | Data protection | Encrypt at rest and in transit |
| SOC 2 | Access controls | Log all memory access, implement RBAC |
import logging
import re

logger = logging.getLogger(__name__)


class PrivacyAwareMemory(LongTermMemory):
    """Memory system with privacy controls."""

    async def forget_user(self, user_id: str) -> int:
        """GDPR right to be forgotten - delete all memories for a user."""
        to_remove = [
            mem_id for mem_id, memory in self.memories.items()
            if memory.metadata.get("user_id") == user_id
        ]
        for mem_id in to_remove:
            del self.memories[mem_id]
            del self.embeddings[mem_id]
        logger.info("Deleted %d memories for user %s", len(to_remove), user_id)
        return len(to_remove)

    async def export_user_data(self, user_id: str) -> list[dict]:
        """CCPA right to know - export all stored data for a user."""
        return [
            {
                "content": memory.content,
                "type": memory.memory_type,
                "created_at": memory.created_at.isoformat(),
                "importance": memory.importance,
            }
            for memory in self.memories.values()
            if memory.metadata.get("user_id") == user_id
        ]

    async def store(self, content: str, **kwargs) -> str:
        """Override store to redact sensitive content before it is saved."""
        # Redact common sensitive patterns before storing
        redacted = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN_REDACTED]', content)
        # 16-digit card numbers, with or without space/dash separators
        redacted = re.sub(r'\b(?:\d[ -]?){15}\d\b', '[CARD_REDACTED]', redacted)
        redacted = re.sub(r'(?i)password\s*[:=]\s*\S+', 'password: [REDACTED]', redacted)
        return await super().store(redacted, **kwargs)
Memory privacy isn't optional. Regulations such as GDPR and CCPA make it a legal requirement in many jurisdictions. Build the delete and export functionality from day one. Retrofitting privacy controls into an existing memory system is painful and error-prone.
Common Memory Patterns
Episodic Memory
Records of specific events and interactions. "On March 15th, the user asked about deployment and I recommended Docker."
Semantic Memory
General knowledge and facts. "The user's company uses AWS. Their team follows agile methodology."
Procedural Memory
Learned procedures and workflows. "When the user says 'deploy', they mean push to the staging environment first, run tests, then promote to production."
# Procedural memory: store learned workflows
await memory.store(
    content="User's deployment workflow: 1) Push to staging 2) Run integration tests 3) Get approval from #releases channel 4) Deploy to production 5) Monitor for 30 minutes",
    memory_type="procedural",
    importance=0.9,
    metadata={"user_id": "deepak", "category": "workflow", "trigger": "deploy"}
)
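Storing a `trigger` in the metadata only pays off if something checks for it later. The sketch below shows one simple way to do that with an exact keyword match; the `match_procedure` helper and the second ("hotfix") workflow are hypothetical, and a real system might combine this with semantic retrieval.

```python
from typing import Optional

procedural_memories = [
    {
        "content": "Push to staging, run integration tests, get approval, deploy to production",
        "trigger": "deploy",
    },
    {
        "content": "Create a branch, open a pull request, request review from the team lead",
        "trigger": "hotfix",
    },
]


def match_procedure(user_message: str, memories: list) -> Optional[str]:
    """Return the first stored workflow whose trigger word appears in the message."""
    words = {w.strip(".,!?").lower() for w in user_message.split()}
    for memory in memories:
        if memory["trigger"] in words:
            return memory["content"]
    return None


workflow = match_procedure("Can you deploy the billing service?", procedural_memories)
```

When a workflow matches, it can be injected into the system message so the agent follows the user's own procedure rather than a generic one.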
Context Window Strategies for Long-Running Agents
When an agent runs for hours - processing data, monitoring systems, executing multi-step plans - context window management becomes critical. Here are battle-tested strategies:
- Checkpoint and Reset: Periodically summarize progress, store it in long-term memory, and start a fresh context window with just the summary.
- Scratchpad Pattern: Maintain a structured "scratchpad" in the system message that tracks current state, completed steps, and pending tasks.
- Event Sourcing: Log every significant event and reconstruct state from the event log when needed.
class AgentScratchpad:
    """Structured working memory for long-running agents."""

    def __init__(self):
        self.current_goal: str = ""
        self.completed_steps: list[str] = []
        self.pending_steps: list[str] = []
        self.key_findings: list[str] = []
        self.errors: list[str] = []

    def to_system_message(self) -> str:
        return f"""## Agent State
**Goal**: {self.current_goal}
**Completed**: {chr(10).join(f'- {s}' for s in self.completed_steps[-5:])}
**Pending**: {chr(10).join(f'- {s}' for s in self.pending_steps[:5])}
**Key findings**: {chr(10).join(f'- {f}' for f in self.key_findings[-5:])}
**Errors**: {chr(10).join(f'- {e}' for e in self.errors[-3:])}"""
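Event sourcing, the third strategy listed above, can be sketched in a similar spirit. The event types and the `AgentState` shape below are assumptions chosen for illustration; the point is that the append-only log is the source of truth and the state is rebuilt from it on demand.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    completed_steps: list = field(default_factory=list)
    key_findings: list = field(default_factory=list)
    errors: list = field(default_factory=list)


def replay(events: list) -> AgentState:
    """Rebuild the agent's state by applying events in order."""
    state = AgentState()
    for event in events:
        kind, payload = event["type"], event["payload"]
        if kind == "step_completed":
            state.completed_steps.append(payload)
        elif kind == "finding":
            state.key_findings.append(payload)
        elif kind == "error":
            state.errors.append(payload)
        # Unknown event types are ignored so old logs stay replayable
    return state


# The append-only log is the source of truth; state is derived from it
event_log = [
    {"type": "step_completed", "payload": "Loaded sales data"},
    {"type": "finding", "payload": "Q3 revenue dipped 12%"},
    {"type": "error", "payload": "Timeout calling pricing API"},
    {"type": "step_completed", "payload": "Retried pricing API"},
]
state = replay(event_log)
```

Because the log lives outside the context window, the agent can reset its context at any point and recover its position by replaying events, or replay only a prefix to inspect an earlier state.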
Memory is what transforms a stateless language model into a persistent, learning agent. The techniques in this chapter - from simple sliding windows to full hierarchical memory with privacy controls - give you the building blocks to create agents that don't just answer questions, but build genuine understanding over time. Start simple with conversation buffering, add summarization when context windows get tight, and invest in long-term memory when you're ready to create agents that truly learn from experience.