Memory and Context Management

Here's the dirty secret about LLMs: they have no memory. Every single API call starts from a blank slate. The model doesn't remember your last conversation, your preferences, or even what it said three messages ago - unless you explicitly feed that history back in. This is fundamentally different from how humans think, and it creates a genuine engineering challenge when building agents that need to maintain continuity over time.

If tools are the agent's hands and feet (Chapter 4), memory is its hippocampus. Without it, every interaction is a first date - polite but superficial. With well-designed memory, your agent builds relationships, accumulates knowledge, and gets better at its job over time.

In this chapter, we'll build a complete memory system from the ground up, covering everything from managing the context window to building persistent memory that survives across sessions.

The Memory Problem

Let's be concrete about what we're dealing with. When you call an LLM API, you send a list of messages. The model processes them all at once and generates a response. Then it forgets everything. The next call is completely independent.

# Call 1: The model knows about this exchange
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "My name is Deepak."},
    ]
)
# response: "Nice to meet you, Deepak!"

# Call 2: The model has no idea who Deepak is
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What's my name?"},
    ]
)
# response: "I don't know your name. Could you tell me?"

To make the second call work, you'd need to include the entire history:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "My name is Deepak."},
        {"role": "assistant", "content": "Nice to meet you, Deepak!"},
        {"role": "user", "content": "What's my name?"},
    ]
)
# response: "Your name is Deepak!"

This is fine for short conversations. But what happens when you have 200 messages? Or 2,000? Or when the agent has been running for six months and has accumulated gigabytes of interaction history? You hit the context window limit, and you hit it hard.
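The resend-everything pattern is usually wrapped in a small buffer so the history list is not rebuilt by hand on every call. The following is a minimal sketch; `ConversationBuffer` is a hypothetical helper introduced here for illustration, not part of any SDK.

```python
from typing import Optional


class ConversationBuffer:
    """Accumulates turns so each API call can be sent the full history."""

    def __init__(self, system_prompt: Optional[str] = None):
        self.messages: list[dict] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def as_messages(self) -> list[dict]:
        # Return a copy so callers cannot mutate the stored history
        return list(self.messages)


buffer = ConversationBuffer()
buffer.add("user", "My name is Deepak.")
buffer.add("assistant", "Nice to meet you, Deepak!")
buffer.add("user", "What's my name?")
# buffer.as_messages() is what would be passed as `messages=` on the next call
```

The buffer grows without bound, which is exactly the problem the rest of the chapter addresses.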

Types of Memory

To solve this, we borrow concepts from cognitive science and database design. Agent memory breaks down into three distinct tiers, each serving a different purpose.

Working Memory (Context Window)

This is what the model can see right now - the messages in the current API call. It's fast, accurate, and severely limited. Think of it as your desk: everything on it is immediately accessible, but you can only fit so much.

Model | Context Window | Approximate Equivalent
GPT-4o | 128K tokens | ~96K words
Claude (Sonnet/Opus) | 200K tokens | ~150K words
Gemini 1.5 Pro | 2M tokens | ~1.5M words
Llama 3.1 (70B) | 128K tokens | ~96K words

These numbers look generous until you realize that tool definitions, system prompts, and conversation history all eat into the budget. In practice, you often have 60-70% of the window available for actual conversation content.
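To make the overhead concrete, here is a small budgeting sketch. The token counts passed in are illustrative assumptions, not measurements from any particular system, and `conversation_budget` is a hypothetical helper.

```python
def conversation_budget(
    context_window: int,
    system_prompt_tokens: int,
    tool_definition_tokens: int,
    response_reserve_tokens: int,
) -> int:
    """Tokens left for conversation history after fixed overhead."""
    overhead = (
        system_prompt_tokens + tool_definition_tokens + response_reserve_tokens
    )
    return max(0, context_window - overhead)


# Illustrative numbers only (assumptions)
available = conversation_budget(
    context_window=128_000,
    system_prompt_tokens=2_000,
    tool_definition_tokens=30_000,
    response_reserve_tokens=8_000,
)
share = available / 128_000  # fraction of the window left for conversation
```

With these assumed numbers, 88,000 tokens (about 69% of the window) remain for conversation content.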

Short-Term Memory (Conversation)

This is the full conversation history, including parts that have been pruned from the context window. It persists for the duration of a session or task but not beyond. Think of it as your notebook - you write things down as you go, but you put the notebook away when the project ends.

Long-Term Memory (Persistent)

This is information that survives across sessions, days, and months. User preferences, learned facts, past interactions, accumulated knowledge. Think of it as your filing cabinet or your personal knowledge base - organized, searchable, and always available.

Note

Most agent tutorials stop at short-term memory. That's like building a colleague who forgets everything every time they leave the room. Long-term memory is what makes an agent genuinely useful over time.

Working Memory Management

The context window is your most precious resource. Every token matters. Here are the strategies you need for managing it effectively.

Message Pruning

The simplest approach: when the conversation gets too long, remove old messages. But be smarter than just chopping from the front.

from dataclasses import dataclass
from typing import Optional
import tiktoken


@dataclass
class Message:
    role: str
    content: str
    timestamp: float
    importance: float = 0.5  # 0-1 scale
    token_count: Optional[int] = None

    def count_tokens(self, model: str = "gpt-4o") -> int:
        if self.token_count is None:
            enc = tiktoken.encoding_for_model(model)
            self.token_count = len(enc.encode(self.content))
        return self.token_count


class ContextWindowManager:
    """Manages messages to fit within the context window."""

    def __init__(self, max_tokens: int = 120000, reserved_tokens: int = 20000):
        self.max_tokens = max_tokens
        self.reserved_tokens = reserved_tokens  # For system prompt + tools + response
        self.available_tokens = max_tokens - reserved_tokens

    def fit_messages(self, messages: list[Message]) -> list[Message]:
        """Select messages that fit in the context window."""
        total_tokens = sum(m.count_tokens() for m in messages)

        if total_tokens <= self.available_tokens:
            return messages  # Everything fits

        # Always keep: system message (first) and recent messages (last N).
        # Separate the system message first so it is never counted twice.
        system_msg = messages[0] if messages[0].role == "system" else None
        rest = messages[1:] if system_msg else list(messages)
        recent_count = min(10, len(rest))
        recent = rest[-recent_count:] if recent_count else []
        middle = rest[:-recent_count] if recent_count else []

        # Score middle messages by recency and importance
        priorities = {}
        for i, msg in enumerate(middle):
            recency_score = i / len(middle)  # Higher = more recent
            priorities[id(msg)] = 0.4 * recency_score + 0.6 * msg.importance

        # Sort by priority (keep highest)
        middle.sort(key=lambda m: priorities[id(m)], reverse=True)

        # Greedily add messages until budget is exhausted
        budget = self.available_tokens
        kept = []

        if system_msg:
            budget -= system_msg.count_tokens()
            kept.append(system_msg)

        for msg in recent:
            budget -= msg.count_tokens()

        for msg in middle:
            cost = msg.count_tokens()
            if cost <= budget:
                kept.append(msg)
                budget -= cost

        # Reconstruct in chronological order
        kept.extend(recent)
        kept.sort(key=lambda m: m.timestamp)
        return kept

Summarization

Instead of just dropping messages, summarize them. This keeps the key information while reducing the token count.

async def summarize_conversation(
    client, messages: list[Message], max_summary_tokens: int = 500
) -> str:
    """Summarize a block of conversation into a concise paragraph."""
    conversation_text = "\n".join(
        f"{m.role}: {m.content}" for m in messages
    )
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # Use a cheaper model for summarization
        messages=[
            {
                "role": "system",
                "content": "Summarize the following conversation concisely. Preserve key facts, decisions, and action items. Be specific - include names, numbers, and dates."
            },
            {"role": "user", "content": conversation_text}
        ],
        max_tokens=max_summary_tokens
    )
    return response.choices[0].message.content


class SummarizingContextManager(ContextWindowManager):
    """Context manager that summarizes old messages instead of dropping them."""

    def __init__(self, client, **kwargs):
        super().__init__(**kwargs)
        self.client = client
        self.summary_buffer: Optional[str] = None

    async def fit_messages_with_summary(self, messages: list[Message]) -> list[Message]:
        total_tokens = sum(m.count_tokens() for m in messages)

        if total_tokens <= self.available_tokens:
            return messages

        # Split: summarize the first half, keep the second half verbatim
        split_point = len(messages) // 2
        to_summarize = messages[:split_point]
        to_keep = messages[split_point:]

        summary = await summarize_conversation(self.client, to_summarize)
        self.summary_buffer = summary

        summary_message = Message(
            role="system",
            content=f"Summary of earlier conversation:\n{summary}",
            timestamp=to_summarize[0].timestamp,
            importance=0.9
        )

        return [summary_message] + to_keep

Tip

Use a cheaper, faster model for summarization (like GPT-4o-mini or Claude Haiku). You're doing an internal operation, not user-facing generation. Save your expensive model tokens for the actual reasoning.

Short-Term Memory: Conversation Patterns

Beyond simple message lists, there are several architectural patterns for managing conversation-level memory.

Sliding Window

Keep the last N messages. Simple, predictable, but loses early context:

class SlidingWindowMemory:
    def __init__(self, window_size: int = 20):
        self.window_size = window_size
        self.messages: list[Message] = []

    def add(self, message: Message):
        self.messages.append(message)

    def get_context(self) -> list[Message]:
        return self.messages[-self.window_size:]

Summary Chain

Maintain a running summary that gets updated as the conversation progresses:

class SummaryChainMemory:
    def __init__(self, client, recent_window: int = 10):
        self.client = client
        self.recent_window = recent_window
        self.messages: list[Message] = []
        self.running_summary: str = ""

    async def add(self, message: Message):
        self.messages.append(message)

        # When we accumulate enough messages beyond the window, summarize
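        if len(self.messages) > self.recent_window * 2:
            old = self.messages[:self.recent_window]
            combined = f"Previous summary: {self.running_summary}\n\nNew messages:\n"
            combined += "\n".join(f"{m.role}: {m.content}" for m in old)

            # Summarize the previous summary together with the old messages,
            # so earlier context is carried forward rather than discarded
            carry = Message(
                role="system", content=combined, timestamp=old[-1].timestamp
            )
            self.running_summary = await summarize_conversation(
                self.client, [carry], max_summary_tokens=300
            )
            self.messages = self.messages[self.recent_window:]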

    def get_context(self) -> list[dict]:
        context = []
        if self.running_summary:
            context.append({
                "role": "system",
                "content": f"Conversation history summary: {self.running_summary}"
            })
        context.extend([{"role": m.role, "content": m.content} for m in self.messages])
        return context

Entity Memory

Track specific entities (people, projects, concepts) mentioned in conversation and maintain structured records about them:

import json


class EntityMemory:
    def __init__(self):
        self.entities: dict[str, dict] = {}

    async def extract_and_update(self, client, message: str):
        """Use the LLM to extract entity information from a message."""
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Extract entities and facts from this message. Return JSON:
                {{"entities": [{{"name": "...", "type": "person|project|company|concept", "facts": ["..."]}}]}}

                Message: {message}"""
            }],
            response_format={"type": "json_object"}
        )
        extracted = json.loads(response.choices[0].message.content)

        for entity in extracted.get("entities", []):
            name = entity["name"]
            if name not in self.entities:
                self.entities[name] = {"type": entity["type"], "facts": []}
            self.entities[name]["facts"].extend(entity["facts"])

    def get_context_string(self) -> str:
        if not self.entities:
            return ""
        lines = ["Known entities:"]
        for name, info in self.entities.items():
            facts = "; ".join(info["facts"][-5:])  # Keep last 5 facts
            lines.append(f"- {name} ({info['type']}): {facts}")
        return "\n".join(lines)

Long-Term Memory: Persistent Knowledge

This is where things get serious. Long-term memory requires a storage backend, a retrieval mechanism, and a strategy for what to remember.

Memory Architecture Comparison

Architecture | Persistence | Query Speed | Capacity | Best For
Vector store (Pinecone, Chroma) | Disk/Cloud | ~50-200ms | Millions of entries | Semantic similarity search
Knowledge graph (Neo4j) | Disk | ~10-100ms | Millions of relationships | Relationship-heavy data
Key-value store (Redis) | Memory/Disk | ~1-5ms | Millions of keys | Fast exact lookups
SQL database | Disk | ~5-50ms | Billions of rows | Structured, queryable data
Full-text search (Elasticsearch) | Disk | ~10-100ms | Billions of documents | Keyword + fuzzy search

For most agent use cases, I recommend a hybrid approach: vector store for semantic retrieval combined with a structured database for exact facts.
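One way the hybrid approach might look: exact facts live in a SQL table and are checked first, with semantic search as the fallback. This is a minimal sketch under stated assumptions. `HybridMemory` and its method names are hypothetical, the structured store is an in-memory SQLite database, and `semantic_search` is an injected callable standing in for whatever vector store you use.

```python
import sqlite3
from typing import Callable, Optional


class HybridMemory:
    """Exact facts in SQL, everything else via an injected semantic search."""

    def __init__(self, semantic_search: Callable[[str], list[str]]):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE facts ("
            "user_id TEXT, key TEXT, value TEXT, PRIMARY KEY (user_id, key))"
        )
        self.semantic_search = semantic_search

    def set_fact(self, user_id: str, key: str, value: str) -> None:
        # INSERT OR REPLACE keeps one current value per (user, key)
        self.db.execute(
            "INSERT OR REPLACE INTO facts VALUES (?, ?, ?)", (user_id, key, value)
        )
        self.db.commit()

    def get_fact(self, user_id: str, key: str) -> Optional[str]:
        row = self.db.execute(
            "SELECT value FROM facts WHERE user_id = ? AND key = ?",
            (user_id, key),
        ).fetchone()
        return row[0] if row else None

    def recall(self, user_id: str, key: str, query: str) -> list[str]:
        """Prefer the exact fact; fall back to semantic retrieval."""
        exact = self.get_fact(user_id, key)
        if exact is not None:
            return [exact]
        return self.semantic_search(query)


# Usage with a stub in place of a real vector store
memory = HybridMemory(semantic_search=lambda q: [f"semantic hit for: {q}"])
memory.set_fact("deepak", "preferred_language", "Python")
exact_hit = memory.recall("deepak", "preferred_language", "what language?")
fallback_hit = memory.recall("deepak", "editor", "which editor?")
```

The exact store answers questions where a wrong answer is costly (names, settings, IDs), while the vector store handles fuzzy recall.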

Building a Memory System

Here's a complete implementation using embeddings and a vector store:

import uuid
import numpy as np
from datetime import datetime, timedelta
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryEntry:
    """A single memory record."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    content: str = ""
    memory_type: str = "episodic"  # episodic, semantic, procedural
    importance: float = 0.5
    access_count: int = 0
    created_at: datetime = field(default_factory=datetime.now)
    last_accessed: datetime = field(default_factory=datetime.now)
    embedding: Optional[list[float]] = None
    metadata: dict = field(default_factory=dict)


class LongTermMemory:
    """Persistent memory system with embedding-based retrieval."""

    def __init__(self, embedding_client, dimension: int = 1536):
        self.embedding_client = embedding_client
        self.memories: dict[str, MemoryEntry] = {}
        self.embeddings: dict[str, np.ndarray] = {}

    async def _embed(self, text: str) -> np.ndarray:
        """Generate embedding for text."""
        response = await self.embedding_client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return np.array(response.data[0].embedding)

    async def store(
        self, content: str, memory_type: str = "episodic",
        importance: float = 0.5, metadata: Optional[dict] = None
    ) -> str:
        """Store a new memory."""
        embedding = await self._embed(content)
        entry = MemoryEntry(
            content=content,
            memory_type=memory_type,
            importance=importance,
            embedding=embedding.tolist(),
            metadata=metadata or {}
        )
        self.memories[entry.id] = entry
        self.embeddings[entry.id] = embedding
        return entry.id

    async def retrieve(
        self, query: str, top_k: int = 5,
        memory_type: Optional[str] = None,
        min_importance: float = 0.0
    ) -> list[MemoryEntry]:
        """Retrieve relevant memories using semantic similarity."""
        query_embedding = await self._embed(query)

        scores = []
        for mem_id, mem_embedding in self.embeddings.items():
            memory = self.memories[mem_id]

            # Filter by type and importance
            if memory_type and memory.memory_type != memory_type:
                continue
            if memory.importance < min_importance:
                continue

            # Cosine similarity
            similarity = np.dot(query_embedding, mem_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(mem_embedding)
            )

            # Boost by recency (exponential decay)
            hours_ago = (datetime.now() - memory.last_accessed).total_seconds() / 3600
            recency_boost = np.exp(-0.01 * hours_ago)

            # Combined score
            final_score = 0.7 * similarity + 0.15 * recency_boost + 0.15 * memory.importance
            scores.append((mem_id, final_score))

        # Sort by score and return top-k
        scores.sort(key=lambda x: x[1], reverse=True)
        results = []
        for mem_id, score in scores[:top_k]:
            memory = self.memories[mem_id]
            memory.access_count += 1
            memory.last_accessed = datetime.now()
            results.append(memory)

        return results

    async def forget(self, min_importance: float = 0.1, max_age_days: int = 90):
        """Remove old, unimportant memories - like natural forgetting."""
        cutoff = datetime.now() - timedelta(days=max_age_days)
        to_remove = [
            mem_id for mem_id, memory in self.memories.items()
            if memory.importance < min_importance
            and memory.last_accessed < cutoff
            and memory.access_count < 3
        ]
        for mem_id in to_remove:
            del self.memories[mem_id]
            del self.embeddings[mem_id]
        return len(to_remove)
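Note that the class above keeps everything in Python dictionaries, so its contents disappear when the process exits. To survive across sessions the records need to be written somewhere durable. The sketch below shows one simple option, a JSON file. `save_entries` and `load_entries` are hypothetical helpers, and `DemoEntry` is a stand-in for a record type like `MemoryEntry`.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime
from pathlib import Path


def save_entries(entries: list, path: str) -> None:
    """Write dataclass records to JSON, encoding datetimes as ISO strings."""
    def encode(value):
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    payload = [asdict(entry) for entry in entries]
    Path(path).write_text(json.dumps(payload, default=encode), encoding="utf-8")


def load_entries(
    path: str, entry_cls, datetime_fields=("created_at", "last_accessed")
) -> list:
    """Rebuild records of type `entry_cls` from a file written by save_entries."""
    file = Path(path)
    if not file.exists():
        return []
    records = json.loads(file.read_text(encoding="utf-8"))
    for record in records:
        for name in datetime_fields:
            if isinstance(record.get(name), str):
                record[name] = datetime.fromisoformat(record[name])
    return [entry_cls(**record) for record in records]


# Usage with a stand-in record type
@dataclass
class DemoEntry:
    content: str = ""
    created_at: datetime = field(default_factory=datetime.now)


with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "memories.json")
    original = [DemoEntry(content="Prefers Python over JavaScript")]
    save_entries(original, path)
    restored = load_entries(path, DemoEntry)
```

After loading, the `embeddings` dictionary would need to be rebuilt from each record's stored `embedding` list. For anything beyond a prototype, a real vector store or database is the more likely choice.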

Retrieval Strategies

Not all retrieval is simple similarity search. Depending on the context, you might want different strategies:

Recency-Weighted Retrieval

Give preference to recent memories. Useful for conversational agents where recent context matters most.

async def retrieve_recent_biased(self, query: str, top_k: int = 5) -> list[MemoryEntry]:
    """Retrieve with strong recency bias."""
    query_embedding = await self._embed(query)
    scores = []
    for mem_id, mem_embedding in self.embeddings.items():
        similarity = cosine_similarity(query_embedding, mem_embedding)
        hours_ago = (datetime.now() - self.memories[mem_id].last_accessed).total_seconds() / 3600
        recency = 1.0 / (1.0 + 0.1 * hours_ago)  # Strong decay
        score = 0.4 * similarity + 0.6 * recency
        scores.append((mem_id, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    return [self.memories[mid] for mid, _ in scores[:top_k]]
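The method above, and the MMR method later in this section, call a `cosine_similarity` helper that the chapter does not define. A minimal sketch of one possible definition follows. It is written in plain Python so it accepts lists and NumPy arrays alike, and returning 0.0 for zero-length vectors is a choice made here, not something the original specifies.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero magnitude, which avoids
    a division by zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```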

Importance-Based Retrieval

Prioritize memories that the agent or user explicitly marked as important.
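The chapter gives no code for this strategy, so here is one way it might be sketched: filter out clearly irrelevant memories, then rank mostly by importance. `ScoredMemory`, `rank_by_importance`, the similarity floor, and the 0.7 weighting are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScoredMemory:
    content: str
    similarity: float  # semantic similarity to the query, precomputed
    importance: float  # 0-1, set by the agent or the user


def rank_by_importance(
    candidates: list[ScoredMemory],
    top_k: int = 5,
    min_similarity: float = 0.2,
    importance_weight: float = 0.7,
) -> list[ScoredMemory]:
    """Drop clearly irrelevant memories, then rank mostly by importance."""
    relevant = [c for c in candidates if c.similarity >= min_similarity]
    relevant.sort(
        key=lambda c: importance_weight * c.importance
        + (1 - importance_weight) * c.similarity,
        reverse=True,
    )
    return relevant[:top_k]


ranked = rank_by_importance([
    ScoredMemory("very similar, low importance", similarity=0.9, importance=0.1),
    ScoredMemory("loosely related, high importance", similarity=0.3, importance=0.9),
    ScoredMemory("unrelated, high importance", similarity=0.1, importance=1.0),
])
```

The similarity floor matters: without it, an important but unrelated memory would crowd out everything else on every query.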

Maximum Marginal Relevance (MMR)

Avoid retrieving five memories that all say the same thing. MMR balances relevance with diversity:

async def retrieve_mmr(
    self, query: str, top_k: int = 5, diversity: float = 0.3
) -> list[MemoryEntry]:
    """Retrieve diverse, relevant memories using MMR."""
    query_embedding = await self._embed(query)

    # Get initial candidates (more than we need)
    candidates = []
    for mem_id, mem_embedding in self.embeddings.items():
        sim = cosine_similarity(query_embedding, mem_embedding)
        candidates.append((mem_id, sim, mem_embedding))
    candidates.sort(key=lambda x: x[1], reverse=True)
    candidates = candidates[:top_k * 3]

    selected = []
    selected_embeddings = []

    for _ in range(top_k):
        best_score = float("-inf")  # MMR scores can be as low as -1
        best_idx = -1
        for i, (mem_id, relevance, embedding) in enumerate(candidates):
            if mem_id in [s[0] for s in selected]:
                continue
            # Max similarity to already-selected items
            if selected_embeddings:
                max_sim_to_selected = max(
                    cosine_similarity(embedding, se) for se in selected_embeddings
                )
            else:
                max_sim_to_selected = 0
            # MMR score: relevance minus redundancy
            mmr_score = (1 - diversity) * relevance - diversity * max_sim_to_selected
            if mmr_score > best_score:
                best_score = mmr_score
                best_idx = i

        if best_idx >= 0:
            selected.append(candidates[best_idx])
            selected_embeddings.append(candidates[best_idx][2])

    return [self.memories[mem_id] for mem_id, _, _ in selected]

The Memory Hierarchy Pattern

In practice, you'll use all three memory types together. Here's the pattern:

User query arrives
    |
    v
[1] Check working memory (current context)
    - Is the answer already in the conversation?
    |
    v
[2] Check short-term memory (session history)
    - Was this discussed earlier in the session?
    |
    v
[3] Check long-term memory (persistent store)
    - Do we have stored knowledge about this?
    |
    v
[4] Use tools to fetch new information
    - Query databases, search the web, etc.
    |
    v
[5] Store important new facts in long-term memory

class HierarchicalMemory:
    """Unified memory system combining all three tiers."""

    def __init__(self, client, embedding_client):
        self.working = ContextWindowManager(max_tokens=120000)
        self.short_term = SummaryChainMemory(client, recent_window=15)
        self.long_term = LongTermMemory(embedding_client)

    async def recall(self, query: str) -> dict:
        """Search all memory tiers for relevant information."""
        results = {
            "working_memory": self._search_working(query),
            "short_term": self._search_short_term(query),
            "long_term": await self.long_term.retrieve(query, top_k=5),
        }
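        return results

    # NOTE: the two helpers below are not defined in the original text.
    # They are minimal keyword-match placeholders (assumptions) so that
    # recall() can run; swap in whatever search fits your data.
    def _search_short_term(self, query: str) -> list[Message]:
        terms = set(query.lower().split())
        return [
            m for m in self.short_term.messages
            if terms & set(m.content.lower().split())
        ]

    def _search_working(self, query: str) -> list[Message]:
        # Working memory: the short-term matches that fit the token budget
        return self.working.fit_messages(self._search_short_term(query))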

    async def consolidate(self, message: Message):
        """Process a new message across all memory tiers."""
        # Always add to short-term
        await self.short_term.add(message)

        # Selectively store in long-term if important
        if message.importance > 0.7:
            await self.long_term.store(
                content=message.content,
                memory_type="episodic",
                importance=message.importance,
                metadata={"role": message.role, "timestamp": message.timestamp}
            )

    def build_context(self, recent_messages: list[Message], relevant_memories: list[MemoryEntry]) -> list[dict]:
        """Build the final context for the LLM call."""
        context = []

        # Add long-term memory context
        if relevant_memories:
            memory_text = "\n".join(
                f"- [{m.memory_type}] {m.content}" for m in relevant_memories
            )
            context.append({
                "role": "system",
                "content": f"Relevant information from memory:\n{memory_text}"
            })

        # Add conversation summary if available
        short_term_context = self.short_term.get_context()
        context.extend(short_term_context)

        return context

Memory-Augmented Agent: Full Implementation

Let's put it all together into a working agent that remembers across sessions:

import json
import time


class MemoryAugmentedAgent:
    """An agent that maintains memory across conversations."""

    def __init__(self, llm_client, embedding_client, user_id: str):
        self.llm = llm_client
        self.user_id = user_id
        self.memory = HierarchicalMemory(llm_client, embedding_client)
        self.system_prompt = """You are a helpful assistant with access to memory
        of past conversations. Use the provided memory context to personalize
        your responses and maintain continuity across sessions."""

    async def chat(self, user_message: str) -> str:
        # Step 1: Recall relevant memories
        memories = await self.memory.recall(user_message)
        relevant = memories["long_term"]

        # Step 2: Build context with memory
        context = self.memory.build_context(
            recent_messages=self.memory.short_term.messages,
            relevant_memories=relevant
        )

        # Step 3: Add system prompt and user message
        messages = [{"role": "system", "content": self.system_prompt}]
        messages.extend(context)
        messages.append({"role": "user", "content": user_message})

        # Step 4: Generate response
        response = await self.llm.chat.completions.create(
            model="gpt-4o",
            messages=messages
        )
        assistant_message = response.choices[0].message.content

        # Step 5: Store both messages in memory
        user_msg = Message(
            role="user", content=user_message,
            timestamp=time.time(), importance=0.5
        )
        assistant_msg = Message(
            role="assistant", content=assistant_message,
            timestamp=time.time(), importance=0.5
        )

        await self.memory.consolidate(user_msg)
        await self.memory.consolidate(assistant_msg)

        # Step 6: Extract and store important facts
        await self._extract_important_facts(user_message, assistant_message)

        return assistant_message

    async def _extract_important_facts(self, user_msg: str, assistant_msg: str):
        """Use LLM to identify facts worth remembering long-term."""
        response = await self.llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Analyze this exchange and extract facts worth remembering
                long-term about the user (preferences, personal details, goals,
                important decisions). Return JSON:
                {{"facts": [{{"content": "...", "importance": 0.0-1.0, "type": "preference|personal|goal|decision"}}]}}
                If nothing is worth remembering, return {{"facts": []}}

                User: {user_msg}
                Assistant: {assistant_msg}"""
            }],
            response_format={"type": "json_object"}
        )
        extracted = json.loads(response.choices[0].message.content)
        for fact in extracted.get("facts", []):
            if fact["importance"] >= 0.6:
                await self.memory.long_term.store(
                    content=fact["content"],
                    memory_type="semantic",
                    importance=fact["importance"],
                    metadata={"user_id": self.user_id, "fact_type": fact["type"]}
                )

Tip

Run the fact extraction asynchronously - fire it off and don't wait for it. The user doesn't need to wait for memory consolidation before getting their response. This keeps the agent responsive while still building its knowledge base.

Memory and Privacy

Memory creates powerful capabilities, but it also creates serious obligations. You're storing personal information, potentially sensitive data, and behavioral patterns. Here's what you need to think about:

What to Store

  • User-stated preferences ("I prefer Python over JavaScript")
  • Explicit facts shared by the user ("My team has 5 engineers")
  • Task context that will be needed later
  • Decisions and their reasoning

What NOT to Store

  • Sensitive personal data (SSNs, passwords, financial details) unless absolutely necessary and properly encrypted
  • Health information (HIPAA implications)
  • Anything the user asks you to forget
  • Raw conversation logs beyond the retention period

Compliance Considerations

Regulation | Requirement | Implementation
GDPR (EU) | Right to be forgotten | Implement forget_user(user_id)
GDPR (EU) | Data minimization | Only store what's necessary
CCPA (California) | Right to know | Implement export_user_data(user_id)
HIPAA (Health) | Data protection | Encrypt at rest and in transit
SOC 2 | Access controls | Log all memory access, implement RBAC

import logging

logger = logging.getLogger(__name__)


class PrivacyAwareMemory(LongTermMemory):
    """Memory system with privacy controls."""

    async def forget_user(self, user_id: str) -> int:
        """GDPR right to be forgotten - delete all memories for a user."""
        to_remove = [
            mem_id for mem_id, memory in self.memories.items()
            if memory.metadata.get("user_id") == user_id
        ]
        for mem_id in to_remove:
            del self.memories[mem_id]
            del self.embeddings[mem_id]
        logger.info(f"Deleted {len(to_remove)} memories for user {user_id}")
        return len(to_remove)

    async def export_user_data(self, user_id: str) -> list[dict]:
        """CCPA right to know - export all stored data for a user."""
        return [
            {
                "content": memory.content,
                "type": memory.memory_type,
                "created_at": memory.created_at.isoformat(),
                "importance": memory.importance,
            }
            for memory in self.memories.values()
            if memory.metadata.get("user_id") == user_id
        ]

    async def store(self, content: str, **kwargs) -> str:
        """Override store to filter sensitive content."""
        import re
        # Redact common sensitive patterns before storing
        redacted = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN_REDACTED]', content)
        redacted = re.sub(r'\b\d{16}\b', '[CARD_REDACTED]', redacted)
        redacted = re.sub(r'(?i)password\s*[:=]\s*\S+', 'password: [REDACTED]', redacted)
        return await super().store(redacted, **kwargs)

Warning

Memory privacy isn't optional - in many jurisdictions it's a legal requirement. Build the delete and export functionality from day one. Retrofitting privacy controls into an existing memory system is painful and error-prone.

Common Memory Patterns

Episodic Memory

Records of specific events and interactions. "On March 15th, the user asked about deployment and I recommended Docker."

Semantic Memory

General knowledge and facts. "The user's company uses AWS. Their team follows agile methodology."

Procedural Memory

Learned procedures and workflows. "When the user says 'deploy', they mean push to the staging environment first, run tests, then promote to production."

# Procedural memory: store learned workflows
await memory.store(
    content="User's deployment workflow: 1) Push to staging 2) Run integration tests 3) Get approval from #releases channel 4) Deploy to production 5) Monitor for 30 minutes",
    memory_type="procedural",
    importance=0.9,
    metadata={"user_id": "deepak", "category": "workflow", "trigger": "deploy"}
)

Context Window Strategies for Long-Running Agents

When an agent runs for hours - processing data, monitoring systems, executing multi-step plans - context window management becomes critical. Here are battle-tested strategies:

  1. Checkpoint and Reset: Periodically summarize progress, store it in long-term memory, and start a fresh context window with just the summary.

  2. Scratchpad Pattern: Maintain a structured "scratchpad" in the system message that tracks current state, completed steps, and pending tasks.

  3. Event Sourcing: Log every significant event and reconstruct state from the event log when needed.

class AgentScratchpad:
    """Structured working memory for long-running agents."""

    def __init__(self):
        self.current_goal: str = ""
        self.completed_steps: list[str] = []
        self.pending_steps: list[str] = []
        self.key_findings: list[str] = []
        self.errors: list[str] = []

    def to_system_message(self) -> str:
        return f"""## Agent State
**Goal**: {self.current_goal}

**Completed**: {chr(10).join(f'- {s}' for s in self.completed_steps[-5:])}

**Pending**: {chr(10).join(f'- {s}' for s in self.pending_steps[:5])}

**Key findings**: {chr(10).join(f'- {f}' for f in self.key_findings[-5:])}

**Errors**: {chr(10).join(f'- {e}' for e in self.errors[-3:])}"""
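The third strategy, event sourcing, can be sketched in a similar spirit: append every significant event to a log and derive the current state by replaying it. `EventLog`, `AgentState`, and the event kinds used here are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    completed_steps: list = field(default_factory=list)
    key_findings: list = field(default_factory=list)
    errors: list = field(default_factory=list)


class EventLog:
    """Append-only log; state is derived by replaying events in order."""

    def __init__(self):
        self.events: list[dict] = []

    def record(self, kind: str, detail: str) -> None:
        self.events.append({"kind": kind, "detail": detail})

    def replay(self) -> AgentState:
        state = AgentState()
        for event in self.events:
            if event["kind"] == "step_completed":
                state.completed_steps.append(event["detail"])
            elif event["kind"] == "finding":
                state.key_findings.append(event["detail"])
            elif event["kind"] == "error":
                state.errors.append(event["detail"])
            # Unknown kinds are ignored so old logs stay replayable
        return state


log = EventLog()
log.record("step_completed", "Fetched sales data")
log.record("finding", "Q3 revenue dipped in EMEA")
log.record("error", "Timeout calling pricing API")
log.record("step_completed", "Retried pricing API")
state = log.replay()
```

Because the log is the source of truth, the context window can be reset at any point and the state rebuilt from the log, which pairs naturally with the checkpoint-and-reset strategy.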

Memory is what transforms a stateless language model into a persistent, learning agent. The techniques in this chapter - from simple sliding windows to full hierarchical memory with privacy controls - give you the building blocks to create agents that don't just answer questions, but build genuine understanding over time. Start simple with conversation buffering, add summarization when context windows get tight, and invest in long-term memory when you're ready to create agents that truly learn from experience.