Chapter 30: Memory, Planning, and Long-Term Behavior
A language model responds to each prompt as if encountering it for the first time. It has no memory of previous conversations unless they are explicitly included in the current context window. This makes the model stateless: every interaction is independent, every session starts fresh.
But production AI systems need memory. A personal assistant should remember your preferences. A customer service bot should recall past interactions. A code completion tool should learn your coding style. Memory transforms a stateless text predictor into a system with continuity, identity, and the ability to improve over time.
This chapter explains how AI systems maintain memory, how memory enables long-term planning, and why persistent state fundamentally changes what these systems can do and how they behave.
Short-Term vs. Long-Term Memory
AI systems use two distinct types of memory, each with different characteristics and purposes.
Short-term memory is the context window: the tokens currently loaded into the model’s attention mechanism. This is working memory—immediately accessible but limited in size and duration.
Properties of short-term memory:
- Limited capacity: 4K to 200K tokens depending on the model (roughly 3K to 150K words)
- Perfect recall: Every token in context is instantly accessible during generation
- Ephemeral: Disappears when the conversation ends or context window fills
- Expensive: Longer context means slower inference and higher cost
The context window functions like human working memory: you can hold a few things in mind at once and process them with full attention, but you can’t remember everything you’ve ever experienced this way.
Long-term memory is external storage: databases, vector stores, and file systems that persist information across sessions and beyond context window limits.
Properties of long-term memory:
- Unbounded capacity: Can store millions of interactions, documents, and facts
- Selective retrieval: Must explicitly search for and load relevant memories
- Persistent: Survives across sessions, devices, model updates
- Cheaper at scale: Storage costs less than keeping everything in context
Long-term memory functions like human episodic and semantic memory: you can’t instantly access everything you’ve ever learned, but you can search your memory for relevant information when needed.
The memory hierarchy in AI systems:
Figure 30.1: Memory architecture in AI systems. Short-term memory (context window) provides fast, perfect recall but limited capacity. Working memory maintains current state and goals for the session. Long-term memory persists across sessions using vector databases and structured storage, requiring explicit retrieval but offering unbounded capacity.
The key engineering challenge is memory management: deciding what to keep in expensive short-term memory, what to offload to long-term storage, and when to retrieve past information.
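This tradeoff can be sketched as a toy eviction policy: a bounded context buffer spills its oldest turns into persistent storage once a token budget is exceeded. The class, the whitespace-based token counting, and the tiny budget are all illustrative assumptions, not a production design:

```python
from collections import deque

class MemoryManager:
    """Toy sketch of the short-term / long-term split: a bounded
    context buffer that evicts the oldest turns into persistent storage."""

    def __init__(self, context_budget_tokens=8):
        self.context = deque()   # short-term: what the model sees
        self.long_term = []      # long-term: everything evicted from context
        self.budget = context_budget_tokens

    def add_turn(self, text):
        self.context.append(text)
        # Evict oldest turns once the (toy) token budget is exceeded
        while sum(len(t.split()) for t in self.context) > self.budget:
            self.long_term.append(self.context.popleft())

    def visible_context(self):
        return list(self.context)

mm = MemoryManager(context_budget_tokens=6)
for turn in ["hello there", "book a flight", "prefer aisle seats", "to Vancouver"]:
    mm.add_turn(turn)
# Oldest turns are now in long_term; recent turns remain in context
```

A real system would evict by summarizing rather than dropping verbatim, but the shape of the decision, what stays hot versus what gets offloaded, is the same.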
Memory Types: Episodic and Semantic
Long-term memory systems distinguish between two types of memory, borrowed from cognitive science:
Episodic memory stores specific events and experiences: “What happened when?”
Examples:
- “User asked about vacation policy on March 15”
- “Flight booking failed due to payment error at 3:42pm”
- “User prefers morning meetings, dislikes Zoom”
Episodic memories are time-stamped, context-specific, and personal. They answer questions like “What did I tell you about my schedule?” or “What happened last time I tried this?”
Semantic memory stores general knowledge and patterns: “What is true in general?”
Examples:
- “User works in software engineering”
- “Python uses indentation for blocks”
- “Emails should have subject lines”
Semantic memories are timeless facts and rules. They answer questions like “What do you know about me?” or “How does this work?”
In practice, AI systems blur these categories. A memory might be “User prefers ‘async/await’ over Promises in JavaScript” (semantic: a general preference) but was learned from “User rewrote three functions to use async/await on Oct 10” (episodic: specific events).
Memory storage formats:
Conversation logs (episodic): Store full transcripts of past interactions.
{
"timestamp": "2024-03-15T10:30:00Z",
"user": "What's the vacation policy?",
"assistant": "You have 15 days per year, accruing at 1.25 days per month...",
"context": ["discussing benefits", "onboarding"]
}
Fact extraction (semantic): Parse conversations into structured facts.
{
"type": "user_preference",
"fact": "prefers morning meetings",
"confidence": 0.9,
"source": "conversation on 2024-03-10",
"category": "scheduling"
}
Embedding-based (both): Store memories as vectors for semantic search.
User said: "I hate long emails"
→ Embedded as vector
→ Later query: "How should I format this email?"
→ Retrieve relevant memory: "User prefers brief emails"
Production systems often combine all three: conversation logs for exact recall, facts for structured queries, embeddings for semantic retrieval.
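As a sketch of how one interaction can feed all three formats, the helpers below produce an episodic log entry, a structured fact, and a bag-of-words vector standing in for a real embedding. The field names and the `Counter`-based "embedding" are illustrative assumptions, not a production schema:

```python
import re
from collections import Counter

def to_log_entry(timestamp, user, assistant):
    """Episodic: full transcript for exact recall."""
    return {"timestamp": timestamp, "user": user, "assistant": assistant}

def to_fact(fact, source, category, confidence=0.9):
    """Semantic: structured fact for queries like 'all scheduling preferences'."""
    return {"type": "user_preference", "fact": fact,
            "confidence": confidence, "source": source, "category": category}

def to_vector(text):
    """Toy embedding: word counts; a real system calls an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

log = to_log_entry("2024-03-15T10:30:00Z", "I hate long emails", "Noted!")
fact = to_fact("prefers brief emails", "conversation on 2024-03-15", "communication")
vec = to_vector(log["user"])
```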
Memory Retrieval: Finding Relevant Context
The challenge of long-term memory is retrieval: given a current query or task, which past memories are relevant?
Unlike short-term memory (where everything in context is equally accessible), long-term memory requires search. The system must decide what to fetch from storage and load into the context window.
Retrieval strategies:
Recency-based: Fetch the N most recent memories.
- Simple, fast, often effective (recent context is often relevant)
- Fails when relevant information is old (first meeting, initial preferences)
Keyword matching: Search for memories containing specific words or phrases.
- Works for explicit references (“What did I say about Python?”)
- Fails for semantic similarity (query “code style” won’t match memory “formatting preferences”)
Semantic search: Embed query and memories, retrieve by cosine similarity.
- Captures meaning beyond exact words
- Used in most production systems (via vector databases, Chapter 27)
- Query: “What time should we meet?” retrieves memory “User prefers 9am meetings”
Structured queries: Search extracted facts by category or type.
- “Get all user preferences related to scheduling”
- Fast for specific lookups, requires upfront fact extraction
Hybrid retrieval: Combine multiple strategies.
- Fetch recent memories (recency)
- Fetch semantically similar memories (embedding search)
- Fetch explicit references (keyword match)
- Merge and rank results
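A minimal sketch of that merge-and-rank step, with a bag-of-words vector in place of a real embedding model. The weights, the 1/(1+age) recency curve, and the memory tuples are illustrative assumptions:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real system would call an embedding model
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query, memories, k=2, w_sem=0.6, w_rec=0.3, w_kw=0.1):
    """memories: list of (age_in_days, text). Scores each memory by
    semantic similarity, recency, and keyword overlap, then merges."""
    q = embed(query)
    q_words = set(q)
    scored = []
    for age, text in memories:
        sem = cosine(q, embed(text))                     # semantic similarity
        rec = 1.0 / (1.0 + age)                          # newer -> higher
        kw = len(q_words & set(embed(text))) / max(len(q_words), 1)
        scored.append((w_sem * sem + w_rec * rec + w_kw * kw, text))
    return [t for _, t in sorted(scored, reverse=True)[:k]]

memories = [
    (400, "user traveled to mlconf in boston"),
    (5, "conference mentioned neurips in vancouver"),
    (30, "user prefers aisle seats"),
]
top = hybrid_retrieve("book a flight for next month's conference", memories)
```

With these weights, the recent, semantically overlapping conference memory outranks the year-old trip, which is the behavior hybrid retrieval is meant to produce.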
Retrieval in action:
Current conversation:
User: "I need to book a flight for next month's conference"
Retrieval process:
1. Semantic search: "booking flights" + "conferences"
→ Finds memory: "User traveled to MLConf 2023 in Boston"
2. Fact lookup: category = "travel_preferences"
→ Finds: "User prefers aisle seats", "User has TSA PreCheck"
3. Recency filter: last 30 days
→ Finds: "Conference mentioned: NeurIPS 2024, Vancouver, Dec 10-16"
4. Load into context:
- "User attending NeurIPS in Vancouver Dec 10-16"
- "Travel preferences: aisle seat, TSA PreCheck"
- "Previous conference: MLConf Boston"
Agent response:
"I'll help book your flight to Vancouver for NeurIPS (Dec 10-16). Based on your preferences, I'll look for aisle seats and include your TSA PreCheck number. Should I search for similar dates as your MLConf trip (arrive day before, leave day after)?"
The agent retrieved relevant memories and incorporated them into planning, making the interaction feel continuous and personalized rather than starting from scratch.
Retrieval challenges:
- Cold start: New users have no memory to retrieve
- Noise: Irrelevant memories retrieved alongside relevant ones
- Staleness: Old memories may be outdated (user changed preferences)
- Privacy: Retrieving sensitive information requires access control
- Cost: Every retrieval is a database query and embedding computation
Planning with Memory: Learning from Experience
Memory enables a powerful capability: learning from past actions to improve future planning. An agent can recall what worked, what failed, and what it learned, then apply this experience to new situations.
Example: Code debugging agent with memory
First interaction:
User: This test fails with "undefined is not a function"
Agent: Let me check the code...
[Debugging process, finds missing import]
Agent: The issue was a missing import statement.
Memory stored:
"Error pattern: 'undefined is not a function' → likely missing import or typo"
Later interaction with different user:
User: Getting "undefined is not a function" in my React component
Agent: [Retrieves memory of similar error pattern]
Agent: This error typically indicates a missing import. Let me check your imports first...
[Quickly identifies problem]
Memory updated:
"Error pattern confirmed in React context: check imports for components and hooks"
The agent learned from experience. The second debugging session was faster because the agent recalled the pattern from the first. This is not model fine-tuning (the model weights didn’t change)—it’s in-context learning through memory.
Planning with memory architecture:
Figure 30.2: Planning with long-term memory. When given a task, the agent retrieves relevant past experiences, uses them to inform planning, executes the plan, and stores the outcome as new memories. This creates a learning loop where each task improves future performance through accumulated experience.
This architecture enables several powerful behaviors:
Pattern recognition: “I’ve seen this type of problem before, here’s what worked”
Mistake avoidance: “Last time I tried approach X it failed because Y, so I’ll try Z instead”
Strategy reuse: “This task is similar to task T, I’ll adapt that successful plan”
Preference adaptation: “User corrected me twice on formatting, I’ll remember their preference”
Context efficiency: “I don’t need to ask questions I already know the answers to from past conversations”
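The retrieve–plan–execute–store loop itself fits in a few lines. In this sketch, `plan_fn` and `execute_fn` stand in for LLM planning and tool execution, and the memory-record schema is an illustrative assumption:

```python
def run_task(task, memory, plan_fn, execute_fn):
    """One pass of the retrieve -> plan -> execute -> store loop."""
    experience = [m for m in memory if m["task_type"] == task["type"]]
    plan = plan_fn(task, experience)       # planning informed by past outcomes
    outcome = execute_fn(plan)
    memory.append({"task_type": task["type"], "plan": plan, "outcome": outcome})
    return plan, outcome

memory = []

def plan_fn(task, experience):
    # Reuse the most recent successful plan if one exists, else start fresh
    successes = [m["plan"] for m in experience if m["outcome"] == "success"]
    return successes[-1] if successes else "diagnose from scratch, then check imports"

def execute_fn(plan):
    return "success"  # toy executor: pretend the plan worked

first_plan, _ = run_task({"type": "debug_undefined_fn"}, memory, plan_fn, execute_fn)
second_plan, _ = run_task({"type": "debug_undefined_fn"}, memory, plan_fn, execute_fn)
# The second run retrieves and reuses the stored plan instead of re-deriving it
```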
Alignment and Identity: How Memory Changes Behavior
Persistent memory has a profound effect: it changes not just what the system knows but who it becomes. Memory creates continuity, identity, and alignment to specific users or contexts.
Identity through memory. A system with memory develops a consistent persona:
- Remembers past statements and maintains consistency
- Recalls commitments and follows through
- Builds on previous conversations rather than starting fresh
- Exhibits preferences learned from interactions
This creates the illusion (or reality?) of continuity of self. The system behaves as if it is the same entity across sessions because it has access to its own history.
Alignment through adaptation. Memory enables systems to adapt to individual users:
- Learn user communication style (formal vs. casual, technical vs. simple)
- Adapt to user preferences (level of detail, explanation style)
- Remember user feedback (“last time you said X was too verbose”)
- Build user-specific knowledge (“you work on the authentication service”)
This creates personalization: the system becomes aligned with the specific user’s needs and preferences without model fine-tuning.
Challenges of persistent memory:
Privacy: Memory stores sensitive information. Who can access it? How long is it kept? Can users delete their memories?
Bias accumulation: If the system learns from every interaction, harmful patterns can accumulate. A customer service bot might learn to be ruder if customers are rude, creating a negative feedback loop.
Staleness: Old memories may become incorrect. User preferences change, facts become outdated, past advice may no longer apply.
Context dependence: Memories from one context may not apply to another. The system must recognize when past experience is relevant vs. when the situation is fundamentally different.
Forgetting mechanisms: Production systems need ways to manage memory:
- Time-based decay: Old memories fade or are archived
- Relevance filtering: Rarely accessed memories are demoted
- Explicit deletion: Users can remove specific memories
- Summary and compression: Detailed memories are compressed into general patterns over time
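The first two mechanisms can be combined into a single retention score, sketched below as exponential time decay boosted by access frequency. The half-life, boost factor, threshold, and record fields are illustrative assumptions:

```python
import math

def retention_score(memory, now_day, half_life_days=30.0):
    """Exponential time decay, boosted by how often the memory is retrieved."""
    age = now_day - memory["last_accessed_day"]
    decay = math.exp(-math.log(2) * age / half_life_days)  # halves every half-life
    boost = 1.0 + 0.1 * memory["access_count"]
    return decay * boost

def sweep(memories, now_day, threshold=0.2):
    """Partition memories into those kept hot and those archived."""
    keep, archive = [], []
    for m in memories:
        (keep if retention_score(m, now_day) >= threshold else archive).append(m)
    return keep, archive

mems = [
    {"fact": "prefers aisle seats", "last_accessed_day": 95, "access_count": 3},
    {"fact": "asked about 2019 offsite", "last_accessed_day": 10, "access_count": 0},
]
keep, archive = sweep(mems, now_day=100)
# The fresh, frequently used preference stays; the stale one is archived
```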
The engineering challenge is balancing continuity (remember enough to be useful) with flexibility (don’t over-fit to past patterns) and privacy (forget what should not be remembered).
Engineering Takeaway
Memory transforms stateless language models into stateful systems with identity and continuity. This transformation is fundamental: a model with memory is not just more capable—it becomes a different kind of system. This shift has several implications for production engineering:
Long-term memory enables personalization and learning. Systems with memory can adapt to individual users without model retraining. They learn preferences, recognize patterns, and improve through experience. This makes AI systems feel less like tools and more like assistants: they know you, remember your context, and build on past interactions. The value is in accumulated knowledge, not just model capability.
Memory retrieval is the critical challenge. Having memory is useless if you can’t find the relevant information when you need it. Production systems require sophisticated retrieval: semantic search for meaning, recency weighting for relevance, structured queries for facts. The quality of retrieval determines whether memory helps or adds noise. Poor retrieval is worse than no memory—irrelevant context confuses the model and wastes tokens.
Planning with memory enables continuous improvement. Agents that remember past actions can learn what works and what fails. This creates a feedback loop: try strategy, observe result, remember outcome, improve future attempts. Unlike model training (which happens once on static data), memory-based learning is continuous and context-specific. The system gets better at your specific tasks through experience, not through generic training.
Privacy is paramount and non-negotiable. Memory systems store sensitive information: user preferences, conversation history, personal facts, business data. This requires serious security: encryption at rest and in transit, access controls, audit logging, user-controlled deletion. GDPR and similar regulations apply: users must be able to see what’s stored and request deletion. Memory systems are data systems, and data systems have legal obligations.
Memory changes alignment—models adapt to interactions over time. This is powerful but dangerous. A model that learns from every interaction can absorb biases, harmful patterns, or user-specific quirks that shouldn’t generalize. Memory-based adaptation happens faster than RLHF-style alignment but with less oversight. Production systems need guardrails: filter what gets stored, review patterns that emerge, prevent accumulation of harmful behaviors.
State management is the core engineering problem. Production AI systems with memory are fundamentally about managing state: what to keep in context, what to store long-term, when to retrieve, when to forget. This requires database design, caching strategies, consistency guarantees, backup and recovery. Building stateful AI systems has more in common with building distributed databases than with training models. The challenge is state, not statistics.
AI systems with memory are no longer stateless services—they’re stateful applications. This changes deployment, testing, and maintenance. You can’t just rollback to a previous model version if the system has accumulated user-specific state. Testing requires seeding memory, not just checking input-output pairs. Debugging requires inspecting what the system remembers, not just what it generates. The system’s behavior depends on its history, making every deployment unique to its accumulated experience.
References and Further Reading
MemGPT: Towards LLMs as Operating Systems. Packer, C., Fang, V., Patil, S. G., Wooders, K., & Gonzalez, J. E. (2023). arXiv:2310.08560
Why it matters: This paper introduced the analogy between operating system memory hierarchies (registers, cache, RAM, disk) and LLM memory systems (context window, working memory, long-term storage). MemGPT demonstrated how to manage context overflow by intelligently paging information in and out of the context window, similar to virtual memory in OS design. This architecture enables agents to operate indefinitely by treating the context window as a cache for a much larger memory space, solving one of the fundamental scalability challenges of long-running agents.
Memory Networks. Weston, J., Chopra, S., & Bordes, A. (2015). ICLR 2015
Why it matters: While predating modern LLMs, this foundational work introduced the idea of neural networks with explicit external memory that can be read from and written to. Memory Networks showed that models could learn to store facts in memory and retrieve them when needed, rather than encoding everything in weights. This separation of computation (the model) from storage (the memory) influenced modern RAG systems and demonstrated that retrieval-augmented architectures could outperform purely parametric models on knowledge-intensive tasks.
Generative Agents: Interactive Simulacra of Human Behavior. Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). UIST 2023
Why it matters: This work demonstrated agents with rich memory systems operating in a simulated environment over extended periods. Each agent maintained three types of memory: observations (what happened), reflections (higher-level insights), and plans (future intentions). Agents retrieved relevant memories through recency, importance, and relevance scoring. The system showed how memory enables coherent long-term behavior, social interaction, and emergent phenomena like coordinated planning. It revealed both the power of memory (enabling complex behavior) and its challenges (managing growth, ensuring relevance, maintaining consistency).
With memory and planning, we complete the picture of modern AI systems: models (Part 5) augmented with prompting (Ch. 26), retrieval (Ch. 27), tools (Ch. 28), agency (Ch. 29), and memory (Ch. 30). These components compose into systems that perceive, reason, act, remember, and improve over time—the fundamental capabilities required for AI to be useful in the real world.
Part 7 will examine Engineering Reality: how these systems fail, how to evaluate them, how to deploy them safely, and what challenges remain unsolved.