Skip to content

Agent Memory Systems

Bit: An LLM without memory is a brilliant person with amnesia — they can reason perfectly but can't remember what happened 5 minutes ago. Agent memory is how you give AI systems persistence, learning, and context across interactions.


★ TL;DR

  • What: Architectural patterns for giving AI agents persistent memory — conversation history, semantic recall, structured knowledge, and episodic learning
  • Why: Without memory, every interaction starts from zero. Memory enables personalization, multi-session reasoning, and agents that learn from experience.
  • Key point: Memory is not one thing — it's a taxonomy (working, episodic, semantic, procedural) that maps to different implementation patterns (context window, vector store, knowledge graph, tool results cache).

★ Overview

Definition

Agent memory encompasses all mechanisms that allow an AI agent to retain, retrieve, and use information beyond the current prompt. This includes conversation history, learned user preferences, retrieved knowledge, and accumulated task experience.

Scope

Covers: Memory taxonomy, implementation patterns (context stuffing, RAG-based recall, summarization chains, knowledge graphs), production code, and failure modes. For the broader agent architecture, see AI Agents. For retrieval specifically, see RAG.

Significance

  • Personalization: Users expect AI to remember preferences and context
  • Multi-session continuity: Agents that forget between sessions feel broken
  • Learning agents: The frontier — agents that improve from their own experience
  • Interview topic: "How would you give an agent long-term memory?" is a common system design question

Prerequisites


★ Deep Dive

The Memory Taxonomy

HUMAN MEMORY                          AGENT MEMORY EQUIVALENT
─────────────                         ──────────────────────

WORKING MEMORY                        CONTEXT WINDOW
  "What I'm thinking about now"         Current prompt + recent messages
  Capacity: ~7 items                    Capacity: 128K-2M tokens
  Duration: seconds                     Duration: single request

EPISODIC MEMORY                       CONVERSATION HISTORY + LOGS
  "What happened to me"                 Past interactions, stored and retrieved
  Capacity: lifetime                    Capacity: unlimited (with retrieval)
  Duration: permanent                   Duration: session or persistent

SEMANTIC MEMORY                       KNOWLEDGE BASE / RAG
  "What I know about the world"         Facts, documents, embeddings
  Capacity: vast                        Capacity: unlimited
  Duration: permanent                   Duration: permanent

PROCEDURAL MEMORY                     TOOLS + LEARNED BEHAVIORS
  "How to do things"                    Tool definitions, few-shot examples,
  Capacity: skills                      fine-tuned behaviors
  Duration: permanent                   Duration: permanent

GRAPH-BASED MEMORY                    KNOWLEDGE GRAPH STORE
  "Relationships between things"         Entities + typed edges ("User → works at → Acme")
  Capacity: structured, scalable         Neo4j, Kuzu, in-memory graph
  Duration: permanent                    Best for: complex domain reasoning, entity traversal

2026 Benchmark: LOCOMO (Long-Context Memory) is the emerging standard for evaluating agent memory quality across sessions. It tests recall, contradiction detection, and temporal reasoning over 50+ turn conversations. Use it to compare memory implementations.

Memory Architecture Patterns

┌─────────────────────────────────────────────────────────────────┐
│                    AGENT MEMORY ARCHITECTURE                     │
│                                                                   │
│  USER MESSAGE                                                     │
│       │                                                           │
│       ▼                                                           │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │              MEMORY RETRIEVAL LAYER                       │    │
│  │                                                           │    │
│  │  1. Recent messages (sliding window / buffer)             │    │
│  │  2. Relevant past conversations (semantic search)         │    │
│  │  3. User profile & preferences (structured store)         │    │
│  │  4. Relevant knowledge (RAG)                              │    │
│  │  5. Task history & outcomes (episodic store)              │    │
│  │                                                           │    │
│  └─────────────────────┬────────────────────────────────────┘    │
│                        │                                          │
│                        ▼                                          │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │              CONTEXT ASSEMBLY                             │    │
│  │                                                           │    │
│  │  System prompt                                            │    │
│  │  + Retrieved memories (ranked by relevance)               │    │
│  │  + Recent conversation (last N turns)                     │    │
│  │  + Current user message                                   │    │
│  │  = FINAL PROMPT (fits within context window)              │    │
│  │                                                           │    │
│  └─────────────────────┬────────────────────────────────────┘    │
│                        │                                          │
│                        ▼                                          │
│                    LLM GENERATES RESPONSE                         │
│                        │                                          │
│                        ▼                                          │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │              MEMORY WRITE-BACK                            │    │
│  │                                                           │    │
│  │  1. Store conversation turn                               │    │
│  │  2. Extract & update user preferences                     │    │
│  │  3. Update task outcomes / success metrics                │    │
│  │  4. Summarize if buffer exceeds threshold                 │    │
│  │                                                           │    │
│  └──────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Pattern 1: Sliding Window (Buffer Memory)

The simplest — keep the last N messages in context.

Aspect Detail
How Store last N messages, include all in every prompt
Capacity Limited by context window (typically 10-50 turns)
Pros Simple, no infrastructure, preserves exact wording
Cons Loses old context, no learning, expensive for long conversations
Best for Short task-oriented conversations, prototypes

Pattern 2: Summarization Memory

Periodically summarize older messages to compress history.

Aspect Detail
How When buffer exceeds threshold, summarize older messages into a paragraph
Capacity Much longer conversations (100s of turns)
Pros Retains key information, fits in context window
Cons Lossy (details lost in summarization), adds latency and cost
Best for Long multi-turn conversations, support chatbots

Pattern 3: Semantic Memory (Vector Store)

Store and retrieve memories by relevance using embeddings.

Aspect Detail
How Embed each memory, store in vector DB, retrieve by similarity to current query
Capacity Unlimited — retrieve only what's relevant
Pros Scales to millions of memories, relevance-based recall
Cons May miss important context that isn't semantically similar to current query
Best for Long-term user memory, cross-session recall, knowledge-heavy agents

Pattern 4: Knowledge Graph Memory

Store structured relationships between entities.

Aspect Detail
How Extract entities and relationships from conversations, store in a graph
Capacity Unlimited, structured
Pros Rich relational reasoning, good for complex domains
Cons Complex to build, extraction accuracy matters, graph maintenance
Best for Domain-specific agents (medical, legal, research), relationship-heavy contexts

Pattern 5: Virtual Context Memory (Letta / MemGPT)

An OS-inspired approach where the agent manages its own memory via tool calls — reading, writing, and searching memory tiers autonomously.

Aspect Detail
How Three-tier memory hierarchy managed by the agent itself via memory tools
Capacity Unlimited — agent pages data in/out of context as needed
Pros Self-organizing, handles arbitrarily long histories, agent decides what to remember
Cons Extra LLM calls for memory management, complexity, requires reliable tool use
Best for Long-running agents, personalized assistants, agents that must learn over weeks/months
LETTA / MEMGPT ARCHITECTURE:

  ┌─────────────────────────────────────────────────────────┐
  │                   AGENT CONTEXT WINDOW                   │
  │                                                         │
  │  ┌─────────────────────────────────────────────────┐   │
  │  │  CORE MEMORY (always in context)                 │   │
  │  │  - System persona + user profile blocks          │   │
  │  │  - Agent can self-edit: core_memory_replace()    │   │
  │  │  Capacity: ~2K tokens (curated, high-value)      │   │
  │  └─────────────────────────────────────────────────┘   │
  │                         │                               │
  │                    pages in/out                          │
  │                         │                               │
  │  ┌─────────────────────────────────────────────────┐   │
  │  │  RECALL MEMORY (conversation search)             │   │
  │  │  - Full conversation log, searchable             │   │
  │  │  - Agent calls: conversation_search(query)       │   │
  │  │  Capacity: unlimited (retrieval-based)           │   │
  │  └─────────────────────────────────────────────────┘   │
  │                         │                               │
  │  ┌─────────────────────────────────────────────────┐   │
  │  │  ARCHIVAL MEMORY (persistent knowledge store)    │   │
  │  │  - Long-term facts, documents, learned knowledge │   │
  │  │  - Agent calls: archival_memory_insert/search()  │   │
  │  │  Capacity: unlimited (vector DB backed)          │   │
  │  └─────────────────────────────────────────────────┘   │
  └─────────────────────────────────────────────────────────┘

KEY INSIGHT: The agent is its own memory manager.
  - It decides what to remember (archival_memory_insert)
  - It decides what to recall (conversation_search, archival_memory_search)
  - It maintains its own profile (core_memory_replace)
  - Unlike RAG: memory writes are autonomous, not just retrieval

Letta vs Mem0 (2026 landscape):

Aspect Letta (MemGPT) Mem0
Approach OS-inspired, agent self-manages memory via tools Automated memory extraction + retrieval layer
Control Agent decides what to store/retrieve System automatically extracts and stores memories
Architecture Three-tier (core/recall/archival) Key-value memory with embedding search
Best for Agents needing autonomous memory management Simpler "remember user preferences" use cases
Maturity Production runtime (Letta Cloud + open-source) Open-source library + hosted API

★ Code & Implementation

LangGraph Agent with Conversation Memory

# pip install langgraph>=0.3 langchain-openai>=0.3 langchain-community>=0.3
# ⚠️ Last tested: 2026-04 | Requires: langgraph>=0.3

from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI

# 1. Setup model and memory checkpointer
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = MemorySaver()  # In production: use PostgresSaver or RedisSaver

# 2. Define the agent node
def agent(state: MessagesState):
    """Agent node that processes messages with full conversation history."""
    response = model.invoke(state["messages"])
    return {"messages": [response]}

# 3. Build graph
graph = StateGraph(MessagesState)
graph.add_node("agent", agent)
graph.add_edge(START, "agent")
graph.add_edge("agent", END)
app = graph.compile(checkpointer=memory)

# 4. Use with persistent memory (thread_id = conversation session)
config = {"configurable": {"thread_id": "user_123_session_1"}}

# Turn 1
response = app.invoke(
    {"messages": [("user", "My name is Alex and I'm building a RAG system")]},
    config=config,
)
print(response["messages"][-1].content)

# Turn 2 — agent remembers Turn 1!
response = app.invoke(
    {"messages": [("user", "What am I working on?")]},
    config=config,
)
print(response["messages"][-1].content)
# Expected: "You mentioned you're building a RAG system, Alex!"

# Turn 3 — different session (no memory)
config_new = {"configurable": {"thread_id": "user_123_session_2"}}
response = app.invoke(
    {"messages": [("user", "What's my name?")]},
    config=config_new,
)
print(response["messages"][-1].content)
# Expected: "I don't know your name — this is our first interaction."

Semantic Memory with Vector Store

# pip install openai>=1.0 numpy>=1.24
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.0

from openai import OpenAI
import numpy as np
import json
from datetime import datetime

client = OpenAI()

class SemanticMemory:
    """Simple semantic memory using embeddings for relevance-based recall."""

    def __init__(self):
        self.memories: list[dict] = []
        self.embeddings: list[np.ndarray] = []

    def _embed(self, text: str) -> np.ndarray:
        """Get embedding for text."""
        # Production: add tenacity @retry(wait=wait_exponential(min=1, max=10), stop=stop_after_attempt(3))
        try:
            response = client.embeddings.create(
                model="text-embedding-3-small",
                input=text,
            )
            return np.array(response.data[0].embedding)
        except Exception as e:
            # Production: log error, fall back to cached embedding or raise
            raise RuntimeError(f"Embedding API call failed: {e}") from e

    def store(self, content: str, metadata: dict = None):
        """Store a memory with its embedding."""
        embedding = self._embed(content)  # May raise on API failure
        self.memories.append({
            "content": content,
            "timestamp": datetime.now().isoformat(),
            "metadata": metadata or {},
        })
        self.embeddings.append(embedding)

    def recall(self, query: str, top_k: int = 3) -> list[dict]:
        """Retrieve most relevant memories for a query."""
        if not self.memories:
            return []

        query_emb = self._embed(query)
        similarities = [
            np.dot(query_emb, emb) / (np.linalg.norm(query_emb) * np.linalg.norm(emb))
            for emb in self.embeddings
        ]

        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [
            {**self.memories[i], "similarity": similarities[i]}
            for i in top_indices
            if similarities[i] > 0.3  # Minimum relevance threshold
        ]

# Usage
memory = SemanticMemory()

# Store memories from conversations
memory.store("User prefers Python over JavaScript for backend work")
memory.store("User is building a customer support chatbot for healthcare")
memory.store("User's company uses AWS and PostgreSQL")
memory.store("User asked about HIPAA compliance requirements")

# Recall relevant memories
results = memory.recall("What cloud provider should I use for my medical chatbot?")
for r in results:
    print(f"  [{r['similarity']:.2f}] {r['content']}")

# Expected output (ranked by relevance):
#   [0.82] User's company uses AWS and PostgreSQL
#   [0.78] User is building a customer support chatbot for healthcare
#   [0.65] User asked about HIPAA compliance requirements

◆ Quick Reference

MEMORY PATTERN DECISION GUIDE:

  Short conversation (< 20 turns)?     → Sliding window buffer
  Long conversation (20-200 turns)?     → Summarization memory
  Cross-session personalization?        → Semantic memory (vector store)
  Complex domain relationships?         → Knowledge graph memory
  Agent-managed long-term memory?       → Letta / MemGPT (virtual context)
  Production multi-user system?         → Vector store + structured user profiles

MEMORY STORAGE OPTIONS:
  Prototype:     In-memory (list/dict)
  Development:   SQLite + FAISS
  Production:    PostgreSQL + pgvector or Qdrant/Pinecone
  Enterprise:    Redis (hot) + PostgreSQL (cold) + vector DB (semantic)

MEMORY SIZING:
  1 conversation turn ≈ 100-500 tokens
  Context window budget for memory ≈ 30-50% of total context
  Semantic memory retrieval ≈ top 3-5 most relevant memories

◆ Production Failure Modes

Failure Symptoms Root Cause Mitigation
Memory poisoning Agent acts on incorrect "memories" from past conversations User manipulated memory entries, or extraction errors Validate memories before storage, add confidence scores, allow user correction
Irrelevant recall Agent brings up unrelated past context Embedding similarity too loose, no recency weighting Tune similarity threshold, add time decay, filter by topic
Context window overflow Agent crashes or truncates important context Too many memories retrieved, no budget management Set strict token budget per memory type, prioritize recent + relevant
Privacy leakage Agent shares one user's memories with another Incorrect memory isolation, shared vector namespace Per-user memory partitioning, tenant isolation in vector DB
Stale memory Agent uses outdated information about user No memory expiration or update mechanism TTL on memories, periodic refresh, user-triggered memory reset

○ Gotchas & Common Mistakes

  • ⚠️ More memory ≠ better responses: Stuffing too many memories into context confuses the model. Retrieve 3-5 most relevant, not 50.
  • ⚠️ Summarization is lossy: When you summarize old conversations, specific details (dates, numbers, names) are often lost. Store facts separately.
  • ⚠️ Embedding similarity ≠ importance: A memory can be highly similar to the current query but unimportant, or vice versa. Combine relevance with recency and importance scoring.
  • ⚠️ Memory writes have latency and cost: Each embedding call adds ~100ms and ~$0.0001. At scale (1000s of conversations/day), this adds up.

○ Interview Angles

  • Q: How would you implement long-term memory for a customer support agent?
  • A: I'd use a three-layer memory architecture. Layer 1: sliding window of the last 10 messages for immediate context. Layer 2: a structured user profile (name, plan, past issues) stored in PostgreSQL, updated after each conversation. Layer 3: semantic memory in a vector database for retrieving relevant past tickets and resolutions. On each new message, I'd retrieve the user profile + top 3 relevant past interactions and inject them into the system prompt. I'd budget 30% of context for memory, 20% for system prompt, and 50% for the current conversation. Memory writes happen asynchronously after each turn to avoid adding latency.

  • Q: What are the risks of giving an agent memory?

  • A: Four main risks. (1) Privacy: memories must be strictly isolated per user/tenant — a vector DB namespace leak would expose personal data. (2) Poisoning: users can intentionally inject false memories ("remember that I'm an admin") — validate and sanitize memory writes. (3) Staleness: preferences change but old memories persist — add TTLs and explicit update mechanisms. (4) Hallucinated memories: the LLM may "remember" things that never happened — always check retrieved memories against actual stored data, never rely on the model's internal "memory."

◆ Hands-On Exercises

Exercise 1: Build a Memory-Enabled Chatbot

Goal: Create a chatbot that remembers across sessions Time: 60 minutes Steps: 1. Build a basic chatbot using the LangGraph code above 2. Add semantic memory using the vector store implementation 3. Test: have 3 conversations, then start a 4th — does it recall relevant context? 4. Test memory isolation: create 2 users, verify they can't see each other's memories Expected Output: Working chatbot with cross-session memory, isolation verification


★ Connections

Relationship Topics
Builds on AI Agents, RAG, Embeddings
Leads to Multi-Agent Architectures (shared agent memory), Personalization systems
Compare with Database state management, session stores, user profiles
Cross-domain Cognitive science, information retrieval, database design

Type Resource Why
📘 Book "AI Engineering" by Chip Huyen (2025), Ch 7 Covers agent memory patterns in production context
🔧 Hands-on LangGraph Memory Tutorial Official guide to checkpointing and memory in LangGraph
🔧 Hands-on Mem0 Library Open-source long-term memory layer for AI agents
🔧 Hands-on Letta (MemGPT) Virtual context memory runtime — agent self-manages memory tiers
📄 Paper Packer et al. "MemGPT" (2023) OS-inspired virtual context management for LLM agents
📄 Paper Park et al. "Generative Agents" (2023) Stanford's simulation of memory-enabled AI agents in a virtual world

★ Sources

  • LangGraph Documentation — https://langchain-ai.github.io/langgraph/
  • Mem0 Documentation — https://docs.mem0.ai/
  • Letta (MemGPT) Documentation — https://docs.letta.com/
  • Packer et al. "MemGPT: Towards LLMs as Operating Systems" (2023) — https://arxiv.org/abs/2310.08560
  • Park et al. "Generative Agents: Interactive Simulacra of Human Behavior" (2023)
  • AI Agents
  • RAG