all_lessons/agentic_systems/10 · memory managementlesson 10 / 25

Part IV - State, collaboration, and protocols

Memory management - session, state, long-term knowledge

Lesson 09 gave us several agents working in parallel; the moment any of them needs to remember a fact from yesterday — or even from three tool calls ago, once the transcript overflows the window — we need memory. The book's central claim is the one beginners get wrong most often: memory is not "a longer transcript." It is scoped storage with explicit write and read policies. This lesson builds that distinction from the bottom up using Google ADK's concrete vocabulary, then the LangChain/LangGraph parallels, so you can reason about where a fact lives, when it gets written, and who may read it back.

Book source
Chapter 8 - Memory Management (记忆管理); PDF outline pages 97-110. ADK Session/State/MemoryService, LangChain ConversationBufferMemory, LangGraph BaseStore, and Vertex AI Memory Bank.
The plan
Five moves. (1) Separate short-term (context) memory from long-term (persistent) memory, and see why a bigger context window does not solve persistence. (2) Make the cost concrete — work the token arithmetic that forces summarization. (3) Pin down ADK's three primitives — Session, State, MemoryService — and the user:/app:/temp: scope prefixes. (4) Climb the long-term taxonomy the book borrows from cognitive science — semantic, episodic, procedural memory — with the LangChain/LangGraph/Vertex Memory Bank tools for each. (5) Turn it into a discipline: a write rule, a read rule, and provenance, threaded through the running coding/research assistant.
Linear position
Prerequisite: Lesson 09 (multi-agent collaboration) — you have stateful tasks, possibly several roles, and the question "where does shared knowledge live?" is now unavoidable. You also rely on the prompt/context contract from lesson 02: memory is one more thing that fills the window.
New capability: Durable, scoped knowledge that survives across turns, sessions, users, and projects — written on purpose, read by relevance, and carrying its source.

1 · Two memories, because the window is temporary

The book opens with an analogy: an agent needs different kinds of memory the way a person does. Hold that loosely — the engineering split is sharper than the biology. There are exactly two regimes, and they fail for different reasons.

Short-term memory (context memory) is the working memory of the agent: the recent messages, the agent's own replies, tool-call results, and reflections from lesson 06. For an LLM agent this is the context window. It is immediate and free to read (the model already attends to it), but it is bounded and temporary: when the session ends, it is gone. The book is blunt about a tempting non-solution — "long-context" models. A larger window only enlarges short-term memory; it lets a single interaction hold more, but the content is still ephemeral, still lost at session end, and still expensive to reprocess on every turn. A 1M-token window does not give you a memory that survives until tomorrow.

Long-term memory (persistent memory) is a repository that lives outside the agent — a database, a knowledge graph, or most commonly a vector database. Facts are stored, and later retrieved by semantic similarity (semantic search) rather than exact keyword match: the agent embeds its current need into a vector, finds the nearest stored vectors, pulls the matching records back, and folds them into the short-term context for the current step. That last move is the whole trick — long-term memory is only useful insofar as the right slice of it gets promoted into the window at the right moment. This is the same machinery lesson 16 (RAG) builds out in full; memory is RAG pointed at the agent's own history.

┌─────────────────────── SHORT-TERM (context window) ───────────────────────┐ │ system prompt │ recent turns │ tool results │ reflections │ RETRIEVED mem │ └───────────────▲───────────────────────────────────────────────▲──────────┘ │ written every turn (cheap, automatic) │ promoted on demand │ │ (semantic search) │ ┌──────┴───────────┐ └── overflows? summarize / drop ──────────▶│ LONG-TERM store │ │ db / graph / vec │ │ survives session │ └──────────────────┘

So the two memories are not a hierarchy of "small then big." They are different substrates: one is the prompt the model sees this turn (temporary, attended-to, costed per token every time); the other is durable storage you must deliberately query and splice in. Confusing them is the failure mode that opens the failure-modes card below.

2 · Why the window forces a budget — worked numbers

Beginners assume that if a fact is "in the conversation" the agent will remember it. Two things break that assumption: the window is finite, and every token in it is paid for on every turn. Let's make both concrete with the running example — a coding/research assistant helping a user across a long debugging session.

Suppose the model has a 128k-token context window and our assistant has accumulated a transcript:

system + tools
~4,000 tok
per turn (user+agent+tool logs)
~2,500 tok
turns before overflow
(128k−4k)/2.5k ≈ 49
cost to re-read full ctx @ turn 49
~124k tok / turn

The cost arithmetic. By turn 49 the window is full and every further turn reprocesses ~124k input tokens. At a representative input price of $3 per million tokens, one turn's input alone costs 124{,}000 × $3 / 10^6 ≈ $0.37 — and that recurs on every subsequent turn, even though 95% of those tokens are stale tool logs the model no longer needs. Ten more turns is $3.70 spent re-reading history. This is the book's point that processing the entire context every time is "costly and inefficient."

The fix is short-term memory management: compress the old transcript so it stops growing without dropping what matters. The two standard moves the chapter names are summarization (replace 30 stale turns with a 400-token summary) and salience selection (keep the key facts, drop the chatter). Summarizing turns 1–40 into 400 tokens reclaims roughly 40 × 2{,}500 − 400 = 99{,}600 tokens — the window goes from full back to ~25% used, and per-turn input cost drops by ~80%.

Trap
Summarization is lossy on purpose, and the loss is silent. If the assistant summarizes away the exact stack trace and later needs the line number, it has no way to know it threw the fact out — the summary reads complete. The discipline: anything you might need verbatim later goes to long-term memory (durable, retrievable) before you compress it out of the window. Summarize the narrative; persist the facts.

3 · ADK's three primitives — Session, State, MemoryService

The book's most concrete contribution is Google ADK's vocabulary, because it forces the scoping question into the type system. There are three concepts and two services.

Session
One chat thread. Holds the event log (events — the full message history) and the thread's temporary data (state). Identified by id, app_name, user_id. Created/recorded/ended by SessionService.
State (session.state)
The session's scratchpad: a dict of string keys → serializable primitives (str, num, bool, list, dict). Holds user prefs, task progress, flags that steer the next step. Short-term.
MemoryService
The long-term knowledge store: a searchable repository spanning many sessions and external data. Interface BaseMemoryService with add_session_to_memory and search_memory.

Session and State are short-term; MemoryService is long-term. That single sentence is the lesson. The same backing service comes in test and production flavors — InMemorySessionService / InMemoryMemoryService lose everything on restart and are for tests; DatabaseSessionService (e.g. SQLite) and VertexAiSessionService / VertexAiRagMemoryService persist. Choosing the service is choosing your persistence and scalability story.

The append-event loop. ADK is opinionated about how state changes, and the reason is exactly the multi-agent and recovery concerns from the surrounding lessons. The flow per message: the Runner gets or creates a Session via SessionService; the agent reads the session's context (state + history) and produces a response, possibly with a state update; the Runner wraps it as an Event and calls session_service.append_event(...), which records the event and applies the state change.

message ─▶ Runner ─get/create─▶ Session ─(state+events)─▶ Agent │ response (+ state change) ▼ Session ◀─persist+record── append_event(Event) ◀── Runner

Why not just mutate session.state["x"] = 1 directly after fetching the session? The book explicitly warns against it: a direct write bypasses event handling, so the change is not recorded, not persisted, can race under concurrency, and never updates metadata like last_update_time. The two sanctioned write paths are:

Scope by key prefix. ADK encodes who shares a fact and how long it lives directly in the key name — the cleanest expression of "memory is scoped storage" in the whole chapter:

PrefixScopeLifetimeExample key
(none)This session onlyThe chat threadtask_status
user:This user, across their sessionsPersists per useruser:login_count
app:All users of the appApp-wideapp:feature_flags
temp:This turn onlyNot persisted at alltemp:validation_needed

The book's worked tool shows all four at once. A log_user_login tool, called through a ToolContext, writes state["user:login_count"] = n+1 (follows the user across sessions), state["task_status"] = "active" (this session), state["user:last_login_ts"] = now, and state["temp:validation_needed"] = True (gone after this turn). One function, four lifetimes — chosen by prefix, applied through the event flow:

def log_user_login(tool_context: ToolContext) -> dict:
    state = tool_context.state
    login_count = state.get("user:login_count", 0) + 1
    state["user:login_count"]    = login_count   # cross-session, per user
    state["task_status"]         = "active"       # this session only
    state["user:last_login_ts"]  = time.time()    # cross-session, per user
    state["temp:validation_needed"] = True        # this turn only, not persisted
    return {"status": "success", "logins": login_count}
# called via append_event, NOT by mutating session.state directly

Long-term knowledge moves through MemoryService: add_session_to_memory(session) ingests a finished conversation, and search_memory(query) retrieves relevant slices later. In production ADK recommends VertexAiRagMemoryService, which adds semantic retrieval over a RAG corpus with the familiar knobs — similarity_top_k=5, vector_distance_threshold=0.7 — i.e. "return the 5 nearest records, but only if they're closer than 0.7." Those two numbers are your precision/recall dial for memory recall, and lesson 16 explains them properly.

4 · The long-term taxonomy — semantic, episodic, procedural

Once memory is durable, "what kind of thing am I storing?" matters, because facts, experiences, and rules want different storage, retrieval, and update policies. LangChain/LangGraph borrow the cognitive-science triad, and it maps cleanly onto agent engineering:

Semantic — facts
Concrete facts and concepts: user preferences, domain knowledge. Stored as a user "profile" (a JSON doc) or a collection of facts. "The user only writes Python and English."
Episodic — experiences
Past events and how a task was done. Often realized as few-shot examples in the prompt, so the agent imitates a known-good sequence. "Last time, fixing this bug took these 4 steps."
Procedural — rules
How to perform the task: the agent's core instructions/behavior, usually living in the system prompt. Updated via the reflection technique from lesson 06. "Always run lint before proposing a diff."

Procedural memory is the interesting one because it closes a loop with reflection. The book sketches an agent that rewrites its own instructions: an update_instructions step reads the current instructions from a LangGraph BaseStore, asks the LLM to revise them in light of the latest conversation, and writes the new instructions back under a namespace key. Next run, call_model loads those evolved instructions into the prompt. The agent has learned a rule — and persisted it — without any weight update. That is exactly the bridge into lesson 11 (learning and adaptation).

Storage shape (LangGraph). Long-term memory is JSON documents organized by a namespace (like a folder) and a key (like a filename), with the standard put / get / search verbs. The namespace is usually (user_id, context) — scoping again, now in the path rather than a prefix:

store = InMemoryStore(index={"embed": embed_fn, "dims": 2})
namespace = ("my-user", "chitchat")          # (user_id, context) = scope
store.put(namespace, "a-memory", {
    "rules": ["user prefers short, direct language", "user only speaks English and Python"],
})
item  = store.get(namespace, "a-memory")     # exact fetch by key
items = store.search(namespace, query="language preference")  # semantic recall

The framework map, so you can place any tool you meet:

NeedADKLangChain / LangGraphVertex
Short-term, this threadSession + session.stateLangGraph state + checkpointer; ConversationBufferMemoryVertexAiSessionService
Long-term, persistentMemoryServiceLangGraph BaseStore / InMemoryStoreVertexAiRagMemoryService, Memory Bank
Auto-extract durable factsadd_session_to_memoryreflection → store.putMemory Bank (async Gemini extraction)

The book highlights Vertex AI Memory Bank as the managed end of this spectrum: a service that uses a Gemini model to asynchronously analyze conversation history, extract key facts and preferences, store them scoped by user ID, and intelligently reconcile contradictions as new data arrives. On a new session it recalls relevant memories (full callback or embedding similarity) for continuity. The agent's runner talks to it via VertexAiMemoryBankService, and it plugs into ADK out of the box and into LangGraph/CrewAI through the API. Note the contradiction-resolution step: that is the difference between a memory store and a memory append-only log — real long-term memory has to update and overwrite, not just accumulate.

5 · Memory as a discipline — write rule, read rule, provenance

Everything above is mechanism. The discipline the chapter insists on is that durable memory must be managed: deciding what is stored, how it is updated, and how it is retrieved. Three rules turn the mechanism into something safe to let influence future behavior.

The write rule. Do not persist everything. A durable memory should be (a) stable — not a one-off; (b) useful in the future, not just now; (c) attributable — it carries the trace it came from; and (d) safe to reuse — permissioned to the right scope. In the running example: after the assistant verifies that npm test -- api actually runs the API tests in this repo, it writes that as project memory with a source trace and an expiry condition. A command that just failed once is not promoted to a global fact.

memory_write = {
  "scope": "project",                         # not user, not global
  "fact": "Run targeted API tests with: npm test -- api",
  "source_trace": trace_id,                   # provenance — where did this come from?
  "confidence": "verified",                   # vs "inferred"/"hint"
  "expires_when": "package test script changes",
}
if stable_and_useful(memory_write) and permitted(scope):
    memory.store(memory_write)

The read rule. Retrieve by scope first, then relevance. A project fact is read when working in that project; a user preference travels with that user; a temp: flag never escapes the turn. Never let a private user fact leak into an unrelated task — that is a scoping bug with privacy consequences, not a recall miss.

Provenance and confidence. The agent must know when a memory is evidence (a verified fact it can act on) versus a hint (an inference that must be re-verified before use). Carrying confidence and source_trace on every record is what lets a downstream step decide whether to trust a recalled "fact" or to re-check it — and it is what makes memory reviewable when something goes wrong.

Failure modes

  • Window as memory. Treating a long context as long-term storage — it vanishes at session end and costs full price every turn.
  • Unverified facts. Storing an inference or a one-off result as a durable fact, then trusting it later as evidence.
  • Scope leak. Retrieving private user: facts into an unrelated task or another user's session.
  • Direct state mutation. Writing session.state[k]=v outside append_event — unrecorded, unpersisted, race-prone.
  • Silent summarization loss. Compressing away a fact (a stack trace, a constraint) you later need verbatim, with no signal it's gone.
  • No reconciliation. Append-only memory that never overwrites contradicts itself over time (the problem Memory Bank's reconcile step solves).

Implementation checklist

  • What memory types exist (semantic / episodic / procedural) and where does each live?
  • What triggers a write — and does it pass stable + useful + attributable + permissioned?
  • What scope does each fact get (none/user:/app:/temp: or namespace)?
  • Who may read each scope, and does retrieval go scope-then-relevance?
  • What's the short-term budget — when do you summarize vs. persist-then-drop?
  • How is a memory corrected, reconciled, or deleted? Does it carry provenance and confidence?

Where this points next

We can now hold knowledge across turns, sessions, users, and projects — written deliberately, scoped explicitly, retrieved by relevance, and tagged with where it came from. The procedural-memory loop in §4 already hinted at the next move: an agent that rewrote its own instructions from a conversation didn't just remember — it changed how it behaves. Lesson 11 (Learning and adaptation) makes that the subject: how an agent improves from experience using traces, feedback, and system updates (prompts, instructions, memory contents) before ever touching model weights. Memory is the substrate that learning writes to; learning is the policy that decides what's worth writing.

Takeaway
An agent has two memories. Short-term (context) memory is the window — immediate, attended-to, but temporary and paid for on every turn, so it needs a budget (summarize / select). Long-term (persistent) memory is external storage (db / vector store) retrieved by semantic similarity and folded back into context. ADK names the parts: Session + State (short-term, scoped by user:/app:/temp: prefixes, updated only through append_event) and MemoryService (long-term). Long-term knowledge splits into semantic (facts), episodic (experiences), and procedural (rules) — the last reflectively self-updating. The discipline is three rules: write only what's stable, useful, attributable, and permissioned; read by scope then relevance; carry provenance and confidence. Memory is scoped storage with policies — not a longer transcript.

Interview prompts