all_lessons / sglang / 03 · KV recap lesson 3 / 11

KV recap and the prefix-sharing problem

Paged KV solved per-request fragmentation. It did not solve inter-request redundancy. To do that we need to ask, of every block in the pool: which token sequence does it belong to, and is anyone else asking for the same one? The data structure that answers cheaply is a tree of tokens.

The three numbers you have to keep in your head

If you've read the vLLM lessons, this is review. If not, here are the three quantities every later argument leans on:

QuantityFormulaLlama-3-70B at fp16
KV bytes per token2 · n_kv_heads · head_dim · n_layers · dtype≈ 320 KB / token
Tokens per H100 (KV)(HBM · util − weights) / per_token_bytes~190K tokens (60 GB / 320 KB)
Prefill FLOPs per token2 · n_params (rule of thumb)~140 GFLOPs / token

For 32 calls each carrying a 2,500-token shared system prompt:

Both quantities collapse to one copy if the runtime can recognize the shared prefix. Memory drops 32×; prefill compute drops 32×. The question is how.

Paged KV — what we already have

Recall the paged layout from vLLM lesson 02: KV is stored in fixed-size physical blocks (16 tokens each, by default) in a global pool. Each request owns a block_table: list[int] mapping logical block index to physical block id.

request A block_table [0]→07 [1]→08 [2]→11 // tokens 0–47 of sys prompt request B block_table [0]→04 [1]→05 [2]→06 // tokens 0–47 of same sys prompt physical KV pool (after naive serve) 04 05 06 07 08 11 Blocks 04–06 and 07–08+11 hold the same KV bytes. Paging let us pack the pool, but didn't notice the duplication. Both requests prefilled the system prompt independently.

This is the central observation. Paged KV is necessary for sharing but not sufficient. It gives us blocks that can be pointed to by multiple block tables; it does not give us the index that says "block 04 contains the same KV as block 07."

vLLM's answer: hash-and-share

vLLM's Automatic Prefix Cache (APC) plugs the gap with a hashmap. The key per block is a chained hash of the token ids that produced it:

hashi = H(hashi-1, tokens[i·B : (i+1)·B])

where B is the block size. Two requests whose first kB tokens are identical produce identical hashes for blocks 0..k-1, and the cache returns the existing physical block instead of allocating a new one. This works well when:

These constraints are not flaws — they're a consequence of a flat hashmap. The data structure cannot express "this prefix is a prefix-of-a-prefix-of …" relationships, only "exact match." The next step up is a structure that does express that.

The structure we want: tokens form a tree

Take all the active prompts in your system right now and ask: do they share prefixes? In an agent-heavy workload the answer is yes, structurally. Visualize the set of all live prompts as a tree:

ROOT (∅) "You are an assistant…" 2,500 tokens · refcount = 32 "User: weather in NYC?" 85 tokens · refcount = 3 "User: what is 17 × 29?" 22 tokens · refcount = 5 "User: write a haiku…" 19 tokens · refcount = 24 answer A answer B depth = token position colors: shared / branching / unique

Each edge in this tree corresponds to a run of tokens, and the KV for those tokens lives in one set of physical blocks. The blue parent block is held by 32 live requests; its refcount is 32. Each orange child has its own KV but shares the parent's. When a request finishes, the leaf is freed and refcounts decrement up the tree.

If we can build, maintain, and query this tree in a few microseconds per request, we get:

The reframing in one sentence
Paged KV asks "where does this request's KV live?". RadixAttention asks "what token sequence does each block hold, and who else is asking for it?". The two questions are answered by different data structures over the same pool.

Why a radix tree, specifically

A plain trie has one edge per token — for a 2,500-token system prompt, 2,500 nodes. A radix tree (a.k.a. PATRICIA trie) collapses any chain of single-child nodes into one edge labeled with the whole run. The whole system prompt is one edge; only at a branch (where two requests' prompts differ) does a node appear. This is the difference between "one node per token" and "one node per branching point":

StructureNodes for the figure aboveLookup cost
Trie (per token)~2,600+O(prompt length) traversals
Radix tree (per branch)~6O(depth × longest-shared-edge / B)

For prefix sharing the radix tree wins on every axis: smaller, faster, and the data we care about (KV blocks) lives on the long edges where it should.

What still has to be true

Three constraints carry over from the underlying pool:

  1. KV is still paged. Each edge holds one or more block ids in the global pool. The radix tree is the index over those blocks; it does not own them physically.
  2. Refcounts still gate frees. A block in the pool is free-able only when no tree node references it.
  3. Attention kernel reads through the same block table. Each request still has a per-request block_table assembled by walking the tree from root to its leaf. Once assembled, the attention kernel sees the usual (logical, block_id, offset) pattern.

In other words, RadixAttention is an indexing change, not a kernel change. The attention kernel doesn't know it exists. This is what makes it cheap to ship: any paged-KV backend (FlashInfer, FlashAttention with paging, FA3) plugs in unmodified.

Interactive · what a hashmap misses

Try two near-duplicate prompts (different only at a middle segment). The widget shows how many tokens each cache type recovers.

Hashmap APC vs radix tree — what gets shared

A radix tree shares the longest common prefix, even mid-block. A hashmap only shares block-aligned exact-match prefixes.

What lesson 04 builds

Lesson 04 fills in the radix tree explicitly: node layout, insert, lookup, evict, refcount semantics, the copy-on-write rule when one of multiple sharers writes. Lesson 05 then shows how the scheduler reads the tree to order admissions so that batches are dense in shared prefixes.