KV recap and the prefix-sharing problem

Paged KV solved per-request fragmentation. It did not solve inter-request redundancy. To do that we need to ask, of every block in the pool: which token sequence does it belong to, and is anyone else asking for the same one? The data structure that answers cheaply is a tree of tokens.

The three numbers you have to keep in your head

If you've read the vLLM lessons, this is review. If not, here are the three quantities every later argument leans on:

Quantity	Formula	Llama-3-70B at fp16
KV bytes per token	`2 · n_kv_heads · head_dim · n_layers · dtype`	≈ 320 KB / token
Tokens per H100 (KV)	`(HBM · util − weights) / per_token_bytes`	~190K tokens (60 GB / 320 KB)
Prefill FLOPs per token	`2 · n_params` (rule of thumb)	~140 GFLOPs / token

For 32 calls each carrying a 2,500-token shared system prompt:

Redundant KV (if not shared): 32 × 2,500 × 320 KB = 25 GB — a third of an H100.
Redundant FLOPs (if not skipped): 32 × 2,500 × 140 GF ≈ 11 PFLOPs — about 15 seconds on an H100 at ~700 TF/s of sustained fp16 prefill throughput.

Both quantities collapse to one copy if the runtime can recognize the shared prefix. Memory drops 32×; prefill compute drops 32×. The question is how.

Paged KV — what we already have

Recall the paged layout from vLLM lesson 02: KV is stored in fixed-size physical blocks (16 tokens each, by default) in a global pool. Each request owns a block_table: list[int] mapping logical block index to physical block id.

same sys prompt physical KV pool (after naive serve) … 04 05 06 07 08 … … 11 Blocks 04–06 and 07–08+11 hold the same KV bytes. Paging let us pack the pool, but didn't notice the duplication. Both requests prefilled the system prompt independently.

This is the central observation. Paged KV is necessary for sharing but not sufficient. It gives us blocks that can be pointed to by multiple block tables; it does not give us the index that says "block 04 contains the same KV as block 07."

vLLM's answer: hash-and-share

vLLM's Automatic Prefix Cache (APC) plugs the gap with a hashmap. The key per block is a chained hash of the token ids that produced it:

hash_i = H(hash_i-1, tokens[i·B : (i+1)·B])

where B is the block size. Two requests whose first kB tokens are identical produce identical hashes for blocks 0..k-1, and the cache returns the existing physical block instead of allocating a new one. This works well when:

Prefixes are block-aligned. If two prompts share 1,500 tokens but the boundary lands mid-block, the last partial block won't hash-match.
The shared prefix is continuous from token 0. APC only recovers prefix overlap; it cannot share a middle segment (e.g., a retrieved document that sits at position 1,500 in one request and position 2,000 in another).
Eviction policy keeps shared blocks alive long enough to match. LRU on a fully-saturated pool will lose shared blocks first if their last access was longer ago than newer per-request blocks.

These constraints are not flaws — they're a consequence of a flat hashmap. The data structure cannot express "this prefix is a prefix-of-a-prefix-of …" relationships, only "exact match." The next step up is a structure that does express that.

The structure we want: tokens form a tree

Take all the active prompts in your system right now and ask: do they share prefixes? In an agent-heavy workload the answer is yes, structurally. Visualize the set of all live prompts as a tree:

Each edge in this tree corresponds to a run of tokens, and the KV for those tokens lives in one set of physical blocks. The blue parent block is held by 32 live requests; its refcount is 32. Each orange child has its own KV but shares the parent's. When a request finishes, the leaf is freed and refcounts decrement up the tree.

If we can build, maintain, and query this tree in a few microseconds per request, we get:

Exact prefix matching at any granularity. No block-alignment constraint — the tree splits whenever two requests diverge.
Middle-of-prompt sharing. A retrieved document at any depth gets its own subtree; any future request hitting the same document and the same prior context lands on it.
Cache-aware scheduling for free. The next batch is the set of requests whose longest-prefix-match lands deepest in the tree.
Bounded memory overhead. The tree is over unique prefixes, not over requests; usually tiny relative to KV.

The reframing in one sentence

Paged KV asks "where does this request's KV live?". RadixAttention asks "what token sequence does each block hold, and who else is asking for it?". The two questions are answered by different data structures over the same pool.

Why a radix tree, specifically

A plain trie has one edge per token — for a 2,500-token system prompt, 2,500 nodes. A radix tree (a.k.a. PATRICIA trie) collapses any chain of single-child nodes into one edge labeled with the whole run. The whole system prompt is one edge; only at a branch (where two requests' prompts differ) does a node appear. This is the difference between "one node per token" and "one node per branching point":

Structure	Nodes for the figure above	Lookup cost
Trie (per token)	~2,600+	O(prompt length) traversals
Radix tree (per branch)	~6	O(depth × longest-shared-edge / B)

For prefix sharing the radix tree wins on every axis: smaller, faster, and the data we care about (KV blocks) lives on the long edges where it should.

What still has to be true

Three constraints carry over from the underlying pool:

KV is still paged. Each edge holds one or more block ids in the global pool. The radix tree is the index over those blocks; it does not own them physically.
Refcounts still gate frees. A block in the pool is free-able only when no tree node references it.
Attention kernel reads through the same block table. Each request still has a per-request block_table assembled by walking the tree from root to its leaf. Once assembled, the attention kernel sees the usual (logical, block_id, offset) pattern.

In other words, RadixAttention is an indexing change, not a kernel change. The attention kernel doesn't know it exists. This is what makes it cheap to ship: any paged-KV backend (FlashInfer, FlashAttention with paging, FA3) plugs in unmodified.

Interactive · what a hashmap misses

Try two near-duplicate prompts (different only at a middle segment). The widget shows how many tokens each cache type recovers.

What lesson 04 builds

Lesson 04 fills in the radix tree explicitly: node layout, insert, lookup, evict, refcount semantics, the copy-on-write rule when one of multiple sharers writes. Lesson 05 then shows how the scheduler reads the tree to order admissions so that batches are dense in shared prefixes.