Prefix reuse: hash caches and radix trees
Paged KV (lesson 04) gave us blocks we can reference by id. The next leverage point is that many real workloads send prompts whose first thousands of tokens are bit-identical. If two requests share a prefix, they should share the KV blocks computed for that prefix — no recomputation, no duplicate storage.
The question this lesson answers
Why does this matter, and why now? Because real workloads do share prefixes:
- Chat APIs send the same system prompt for every request.
- RAG sends the same retrieved document with different questions.
- Multi-turn conversations send the full prior history each turn.
- RL rollouts and tool-use agents branch from a common context.
For Llama-3-70B at a 4K shared prefix, recomputing it for every request costs ~4 GB of weight reads × all 80 layers per fresh prefill. If a fleet processes 1000 such requests, that is 1000× wasted prefill. Lesson 02's math told us how expensive this is; this lesson tells us how to skip it.
Two data structures, one idea
"Have I seen this prefix before?" is a string-search problem. Two practical answers dominate production engines:
| Structure | How it indexes | Strengths | Costs |
|---|---|---|---|
| Block-hash cache (vLLM-style) | One hash per fixed-size block of tokens; map hash → physical block. | O(1) lookup per block; cheap to integrate with paged KV; works with any token stream. | Only block-aligned reuse; a partial block prefix doesn't hit. |
| Radix tree (SGLang-style) | Tree whose edges hold variable-length token strings; nodes hold KV ranges. | Arbitrary-length and branch-aware reuse; natural for tree-structured workloads (tools, branching agents). | More bookkeeping; insert/split operations need careful concurrency control. |
Both produce the same kernel-facing artifact: a list of physical block ids the attention kernel can read directly, plus a small suffix that still needs prefill. The kernels from lessons 03 and 04 don't change.
Block-hash prefix cache
Pick a block size bs (the same one paged KV uses). For each prompt, walk it in blocks. Hash the first bs tokens to get h_0. Hash (h_0, next bs tokens) to get h_1. And so on. Hash i depends on hash i-1, so two prefixes collide if and only if they are identical token-by-token (up to hash collisions, which use 128-bit hashes to make negligible).
# pseudo-code: prefix lookup, block-hash style
def lookup_prefix(tokens, bs, cache):
h, hits = None, []
for k in range(0, len(tokens), bs):
chunk = tokens[k:k+bs]
if len(chunk) < bs: break # partial block: no hit
h = hash128((h, chunk)) # chained hash
entry = cache.get(h)
if entry is None: break # first miss ends the run
entry.refcount += 1 # share, don't copy
hits.append(entry.physical_block)
return hits # list of physical block ids
Insertion happens after prefill: every freshly computed block is registered under its hash. Eviction is usually LRU with refcounts — a block can only be evicted if no live request points at it.
Radix tree, when prompts branch
If your workload routinely has k requests sharing a long prefix and then branching, a flat hash table is fine but loses structure. A radix tree (a trie that merges single-child chains into edge labels) represents the branching directly: shared paths are stored once; divergent suffixes are siblings.
Operations on a radix cache fall into four buckets:
| Op | What happens |
|---|---|
| match | Walk from root following tokens; return matched length and the physical blocks along the path. Increment refcounts. |
| insert | Append new (token, block) pairs at the matched leaf. Split an existing edge if the new path diverges mid-edge. |
| evict | Pick the LRU leaf with zero refcount, free its blocks, remove it. Possibly merge a parent that now has a single child. |
| lock | Pin a path during in-flight read; prevents eviction mid-attention. |
Page alignment and the suffix problem
Both schemes can only reuse complete blocks. If your prompt's shared prefix length is 4096 tokens but block size is 16, the engine can reuse 4096/16 = 256 blocks cleanly. If the shared length is 4097, the 257th block has one matching token and 15 fresh ones — that block can not be shared; it has to be recomputed. The engine treats the prefix match length as "shared length rounded down to bs."
The same constraint applies on the inside of the kernel: an attention backend that requires a particular page size (lesson 04) will only accept matched blocks of that size. This is one of the silent reasons "backend X doesn't support feature Y" — the cache layout and the kernel layout have to agree.
What the engine actually sends to the kernel
After a prefix lookup, the engine produces three things per request for the next forward pass:
- The block table (lesson 04), with the prefix entries already pointing at the cached physical blocks.
- A "fresh tokens" suffix that still needs prefill (or none, if every block hit).
- A flag set telling the attention kernel which positions are "already in KV, just read me" vs "compute and append."
For the kernel this is the same code path as a fresh request — it walks the block table and the new token positions. The win is purely "fewer fresh tokens to prefill" + "no duplicate KV bytes in HBM."
Cache-aware routing
Prefix reuse only helps if a request reaches a worker that has the right cache. If your router load-balances purely on instantaneous queue depth, the second request of a chat may land on a different worker than the first and miss the cache entirely. Production routers either:
- Hash the prefix prefix (first few hundred tokens) and route by hash, so identical prefixes stick to a worker — at the cost of some load imbalance.
- Track per-worker cache state and route the request to the worker with the best-matching cache, breaking ties by load.
SGLang's gateway exposes this directly as "cache-aware routing." vLLM's KV connector and prefix-cache flags expose related primitives. The point is that the cache only pays off when routing helps it.
Interactive · prefix savings accountant
How much prefill compute disappears? Compute it. Move the sliders for shared prefix tokens, unique suffix tokens, and number of requests sharing the prefix. The widget shows total tokens prefilled with and without sharing, plus the percentage saved.
Limits and failure modes
| Failure mode | What happens | What to do |
|---|---|---|
| Cache thrashing | The working set of distinct prefixes exceeds the pool; entries are evicted before reuse. | Increase KV pool fraction; route by prefix to concentrate working set per worker. |
| Cold start | First request of a new prefix pays full prefill; subsequent hits are free. | Pre-warm with anticipated system prompts; share across replicas if your router supports it. |
| Subtle non-determinism | Identical token sequence produces "slightly different" logits across cache hit vs miss runs (numerical reordering). | Accept tiny floating-point drift, or pin to a single backend if you need exact reproducibility (RL training pairs are sensitive). |
| Sensitive prompts | Two users share a system prompt but should not share KV across security boundaries. | Partition cache by tenant; the runtime can hash a tenant id into the chain. |
What this gives you for the rest of the track
We can now compute on every byte exactly zero or one times. The remaining wins are about when work runs, not whether: bundling unrelated requests into one efficient kernel call, splitting long prompts so they don't starve decoders, and removing CPU dispatch overhead from a stable hot path. That is the next lesson — scheduling.