Paged KV: virtual memory for caches
Per-request contiguous KV buffers waste most of HBM. Treating KV like OS virtual memory — fixed-size physical pages addressed through a per-request page table — turns wasted GB back into useful capacity, at the cost of one indirection in the attention kernel.
The question this lesson answers
Lesson 02 derived that one Llama-3-70B request at 8K context owns 2.56 MB of KV. A naive engine sizes each request's KV buffer to max_context · per_token_bytes so attention can read it linearly. With max_context = 32K and 100 active sequences, that pre-allocates ~32 GB whether or not those sequences ever reach 32K — and on a 80 GB H100 that is the difference between serving 200 requests and 20. Most of the reserved bytes are never used. PagedAttention solves this by abandoning per-request contiguity.
The fragmentation problem with numbers
Suppose four requests with current lengths 800, 1500, 3000, and 5500 share an engine configured for a 32K max. Naive contiguous allocation reserves 4 × 32K = 128K tokens of KV regardless of actual use. Actual use is 800 + 1500 + 3000 + 5500 = 10,800 tokens. Internal fragmentation: 91.6%. The 91.6% is not "waste due to bad packing"; it's "waste because the engine cannot predict the final length and reserves for the worst case."
What a block table actually is
For each request, the engine maintains a small integer array — the block table — that maps logical block index (token range) to a physical block id in a global KV pool. Reading position t means:
logical_block = t // block_sizephysical_block = block_table[req][logical_block]kv_addr = pool_base + physical_block · block_size · per_token_bytes + (t % block_size) · per_token_bytes
The attention kernel does this lookup for every K and V it loads. Crucially, within a block the addresses are still contiguous, so coalesced reads work fine — the indirection only happens at block boundaries.
Block size is the central trade-off
Block size bs (tokens per page) trades three quantities:
- Tail waste: the partial last block per request. Expected loss per request ≈
bs/2tokens. Smaller bs → less tail waste. - Block-table size: entries per request ≈
T / bs. Smaller bs → bigger tables, more metadata moved to the GPU per step. - Kernel efficiency: within a block, addresses are contiguous; the kernel does one or two coalesced reads per block. Smaller bs → more indirection events per token attended.
Real systems usually settle at bs ∈ {8, 16, 32}. vLLM defaults to 16. The choice also has to satisfy backend constraints — e.g., some FlashInfer paths assume a specific page size and the engine will reject incompatible combinations rather than running silently slow.
| Block size | Tail waste (avg) | Table entries at T=8K | Trade-off feel |
|---|---|---|---|
| 1 | ~0 | 8192 | Maximum packing, maximum indexing overhead. |
| 16 | ~8 tokens | 512 | vLLM default; balanced. |
| 128 | ~64 tokens | 64 | Few entries, but a 4K-token request with 65 final tokens wastes a whole tail block. |
What changes inside the attention kernel
A FlashAttention-style kernel that supports paged KV adds two things to the inner loop:
# pseudo-code for the K/V tile loop inside paged FlashAttention
for j in range(num_logical_blocks):
phys = block_table[req, j] # one int lookup, in shared mem
K_tile = load(pool_K + phys * bs * Kd_bytes) # contiguous load within block
V_tile = load(pool_V + phys * bs * Kd_bytes)
# … exact same online-softmax math as before …
The block_table is loaded once per request into SMEM and re-used across heads. The cost of indirection is one extra instruction per block, dwarfed by the actual K/V read. Coalescing inside a block is preserved. The kernel does not have to support arbitrary scatter — only block-granular indirection.
The block manager's API
The block manager owns four operations. The attention kernel sees the results of these but never calls them directly:
| Operation | What it does | When called |
|---|---|---|
allocate(req, k_blocks) | Reserve k free physical blocks, append to req's block table. | On admission and when a sequence extends past its last block. |
free(req) | Return req's physical blocks to the free pool. | On sequence completion, eviction, or preemption. |
share(req_a, blocks, req_b) | Increment refcount on the listed blocks; point req_b's table at them. | On prefix-cache hit (lesson 05). |
cow(req, block_idx) | Clone block before mutation; redirect req's table to the clone. | When a shared block is about to be written by one of its owners. |
Interactive · block size accountant
Pick block size and average sequence length. The widget shows total tail waste vs block-table memory at a fixed number of active sequences. Note how both curves are convex around bs ≈ 16.
What you can now read in a config
vLLM flags like --block-size 16, --gpu-memory-utilization 0.9, --max-num-batched-tokens N, and --max-model-len all become legible:
block_sizepicks where on the curve you sit.gpu_memory_utilizationsets how many physical blocks the pool gets.max_num_batched_tokenscaps how much KV reading the kernel does per step.max_model_lenis a per-request upper bound on logical positions, not a contiguous reservation.
What this gives you for the rest of the track
We can now address KV at arbitrary granularity without internal fragmentation. The next questions write themselves: can we share blocks across requests when prefixes match? what data structure tracks "which prefixes have which blocks"? Lesson 05 builds exactly that, on top of the block table primitive we just installed.