Paged KV: virtual memory for caches

Per-request contiguous KV buffers waste most of HBM. Treating KV like OS virtual memory — fixed-size physical pages addressed through a per-request page table — turns wasted GB back into useful capacity, at the cost of one indirection in the attention kernel.

The question this lesson answers

Lesson 02 derived that one Llama-3-70B request at 8K context owns 2.56 MB of KV. A naive engine sizes each request's KV buffer to max_context · per_token_bytes so attention can read it linearly. With max_context = 32K and 100 active sequences, that pre-allocates ~32 GB whether or not those sequences ever reach 32K — and on a 80 GB H100 that is the difference between serving 200 requests and 20. Most of the reserved bytes are never used. PagedAttention solves this by abandoning per-request contiguity.

The fragmentation problem with numbers

Suppose four requests with current lengths 800, 1500, 3000, and 5500 share an engine configured for a 32K max. Naive contiguous allocation reserves 4 × 32K = 128K tokens of KV regardless of actual use. Actual use is 800 + 1500 + 3000 + 5500 = 10,800 tokens. Internal fragmentation: 91.6%. The 91.6% is not "waste due to bad packing"; it's "waste because the engine cannot predict the final length and reserves for the worst case."

block table instead of a single pointer.

What a block table actually is

For each request, the engine maintains a small integer array — the block table — that maps logical block index (token range) to a physical block id in a global KV pool. Reading position t means:

logical_block = t // block_size
physical_block = block_table[req][logical_block]
kv_addr = pool_base + physical_block · block_size · per_token_bytes + (t % block_size) · per_token_bytes

The attention kernel does this lookup for every K and V it loads. Crucially, within a block the addresses are still contiguous, so coalesced reads work fine — the indirection only happens at block boundaries.

Block size is the central trade-off

Block size bs (tokens per page) trades three quantities:

Tail waste: the partial last block per request. Expected loss per request ≈ bs/2 tokens. Smaller bs → less tail waste.
Block-table size: entries per request ≈ T / bs. Smaller bs → bigger tables, more metadata moved to the GPU per step.
Kernel efficiency: within a block, addresses are contiguous; the kernel does one or two coalesced reads per block. Smaller bs → more indirection events per token attended.

Real systems usually settle at bs ∈ {8, 16, 32}. vLLM defaults to 16. The choice also has to satisfy backend constraints — e.g., some FlashInfer paths assume a specific page size and the engine will reject incompatible combinations rather than running silently slow.

Block size	Tail waste (avg)	Table entries at T=8K	Trade-off feel
1	~0	8192	Maximum packing, maximum indexing overhead.
16	~8 tokens	512	vLLM default; balanced.
128	~64 tokens	64	Few entries, but a 4K-token request with 65 final tokens wastes a whole tail block.

What changes inside the attention kernel

A FlashAttention-style kernel that supports paged KV adds two things to the inner loop:

# pseudo-code for the K/V tile loop inside paged FlashAttention
for j in range(num_logical_blocks):
    phys = block_table[req, j]                    # one int lookup, in shared mem
    K_tile = load(pool_K + phys * bs * Kd_bytes)  # contiguous load within block
    V_tile = load(pool_V + phys * bs * Kd_bytes)
    # … exact same online-softmax math as before …

The block_table is loaded once per request into SMEM and re-used across heads. The cost of indirection is one extra instruction per block, dwarfed by the actual K/V read. Coalescing inside a block is preserved. The kernel does not have to support arbitrary scatter — only block-granular indirection.

Sharing and copy-on-write

Because allocation is per-block, the block manager can let two requests share the same physical block when their KV is identical (e.g., a shared prompt prefix — lesson 05). If one request later diverges, the block manager performs a copy-on-write: clone the shared block into a new physical id, point that request's table at the new block, and continue. This is the foundation prefix reuse builds on.

The block manager's API

The block manager owns four operations. The attention kernel sees the results of these but never calls them directly:

Operation	What it does	When called
`allocate(req, k_blocks)`	Reserve k free physical blocks, append to req's block table.	On admission and when a sequence extends past its last block.
`free(req)`	Return req's physical blocks to the free pool.	On sequence completion, eviction, or preemption.
`share(req_a, blocks, req_b)`	Increment refcount on the listed blocks; point req_b's table at them.	On prefix-cache hit (lesson 05).
`cow(req, block_idx)`	Clone block before mutation; redirect req's table to the clone.	When a shared block is about to be written by one of its owners.

Interactive · block size accountant

Pick block size and average sequence length. The widget shows total tail waste vs block-table memory at a fixed number of active sequences. Note how both curves are convex around bs ≈ 16.

What you can now read in a config

vLLM flags like --block-size 16, --gpu-memory-utilization 0.9, --max-num-batched-tokens N, and --max-model-len all become legible:

block_size picks where on the curve you sit.
gpu_memory_utilization sets how many physical blocks the pool gets.
max_num_batched_tokens caps how much KV reading the kernel does per step.
max_model_len is a per-request upper bound on logical positions, not a contiguous reservation.

What this gives you for the rest of the track

We can now address KV at arbitrary granularity without internal fragmentation. The next questions write themselves: can we share blocks across requests when prefixes match? what data structure tracks "which prefixes have which blocks"? Lesson 05 builds exactly that, on top of the block table primitive we just installed.