all_lessons/gpu_kernel_serving/13 · paged KVlesson 13 / 17

Paged KV: virtual memory for caches

Per-request contiguous KV buffers waste most of HBM. Treating KV like OS virtual memory — fixed-size physical pages addressed through a per-request page table — turns wasted GB back into useful capacity, at the cost of one indirection in the attention kernel.

The question this lesson answers

Lesson 02 derived that one Llama-3-70B request at 8K context owns 2.56 MB of KV. A naive engine sizes each request's KV buffer to max_context · per_token_bytes so attention can read it linearly. With max_context = 32K and 100 active sequences, that pre-allocates ~32 GB whether or not those sequences ever reach 32K — and on a 80 GB H100 that is the difference between serving 200 requests and 20. Most of the reserved bytes are never used. PagedAttention solves this by abandoning per-request contiguity.

The fragmentation problem with numbers

Suppose four requests with current lengths 800, 1500, 3000, and 5500 share an engine configured for a 32K max. Naive contiguous allocation reserves 4 × 32K = 128K tokens of KV regardless of actual use. Actual use is 800 + 1500 + 3000 + 5500 = 10,800 tokens. Internal fragmentation: 91.6%. The 91.6% is not "waste due to bad packing"; it's "waste because the engine cannot predict the final length and reserves for the worst case."

contiguous, one buffer per request solid = used, dashed = reserved-but-unused (tail). Most of each request's slot is empty. paged: fixed-size physical blocks, allocated on demand colored squares = blocks owned by request 1/2/3/4; dashed = free pool, available to whichever request asks next. Tail waste per request is now bounded by < block_size tokens, not max_context. Trade-off: the attention kernel must follow a per-request block table instead of a single pointer.

What a block table actually is

For each request, the engine maintains a small integer array — the block table — that maps logical block index (token range) to a physical block id in a global KV pool. Reading position t means:

  1. logical_block = t // block_size
  2. physical_block = block_table[req][logical_block]
  3. kv_addr = pool_base + physical_block · block_size · per_token_bytes + (t % block_size) · per_token_bytes

The attention kernel does this lookup for every K and V it loads. Crucially, within a block the addresses are still contiguous, so coalesced reads work fine — the indirection only happens at block boundaries.

request KV (logical) tokens 0–15 (block 0) tokens 16–31 (block 1) tokens 32–47 (block 2) tokens 48–63 (block 3) tokens 64–79 (block 4) unallocated block_table int[5] [0] → 03 [1] → 04 [2] → 09 [3] → 15 [4] → 11 global KV pool (physical blocks) 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 colored cells are owned by this request (ids 03, 04, 09, 15, 11). Physically scattered, logically contiguous.

Block size is the central trade-off

Block size bs (tokens per page) trades three quantities:

Real systems usually settle at bs ∈ {8, 16, 32}. vLLM defaults to 16. The choice also has to satisfy backend constraints — e.g., some FlashInfer paths assume a specific page size and the engine will reject incompatible combinations rather than running silently slow.

Block sizeTail waste (avg)Table entries at T=8KTrade-off feel
1~08192Maximum packing, maximum indexing overhead.
16~8 tokens512vLLM default; balanced.
128~64 tokens64Few entries, but a 4K-token request with 65 final tokens wastes a whole tail block.

What changes inside the attention kernel

A FlashAttention-style kernel that supports paged KV adds two things to the inner loop:

# pseudo-code for the K/V tile loop inside paged FlashAttention
for j in range(num_logical_blocks):
    phys = block_table[req, j]                    # one int lookup, in shared mem
    K_tile = load(pool_K + phys * bs * Kd_bytes)  # contiguous load within block
    V_tile = load(pool_V + phys * bs * Kd_bytes)
    # … exact same online-softmax math as before …

The block_table is loaded once per request into SMEM and re-used across heads. The cost of indirection is one extra instruction per block, dwarfed by the actual K/V read. Coalescing inside a block is preserved. The kernel does not have to support arbitrary scatter — only block-granular indirection.

Sharing and copy-on-write
Because allocation is per-block, the block manager can let two requests share the same physical block when their KV is identical (e.g., a shared prompt prefix — lesson 05). If one request later diverges, the block manager performs a copy-on-write: clone the shared block into a new physical id, point that request's table at the new block, and continue. This is the foundation prefix reuse builds on.

The block manager's API

The block manager owns four operations. The attention kernel sees the results of these but never calls them directly:

OperationWhat it doesWhen called
allocate(req, k_blocks)Reserve k free physical blocks, append to req's block table.On admission and when a sequence extends past its last block.
free(req)Return req's physical blocks to the free pool.On sequence completion, eviction, or preemption.
share(req_a, blocks, req_b)Increment refcount on the listed blocks; point req_b's table at them.On prefix-cache hit (lesson 05).
cow(req, block_idx)Clone block before mutation; redirect req's table to the clone.When a shared block is about to be written by one of its owners.

Interactive · block size accountant

Pick block size and average sequence length. The widget shows total tail waste vs block-table memory at a fixed number of active sequences. Note how both curves are convex around bs ≈ 16.

Paged KV block size trade-off

Small blocks: less tail, more table. Large blocks: less table, more tail. The right answer also depends on backend constraints.

What you can now read in a config

vLLM flags like --block-size 16, --gpu-memory-utilization 0.9, --max-num-batched-tokens N, and --max-model-len all become legible:

What this gives you for the rest of the track

We can now address KV at arbitrary granularity without internal fragmentation. The next questions write themselves: can we share blocks across requests when prefixes match? what data structure tracks "which prefixes have which blocks"? Lesson 05 builds exactly that, on top of the block table primitive we just installed.