PagedAttention
Apply the operating system's virtual-memory playbook to the KV cache. Fragmentation collapses; beam search and prefix sharing fall out for free.
The analogy, stated up front
An operating system does not give a process one contiguous chunk of physical RAM. It hands out fixed-size pages on demand and keeps a per-process page table mapping virtual addresses to physical frames. Two virtues fall out:
- No external fragmentation. Every free page is interchangeable, so the allocator never has to refuse a request because the holes are wrong-shaped.
- Sharing is free. Two processes can map the same physical frame into their own page tables — that's how
fork()and shared libraries work. Copy-on-write covers correctness when one of them writes.
PagedAttention is the same thing one level up. Replace "process" with "sequence", "page" with "block of BLOCK_SIZE tokens of KV", and "page table" with block_table: list[int]. Every claim about OS paging carries over verbatim.
| OS | vLLM |
|---|---|
| process | sequence |
| virtual page | logical token range (BLOCK_SIZE tokens of K, V) |
| physical frame | physical KV block in the pool |
| page table | block_table: list[int] |
fork() + CoW | beam fork / prefix share + CoW |
What contiguous KV fails at
Lesson 01 ended with the bytes-per-token formula and a fragmentation picture. Concretely the contiguous-cache approach fails twice:
1. Internal fragmentation
To serve a request you don't know the length of, you pre-allocate max_seq_len slots. If the average response uses 80 tokens but you reserved 2048, then 1968/2048 ≈ 96% of every reservation is dead. Multiply by however many concurrent sequences you wanted, and most of HBM is unused.
2. No sharing
Beam search runs N parallel candidate continuations off the same prompt. Parallel sampling with n=4 does the same. Two users hitting the same system prompt are doing the same. With contiguous per-sequence buffers, the prompt's KV is copied N times. A 4096-token system prompt at LLaMA-2 7B is ~2 GB; serve it to 4 users and you've spent 8 GB to hold 4 copies of identical bytes.
(max_seq_len − actual_len) · bytes_per_token wasted to fragmentation, plus (n_shared − 1) · prefix_bytes wasted to duplication. Both are typically dominant terms.
The block manager, in 30 lines
Two objects. The pool owns the physical blocks and refcounts them; each sequence owns a block table that points into the pool.
BLOCK_SIZE = 16 # tokens per block
class BlockPool:
def __init__(self, num_blocks):
self.free = deque(range(num_blocks)) # free list
self.refcount = [0] * num_blocks
def allocate(self):
block_id = self.free.popleft()
self.refcount[block_id] = 1
return block_id
def fork(self, block_id): # share, don't copy
self.refcount[block_id] += 1
return block_id
def release(self, block_id):
self.refcount[block_id] -= 1
if self.refcount[block_id] == 0:
self.free.append(block_id)
class Sequence:
def __init__(self, pool):
self.pool = pool
self.block_table = [] # list[int] — the page table
self.length = 0
def append(self, n=1):
for _ in range(n):
if self.length % BLOCK_SIZE == 0: # tail block full
self.block_table.append(self.pool.allocate())
self.length += 1
That's the whole mechanism. The pool's free list is a deque, the refcounts are a flat array, and a sequence's "address space" is a list[int]. No tree, no allocator, no defragmentation — because fixed-size pages don't fragment.
Logical → physical, in two divisions
The attention kernel needs to read the K, V for token t of some sequence. With a contiguous cache the address was base + t · stride. With paging it's one indirection:
kt = physical_blocks[block_id][offset]
One division, one modulo, one lookup. On the GPU that's a few extra instructions and one extra HBM read per block (not per token — the block_table is small enough to live in shared memory or registers). For a 2048-token sequence at BLOCK_SIZE = 16, the block table is 128 ints — under 1 KB. This is the entire CUDA cost of paging.
Refcounts and copy-on-write
So far this just fixes fragmentation. The second win — sharing — needs one more idea. A fork doesn't copy the parent's blocks; it copies the parent's block table and bumps each block's refcount:
def fork_from(self, parent):
self.block_table = [self.pool.fork(b) for b in parent.block_table]
self.length = parent.length
Now two sequences point at the same physical blocks. That's correct as long as nobody writes. The moment a forked sequence wants to mutate a block — i.e. append a token into a block whose refcount > 1 — we need copy-on-write:
def write_to_last_block(self):
last = self.block_table[-1]
if self.pool.refcount[last] > 1: # shared — must clone before writing
fresh = self.pool.allocate()
# memcpy(physical_blocks[fresh], physical_blocks[last])
self.pool.release(last) # drop our share of the old one
self.block_table[-1] = fresh
The protocol, said once:
- Allocate a fresh block.
- Copy the shared block's contents into it.
- Decrement the shared block's refcount (you no longer point at it).
- Point your block_table entry at the fresh block.
The unforked sibling still has the original. Both are now correct, and the copy only happened for the one block being written to — not the whole prompt.
Three places this pays off
- Beam search. N beams fork from a common prompt. They share all of it. Only when each beam writes its first own-token does its tail block get cloned — and only its tail block.
- Parallel sampling (
n=N). Identical: N samples fork a shared prefix. - Prefix sharing across requests. Two users hitting the same system prompt: hash the prompt's blocks, look them up in a global table, fork instead of allocate. That's exactly what Automatic Prefix Caching (lesson 05) is.
Beam search, numerically
A 64-token prompt with BLOCK_SIZE = 16 is 4 blocks. Four beams, each appending 10 tokens:
| strategy | blocks used | where they go |
|---|---|---|
| naive duplication | 4 beams × 4 prompt blocks = 16 | full prompt copied 4× |
| paged + sharing | 4 shared prompt blocks + 4 beams × 2 new blocks = 12 | prompt shared; each beam gets fresh blocks for its generated tokens |
Roughly 1.3× fewer blocks for this small case (12 vs 16). The savings grow with prompt length and beam count: a 256-token prompt (16 blocks) shared across 4 beams is 16 vs 64 naive — a 4× saving — and for long shared prompts the ratio approaches N.
Block size: a tradeoff, not a constant
Why 16? It's a balance of two competing costs.
So smaller blocks → less wasted memory. But:
- Each block read in the attention kernel costs one indirection. Smaller blocks mean more indirections per token, more launch overhead, less coalesced HBM access.
- The block table itself grows: a 16k-token sequence with
BLOCK_SIZE = 1has a 16k-entry table — significant on its own.
Empirically on H100/A100, 16 is the sweet spot. On future GPUs with cheaper indirection or different SRAM sizes, that number will move. The point is: it's a knob, not a law.
Interactive · drive the block pool by hand
Below is the block manager from the code, made clickable. The grid is 64 physical blocks (about as much as you could fit at LLaMA-2 7B / fp16 in 512 MB — toy scale, but the dynamics are the real ones).
Try to reproduce the beam-search scenario above:
- Click new sequence, then append 64 tokens. That's the prompt — 4 blocks, refcount 1 each.
- Click fork ×3 on it to produce 4 beams total. Refcounts on those 4 blocks should be 4.
- Forks share all 4 prompt blocks (refcount goes 1→4 on each). The first append on a fork at exactly the block boundary just allocates a new private block; the shared prompt blocks stay shared. To see CoW, click write to last block on a fork — that mutates the shared tail and triggers CoW (allocate fresh, copy contents, swap pointer, decrement old refcount).
- Compare blocks used against the naive prediction of 4 beams × 4 prompt blocks = 16. With sharing you should see 12 (4 shared prompt + 8 per-beam new).
The CUDA picture, briefly
vLLM's paged_attention kernel does the gather inline with the dot product. Per attention head, per query:
for block_idx in range(num_blocks_for_this_seq):
physical_block = block_table[block_idx] // 1 HBM read of an int
K_block = phys_K[physical_block] // gather 16 × d
V_block = phys_V[physical_block]
// … same online-softmax math as FlashAttention (lesson 03)
The block_table is tiny and almost always cached. The phys_K[...] gather costs no more than the original contiguous read — both are bandwidth-bound on the same number of bytes. Paged attention is not slower than contiguous attention; it's the same kernel with a different addressing mode.
What you should walk away with
BLOCK_SIZE − 1 tail problem (negligible) and makes sharing a refcount problem (free). Beam search and prefix caching aren't features bolted on top — they're consequences of the same data structure. The CUDA cost is one indirection per block.
The next lesson is FlashAttention — a completely orthogonal optimization that attacks the compute side of the same attention op. People confuse the two constantly. They compose: vLLM ships a kernel that does Flash-style tiling over a Paged-style block layout.