PagedAttention

Apply the operating system's virtual-memory playbook to the KV cache. Fragmentation collapses; beam search and prefix sharing fall out for free.

The analogy, stated up front

An operating system does not give a process one contiguous chunk of physical RAM. It hands out fixed-size pages on demand and keeps a per-process page table mapping virtual addresses to physical frames. Two virtues fall out:

No external fragmentation. Every free page is interchangeable, so the allocator never has to refuse a request because the holes are wrong-shaped.
Sharing is free. Two processes can map the same physical frame into their own page tables — that's how fork() and shared libraries work. Copy-on-write covers correctness when one of them writes.

PagedAttention is the same thing one level up. Replace "process" with "sequence", "page" with "block of BLOCK_SIZE tokens of KV", and "page table" with block_table: list[int]. Every claim about OS paging carries over verbatim.

Mapping

OS	vLLM
process	sequence
virtual page	logical token range (BLOCK_SIZE tokens of K, V)
physical frame	physical KV block in the pool
page table	`block_table: list[int]`
`fork()` + CoW	beam fork / prefix share + CoW

What contiguous KV fails at

Lesson 01 ended with the bytes-per-token formula and a fragmentation picture. Concretely the contiguous-cache approach fails twice:

1. Internal fragmentation

To serve a request you don't know the length of, you pre-allocate max_seq_len slots. If the average response uses 80 tokens but you reserved 2048, then 1968/2048 ≈ 96% of every reservation is dead. Multiply by however many concurrent sequences you wanted, and most of HBM is unused.

2. No sharing

Beam search runs N parallel candidate continuations off the same prompt. Parallel sampling with n=4 does the same. Two users hitting the same system prompt are doing the same. With contiguous per-sequence buffers, the prompt's KV is copied N times. A 4096-token system prompt at LLaMA-2 7B is ~2 GB; serve it to 4 users and you've spent 8 GB to hold 4 copies of identical bytes.

The cost in one line

Per sequence: up to (max_seq_len − actual_len) · bytes_per_token wasted to fragmentation, plus (n_shared − 1) · prefix_bytes wasted to duplication. Both are typically dominant terms.

The block manager, in 30 lines

Two objects. The pool owns the physical blocks and refcounts them; each sequence owns a block table that points into the pool.

BLOCK_SIZE = 16  # tokens per block

class BlockPool:
    def __init__(self, num_blocks):
        self.free     = deque(range(num_blocks))   # free list
        self.refcount = [0] * num_blocks

    def allocate(self):
        block_id = self.free.popleft()
        self.refcount[block_id] = 1
        return block_id

    def fork(self, block_id):                      # share, don't copy
        self.refcount[block_id] += 1
        return block_id

    def release(self, block_id):
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            self.free.append(block_id)

class Sequence:
    def __init__(self, pool):
        self.pool        = pool
        self.block_table = []        # list[int] — the page table
        self.length      = 0

    def append(self, n=1):
        for _ in range(n):
            if self.length % BLOCK_SIZE == 0:               # tail block full
                self.block_table.append(self.pool.allocate())
            self.length += 1

That's the whole mechanism. The pool's free list is a deque, the refcounts are a flat array, and a sequence's "address space" is a list[int]. No tree, no allocator, no defragmentation — because fixed-size pages don't fragment.

Logical → physical, in two divisions

The attention kernel needs to read the K, V for token t of some sequence. With a contiguous cache the address was base + t · stride. With paging it's one indirection:

block_id = block_table[t // BLOCK_SIZE] offset = t mod BLOCK_SIZE
k_t = physical_blocks[block_id][offset]

One division, one modulo, one lookup. On the GPU that's a few extra instructions and one extra HBM read per block (not per token — the block_table is small enough to live in shared memory or registers). For a 2048-token sequence at BLOCK_SIZE = 16, the block table is 128 ints — under 1 KB. This is the entire CUDA cost of paging.

Why this isn't slower

Attention is memory-bound (lesson 01, consequence B). The kernel was already waiting on HBM to deliver K and V. A single extra indirection per block — amortized over 16 token reads — is below the noise floor. vLLM measured the overhead at ~0–3%, and the win from fitting 4–10× more sequences in HBM dwarfs it.

Refcounts and copy-on-write

So far this just fixes fragmentation. The second win — sharing — needs one more idea. A fork doesn't copy the parent's blocks; it copies the parent's block table and bumps each block's refcount:

def fork_from(self, parent):
    self.block_table = [self.pool.fork(b) for b in parent.block_table]
    self.length      = parent.length

Now two sequences point at the same physical blocks. That's correct as long as nobody writes. The moment a forked sequence wants to mutate a block — i.e. append a token into a block whose refcount > 1 — we need copy-on-write:

def write_to_last_block(self):
    last = self.block_table[-1]
    if self.pool.refcount[last] > 1:        # shared — must clone before writing
        fresh = self.pool.allocate()
        # memcpy(physical_blocks[fresh], physical_blocks[last])
        self.pool.release(last)             # drop our share of the old one
        self.block_table[-1] = fresh

The protocol, said once:

Allocate a fresh block.
Copy the shared block's contents into it.
Decrement the shared block's refcount (you no longer point at it).
Point your block_table entry at the fresh block.

The unforked sibling still has the original. Both are now correct, and the copy only happened for the one block being written to — not the whole prompt.

Three places this pays off

Beam search. N beams fork from a common prompt. They share all of it. Only when each beam writes its first own-token does its tail block get cloned — and only its tail block.
Parallel sampling (n=N). Identical: N samples fork a shared prefix.
Prefix sharing across requests. Two users hitting the same system prompt: hash the prompt's blocks, look them up in a global table, fork instead of allocate. That's exactly what Automatic Prefix Caching (lesson 05) is.

Beam search, numerically

A 64-token prompt with BLOCK_SIZE = 16 is 4 blocks. Four beams, each appending 10 tokens:

strategy	blocks used	where they go
naive duplication	4 beams × 4 prompt blocks = 16	full prompt copied 4×
paged + sharing	4 shared prompt blocks + 4 beams × 2 new blocks = 12	prompt shared; each beam gets fresh blocks for its generated tokens

Roughly 1.3× fewer blocks for this small case (12 vs 16). The savings grow with prompt length and beam count: a 256-token prompt (16 blocks) shared across 4 beams is 16 vs 64 naive — a 4× saving — and for long shared prompts the ratio approaches N.

Block size: a tradeoff, not a constant

Why 16? It's a balance of two competing costs.

E[tail waste per sequence] = BLOCK_SIZE / 2 (uniform over the last block)

So smaller blocks → less wasted memory. But:

Each block read in the attention kernel costs one indirection. Smaller blocks mean more indirections per token, more launch overhead, less coalesced HBM access.
The block table itself grows: a 16k-token sequence with BLOCK_SIZE = 1 has a 16k-entry table — significant on its own.

Empirically on H100/A100, 16 is the sweet spot. On future GPUs with cheaper indirection or different SRAM sizes, that number will move. The point is: it's a knob, not a law.

Interactive · drive the block pool by hand

Below is the block manager from the code, made clickable. The grid is 64 physical blocks (about as much as you could fit at LLaMA-2 7B / fp16 in 512 MB — toy scale, but the dynamics are the real ones).

Try to reproduce the beam-search scenario above:

Click new sequence, then append 64 tokens. That's the prompt — 4 blocks, refcount 1 each.
Click fork ×3 on it to produce 4 beams total. Refcounts on those 4 blocks should be 4.
Forks share all 4 prompt blocks (refcount goes 1→4 on each). The first append on a fork at exactly the block boundary just allocates a new private block; the shared prompt blocks stay shared. To see CoW, click write to last block on a fork — that mutates the shared tail and triggers CoW (allocate fresh, copy contents, swap pointer, decrement old refcount).
Compare blocks used against the naive prediction of 4 beams × 4 prompt blocks = 16. With sharing you should see 12 (4 shared prompt + 8 per-beam new).

BlockPool simulator · 64 blocks, BLOCK_SIZE = 16

Each cell is one physical block. The colored bar at the top of a cell shows which sequences (by hue) hold a share. The number in the corner is the refcount. Free blocks are dim.

block size = 16 tokens · pool = 64 blocks

free / total

64 / 64

used blocks

shared blocks

sequences

show the model this widget simulates

// The whole simulator, verbatim from 02_paged_attention.py:
class BlockPool:
    allocate()      → take a free block, refcount := 1
    fork(b)         → refcount[b] += 1   (share)
    release(b)      → refcount[b] -= 1   (release; if 0, return to free list)

class Sequence:
    append(n)       → grow length; allocate a new tail block whenever length % 16 == 0
    fork_from(p)    → block_table := [pool.fork(b) for b in p.block_table]
    write_to_last() → if refcount[tail] > 1: allocate fresh, release old, swap

The CUDA picture, briefly

vLLM's paged_attention kernel does the gather inline with the dot product. Per attention head, per query:

for block_idx in range(num_blocks_for_this_seq):
    physical_block = block_table[block_idx]               // 1 HBM read of an int
    K_block = phys_K[physical_block]                      // gather 16 × d
    V_block = phys_V[physical_block]
    // … same online-softmax math as FlashAttention (lesson 03)

The block_table is tiny and almost always cached. The phys_K[...] gather costs no more than the original contiguous read — both are bandwidth-bound on the same number of bytes. Paged attention is not slower than contiguous attention; it's the same kernel with a different addressing mode.

What you should walk away with

Takeaway

Paging the KV cache makes fragmentation a BLOCK_SIZE − 1 tail problem (negligible) and makes sharing a refcount problem (free). Beam search and prefix caching aren't features bolted on top — they're consequences of the same data structure. The CUDA cost is one indirection per block.

The next lesson is FlashAttention — a completely orthogonal optimization that attacks the compute side of the same attention op. People confuse the two constantly. They compose: vLLM ships a kernel that does Flash-style tiling over a Paged-style block layout.