vllm_lessons / 02 · PagedAttention lesson 2 / 12

PagedAttention

Apply the operating system's virtual-memory playbook to the KV cache. Fragmentation collapses; beam search and prefix sharing fall out for free.

The analogy, stated up front

An operating system does not give a process one contiguous chunk of physical RAM. It hands out fixed-size pages on demand and keeps a per-process page table mapping virtual addresses to physical frames. Two virtues fall out:

PagedAttention is the same thing one level up. Replace "process" with "sequence", "page" with "block of BLOCK_SIZE tokens of KV", and "page table" with block_table: list[int]. Every claim about OS paging carries over verbatim.

Mapping
OSvLLM
processsequence
virtual pagelogical token range (BLOCK_SIZE tokens of K, V)
physical framephysical KV block in the pool
page tableblock_table: list[int]
fork() + CoWbeam fork / prefix share + CoW

What contiguous KV fails at

Lesson 01 ended with the bytes-per-token formula and a fragmentation picture. Concretely the contiguous-cache approach fails twice:

1. Internal fragmentation

To serve a request you don't know the length of, you pre-allocate max_seq_len slots. If the average response uses 80 tokens but you reserved 2048, then 1968/2048 ≈ 96% of every reservation is dead. Multiply by however many concurrent sequences you wanted, and most of HBM is unused.

2. No sharing

Beam search runs N parallel candidate continuations off the same prompt. Parallel sampling with n=4 does the same. Two users hitting the same system prompt are doing the same. With contiguous per-sequence buffers, the prompt's KV is copied N times. A 4096-token system prompt at LLaMA-2 7B is ~2 GB; serve it to 4 users and you've spent 8 GB to hold 4 copies of identical bytes.

The cost in one line
Per sequence: up to (max_seq_len − actual_len) · bytes_per_token wasted to fragmentation, plus (n_shared − 1) · prefix_bytes wasted to duplication. Both are typically dominant terms.

The block manager, in 30 lines

Two objects. The pool owns the physical blocks and refcounts them; each sequence owns a block table that points into the pool.

BLOCK_SIZE = 16  # tokens per block

class BlockPool:
    def __init__(self, num_blocks):
        self.free     = deque(range(num_blocks))   # free list
        self.refcount = [0] * num_blocks

    def allocate(self):
        block_id = self.free.popleft()
        self.refcount[block_id] = 1
        return block_id

    def fork(self, block_id):                      # share, don't copy
        self.refcount[block_id] += 1
        return block_id

    def release(self, block_id):
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            self.free.append(block_id)

class Sequence:
    def __init__(self, pool):
        self.pool        = pool
        self.block_table = []        # list[int] — the page table
        self.length      = 0

    def append(self, n=1):
        for _ in range(n):
            if self.length % BLOCK_SIZE == 0:               # tail block full
                self.block_table.append(self.pool.allocate())
            self.length += 1

That's the whole mechanism. The pool's free list is a deque, the refcounts are a flat array, and a sequence's "address space" is a list[int]. No tree, no allocator, no defragmentation — because fixed-size pages don't fragment.

Logical → physical, in two divisions

The attention kernel needs to read the K, V for token t of some sequence. With a contiguous cache the address was base + t · stride. With paging it's one indirection:

block_id = block_table[t // BLOCK_SIZE]    offset = t mod BLOCK_SIZE
kt = physical_blocks[block_id][offset]

One division, one modulo, one lookup. On the GPU that's a few extra instructions and one extra HBM read per block (not per token — the block_table is small enough to live in shared memory or registers). For a 2048-token sequence at BLOCK_SIZE = 16, the block table is 128 ints — under 1 KB. This is the entire CUDA cost of paging.

Why this isn't slower
Attention is memory-bound (lesson 01, consequence B). The kernel was already waiting on HBM to deliver K and V. A single extra indirection per block — amortized over 16 token reads — is below the noise floor. vLLM measured the overhead at ~0–3%, and the win from fitting 4–10× more sequences in HBM dwarfs it.

Refcounts and copy-on-write

So far this just fixes fragmentation. The second win — sharing — needs one more idea. A fork doesn't copy the parent's blocks; it copies the parent's block table and bumps each block's refcount:

def fork_from(self, parent):
    self.block_table = [self.pool.fork(b) for b in parent.block_table]
    self.length      = parent.length

Now two sequences point at the same physical blocks. That's correct as long as nobody writes. The moment a forked sequence wants to mutate a block — i.e. append a token into a block whose refcount > 1 — we need copy-on-write:

def write_to_last_block(self):
    last = self.block_table[-1]
    if self.pool.refcount[last] > 1:        # shared — must clone before writing
        fresh = self.pool.allocate()
        # memcpy(physical_blocks[fresh], physical_blocks[last])
        self.pool.release(last)             # drop our share of the old one
        self.block_table[-1] = fresh

The protocol, said once:

  1. Allocate a fresh block.
  2. Copy the shared block's contents into it.
  3. Decrement the shared block's refcount (you no longer point at it).
  4. Point your block_table entry at the fresh block.

The unforked sibling still has the original. Both are now correct, and the copy only happened for the one block being written to — not the whole prompt.

Three places this pays off

Beam search, numerically

A 64-token prompt with BLOCK_SIZE = 16 is 4 blocks. Four beams, each appending 10 tokens:

strategyblocks usedwhere they go
naive duplication4 beams × 4 prompt blocks = 16full prompt copied 4×
paged + sharing4 shared prompt blocks + 4 beams × 2 new blocks = 12prompt shared; each beam gets fresh blocks for its generated tokens

Roughly 1.3× fewer blocks for this small case (12 vs 16). The savings grow with prompt length and beam count: a 256-token prompt (16 blocks) shared across 4 beams is 16 vs 64 naive — a 4× saving — and for long shared prompts the ratio approaches N.

Block size: a tradeoff, not a constant

Why 16? It's a balance of two competing costs.

E[tail waste per sequence] = BLOCK_SIZE / 2    (uniform over the last block)

So smaller blocks → less wasted memory. But:

Empirically on H100/A100, 16 is the sweet spot. On future GPUs with cheaper indirection or different SRAM sizes, that number will move. The point is: it's a knob, not a law.

Interactive · drive the block pool by hand

Below is the block manager from the code, made clickable. The grid is 64 physical blocks (about as much as you could fit at LLaMA-2 7B / fp16 in 512 MB — toy scale, but the dynamics are the real ones).

Try to reproduce the beam-search scenario above:

  1. Click new sequence, then append 64 tokens. That's the prompt — 4 blocks, refcount 1 each.
  2. Click fork ×3 on it to produce 4 beams total. Refcounts on those 4 blocks should be 4.
  3. Forks share all 4 prompt blocks (refcount goes 1→4 on each). The first append on a fork at exactly the block boundary just allocates a new private block; the shared prompt blocks stay shared. To see CoW, click write to last block on a fork — that mutates the shared tail and triggers CoW (allocate fresh, copy contents, swap pointer, decrement old refcount).
  4. Compare blocks used against the naive prediction of 4 beams × 4 prompt blocks = 16. With sharing you should see 12 (4 shared prompt + 8 per-beam new).
BlockPool simulator · 64 blocks, BLOCK_SIZE = 16
Each cell is one physical block. The colored bar at the top of a cell shows which sequences (by hue) hold a share. The number in the corner is the refcount. Free blocks are dim.
block size = 16 tokens · pool = 64 blocks
free / total
64 / 64
used blocks
0
shared blocks
0
sequences
0
show the model this widget simulates
// The whole simulator, verbatim from 02_paged_attention.py:
class BlockPool:
    allocate()      → take a free block, refcount := 1
    fork(b)         → refcount[b] += 1   (share)
    release(b)      → refcount[b] -= 1   (release; if 0, return to free list)

class Sequence:
    append(n)       → grow length; allocate a new tail block whenever length % 16 == 0
    fork_from(p)    → block_table := [pool.fork(b) for b in p.block_table]
    write_to_last() → if refcount[tail] > 1: allocate fresh, release old, swap

The CUDA picture, briefly

vLLM's paged_attention kernel does the gather inline with the dot product. Per attention head, per query:

for block_idx in range(num_blocks_for_this_seq):
    physical_block = block_table[block_idx]               // 1 HBM read of an int
    K_block = phys_K[physical_block]                      // gather 16 × d
    V_block = phys_V[physical_block]
    // … same online-softmax math as FlashAttention (lesson 03)

The block_table is tiny and almost always cached. The phys_K[...] gather costs no more than the original contiguous read — both are bandwidth-bound on the same number of bytes. Paged attention is not slower than contiguous attention; it's the same kernel with a different addressing mode.

What you should walk away with

Takeaway
Paging the KV cache makes fragmentation a BLOCK_SIZE − 1 tail problem (negligible) and makes sharing a refcount problem (free). Beam search and prefix caching aren't features bolted on top — they're consequences of the same data structure. The CUDA cost is one indirection per block.

The next lesson is FlashAttention — a completely orthogonal optimization that attacks the compute side of the same attention op. People confuse the two constantly. They compose: vLLM ships a kernel that does Flash-style tiling over a Paged-style block layout.