vllm_lessons / 10 · chunked prefill lesson 10 / 12

Chunked prefill

Prefill is compute-bound; decode is memory-bound. Serialize them and you saturate one bottleneck at a time. Pack them in one fused pass and you fire both engines simultaneously.

The bottleneck table you already know

Lesson 01 noted, and lesson 08 dramatized, that the two phases of inference have opposite limits:

phaseFLOPs / stepHBM bytes / stepbottleneck
prefill (T tokens)O(T2) per layerO(T) for weights + activationscompute
decode (1 token)O(t) per layerO(t) — full KV readmemory

Run them on the same GPU back-to-back: while prefill is grinding through matmuls, the HBM bandwidth lane is sitting at 10% utilization. While the in-flight decoders are streaming KV out of HBM, the tensor cores are sitting at 5% utilization. You bought the GPU; you're using half of it.

The idea

Two moves:

  1. Break each prefill into fixed-size chunks (say 512 tokens). The math is unchanged — attention still attends within the prompt — but the work is now in 512-token bites instead of one 4096-token block.
  2. Within a single fused forward pass, pack a mix of (decode-step tokens, prefill-chunk tokens) up to a token budget. One step, one kernel launch, two workloads.

Because prefill chunks are FLOP-heavy and decode steps are HBM-heavy, batching them complements the GPU's two scarce resources. The two phases hide each other's latency in the same step.

Why this works · the arithmetic

Let a step process N tokens total. The kernel's wall time is roughly:

tstep  ≈  max(N · ccompute,   KV_bytes_read · cHBM)  +  launch_overhead

For a pure decode step: N is small (1 per active sequence), the right term dominates. For a pure prefill of a long prompt: N is large, the left term dominates. Stack a prefill chunk on top of Ndecode decode tokens in one step: HBM bandwidth runs at near 100% on behalf of the decoders, and the same tensor cores run at near 100% on behalf of the prefill chunk. The two costs are added in a max, not a sum.

Scheduler, in 12 lines

budget_left = TOKEN_BUDGET           # e.g. 2048

# 1. Decodes go first — each needs exactly 1 slot, none can be split.
tasks = []
for r in active_decoders:
    tasks.append(DecodeTask(r, 1))
    budget_left -= 1

# 2. Fill remaining budget with chunks from waiting prefills.
for r in waiting_prefills:
    if budget_left == 0: break
    chunk = min(budget_left, r.prefill_left)
    tasks.append(PrefillTask(r, chunk))
    r.prefill_left -= chunk
    budget_left -= chunk
    if r.prefill_left == 0:
        r.move_to_decoding()       # next step, this request joins active_decoders

run_fused_forward(tasks)

That's the entire algorithm. Real vLLM adds priorities, SLO awareness, and per-sequence limits, but the skeleton is right.

The fused kernel shape

Tasks are concatenated along a flat token dim:

X:           (sum_of_task_tokens, d)        # packed input embeddings

# Per-task metadata, length = n_tasks
seq_start:   (n_tasks,)                     # where each task starts in X
seq_len:     (n_tasks,)                     # how many tokens this task owns
block_table: list of block-id arrays        # paged KV (see lesson 02)

Inside the attention op each task dispatches its tokens against its own KV: past blocks for ongoing decodes, past chunks + current chunk for in-progress prefills. Attention math is unchanged — only the gather pattern differs, and the paged-KV block table makes that trivial.

Interactive · zoom into one fused step

The picture below is one kernel launch. The wide bar is the token budget; each cell is a token. Blue cells are decode tokens — one per active decoder, each reading its own full KV cache from HBM. Colored stretches are prefill chunks, sliced off waiting requests. The dim tail is wasted budget.

Move the sliders: see how the compute lane (filled by the count of tokens) and the HBM lane (filled mostly by the count of active decoders reading their full KV) both light up only when the step is packed with both kinds of work.

A single fused-forward step · token packing
Decode = 1 cell per active sequence (HBM-heavy). Prefill chunk = many cells from one request (compute-heavy). Goal: fill both lanes.
tokens packed
budget remaining
compute lane
HBM lane
show the packing rule
// One step:
budget_left = TOKEN_BUDGET
batch = []                       # what the kernel sees

# decodes first — each owns exactly one slot
for r in active_decoders:
    batch.append((r, "decode", 1))
    budget_left -= 1

# fill remainder with prefill chunks, in order
for r in waiting_prefills:
    if budget_left == 0: break
    n = min(chunk_size, r.prefill_left, budget_left)
    batch.append((r, "prefill", n))
    budget_left -= n
return run_fused_forward(batch)   # one kernel call

The budget tradeoff

The single tunable is max_num_batched_tokens. Two failure modes:

Budget too small (< 512)
Per-step wall time is dominated by the kernel launch / Python overhead, not the per-token work. Throughput tanks because you spend more time launching kernels than computing. Same disease as static batching at batch=1.
Budget too large (> 16k)
Each step takes long enough that ongoing decodes see a TTBT (time-between-tokens) spike during a step that contains a huge prefill chunk. Tail latency goes up; mean throughput doesn't improve because you're already compute-saturated. Visible to users as "the stream jitters".

Empirical sweet spot: 2048–8192 on H100/A100 for 7B–70B models. The widget below lets you feel the cliff.

TTFT impact · why this is a big deal for long prompts

Time to first token, by definition, is "how long after I send the request do I get token 1". In the plain regime:

TTFTplain  =  (queue wait)  +  prefill_time(P)  +  one decode step

For P = 4096 on a 7B at ~10 GB/s prefill: prefill_time ≈ 100 ms. The new request waits the full 100 ms before seeing anything.

With chunked prefill at budget 512:

TTFTchunked  ≈  queue wait  +  one chunk_time  +  one decode step

That's ~13 ms for the chunk vs the 100 ms full prefill — an ~8× TTFT win on a 4096-tok prompt. For short prompts (P ≤ chunk_size) it's a wash; the chunk is the prefill.

Caveats and what doesn't change

Interactive · pack the budget

Animate the scheduler for 20 steps on a workload of mixed prompt lengths. Each step is a horizontal bar showing how that step's budget tokens were spent: blue for decode tokens (one per active decoder), orange for prefill chunks, dim for unused budget. In plain mode prefills consume the entire budget in one step; in chunked mode the bars fill consistently.

Per-step budget filler
Watch what happens to the blue decode lane when a long prefill arrives in plain mode. With chunked mode the same prefill spreads across many steps, leaving room for decodes every step.
tip: try plain mode with budget=2048 and a long-prompt arrival pattern — watch the blue lane vanish for a whole step
mean TTFT
total wall time
GPU util (filled/budget)
decodes served
show the scheduler this widget simulates
// At each step:
//   budget_left = TOKEN_BUDGET
//   for each active decoder: spend 1 token, budget_left -= 1
//   for each waiting prefill (mode=chunked):
//     chunk = min(prefill_left, budget_left)
//     budget_left -= chunk
//     if prefill_left == 0: promote to decoder
//
//   plain mode: budget gets eaten whole by ONE waiting prefill,
//   and the active decoders sit idle for that step.

Putting it together

regimecompute laneHBM laneGPU utilTTFT for new long-prompt req
serialized (plain)full during prefill, idle during decodeidle during prefill, full during decode~50%prefill_time
chunked + co-batchedfull each step (chunk fills FLOPs)full each step (decoders fill bandwidth)~85%1 chunk_time
Takeaway
Chunked prefill is the scheduling primitive that turns "two phases with opposite bottlenecks" from a problem into an asset. Same kernel, same model — every step now uses both halves of the GPU. Net throughput gain on mixed workloads is 1.5–2×; TTFT on long prompts improves linearly with the chunk count.