Continuous batching

Static batching wastes half your cycles waiting for stragglers. The fix is one paragraph of scheduler logic — but it only works because of paged KV.

The picture you inherited from training

Training-style inference takes a batch of B sequences, runs them in lockstep for T steps, returns the lot, takes the next batch. The batched matmul is efficient — that's the whole reason batches exist. So the natural mental model when you first deploy a model is "pack B requests, run them together, pop them off when they're done."

This works in training because every example in a batch consumes exactly the same number of forward steps: one. You backprop, you're done. Inference is different. A request emits one token per step, and each request has its own unknown stopping time — the model decides when to emit <eos>. The batch no longer terminates at a single time T; it terminates at max_i(T_i).

Which means: the batch runs for as long as its longest member. Every other slot, after it emits its own <eos>, sits there generating pad tokens that get discarded. The GPU still does the work — the kernel is shape-fixed, it doesn't know which sequences are "done" — it's just thrown away.

The static-batching cost, exactly

Let a batch have output lengths T₁, …, T_B. Cycles spent = max_i(T_i). Useful tokens emitted = Σ T_i. Utilization is

η_static = (Σ T_i) / (B · max_i T_i) = mean(T_i) / max(T_i)

This ratio is what determines whether you're getting 90% or 40% of the kernel's peak. It has nothing to do with the kernel.

Why production workloads punish this

Chat-style traffic is bimodal: most answers are short ("yes", "the capital is Paris", a one-line correction), a minority are long (a code block, an essay, a generated test plan). Empirically: ~80% of completions land between 10 and 50 tokens, ~20% in the 200–500 range. The mean is small, the max is large, and they're separated by an order of magnitude.

Plug a representative batch (50, 60, 400, 80) into the formula:

η = (50 + 60 + 400 + 80) / (4 · 400) = 590 / 1600 = 36.9%

For 313 out of 400 steps (78%), three of the four slots are doing nothing. The slot stays allocated, the KV grows for it on the device side (because the kernel is shape-fixed), and the bytes still move through HBM. You paid for four sequences and got 1.5 worth of useful output.

step  ────────────────────────────────────────────  400
A:50  ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  (waiting 350 steps for the batch)
B:60  █████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  (waiting 340)
C:400 ████████████████████████████████████████████  (the straggler)
D:80  ███████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  (waiting 320)
                                  ▲
                                  └── gray = wasted cycle, full HBM traffic, zero useful work

And it gets worse. A new request that arrives at step 5 can't join this batch — the kernel shapes are fixed. It waits until step 400 for the batch to drain, then waits for B-1 other requests to accumulate, then runs. Time-to-first-token is the batch's predecessor's straggler.

Continuous batching (Orca, OSDI 2022)

The fix is to stop treating the batch as a fixed-membership unit and instead make membership a per-iteration decision. The loop is exactly what the static loop should have been:

while pending or active:
    # 1. Evict: any sequence whose last token was <eos> leaves the batch.
    active = [r for r in active if not r.done]

    # 2. Admit: pull pending requests from the queue until we hit max_active
    #    (or the KV-cache budget — whichever bites first).
    while pending and len(active) < max_active:
        active.append(pending.pop(0))

    # 3. Run one forward step over the (now-updated) active set.
    forward(active)

The kernel's batch dimension changes every step. A request that emitted <eos> on step k is gone on step k+1; a request that arrived in the queue at step k joins on step k+1. Nobody waits for stragglers, because there are no stragglers — only a continuously-refilled active set.

The utilization changes:

η_continuous = (Σ T_i) / (Σ T_i) = 1

(modulo edge effects at the head and tail of the schedule, and modulo the KV budget). Every cycle produces |active| useful tokens. The 400-step straggler still takes 400 steps, but it shares those 400 steps with 7 or 15 or however-many different short requests that come and go around it.

What changed

Same model, same kernels, same hardware. Only the scheduler. Throughput on bimodal workloads goes from ~45% of peak to 85–95%.

The link back to lesson 02 — why this needed paged KV

Step 2 of the loop says "admit pending requests up to max_active." Pretend for a moment we're still on the textbook contiguous KV cache from lesson 01. To admit a new request right now, mid-step, we'd need to find a contiguous range of max_seq_len slots in HBM for its KV.

After even a little churn — sequences ending, new ones starting, the active set turning over every few steps — that contiguous range doesn't exist. The buffer is a sea of holes: a 50-token gap here, a 60-token gap there, a 400-token survivor in the middle. Classic external fragmentation. We'd have to either copy-compact the cache (a stop-the-world operation that blows the whole point of avoiding stragglers) or refuse to admit (back to waiting).

With paged KV the problem evaporates. Admitting a new request means: grab as many free blocks as it needs right now from the shared pool, point its block table at them. Each block is 16 tokens; you only need some free blocks, not a contiguous run. The block manager hands them out, the new sequence's first row gets written into the first block, and on the next forward pass the attention kernel reads through the block table indirection like any other sequence.

The dependency, in one line

Paged KV makes admit O(blocks_needed) instead of O(scan for contiguous range). Without it, "iteration-level scheduling" is a paper, not a system.

This is also the answer to "what's the actual core innovation of vLLM?" PagedAttention is the paper headline, but its practical value is that it unlocks continuous batching. Either one alone is interesting; together they're the throughput story.

Prefill vs decode, and a foreshadow

The simulator below treats every step as one token per active sequence. That's the decode phase. The other phase is prefill: when a request arrives with a 1000-token prompt, those 1000 tokens go through in one parallel pass over the full sequence dimension. That single forward does ~1000× the FLOPs of one decode step — it's compute-bound (lots of matmul, small KV read per position), the mirror of decode's memory-boundedness (consequence B from lesson 01).

This causes a wrinkle for continuous batching: a long prefill in the active set monopolizes the GPU for tens of milliseconds, during which the other (decode) sequences in the same step finish slowly. Time-to-first-token for the new request is fine, but inter-token latency for everyone else hiccups.

Chunked prefill (lesson 10) handles this by breaking long prefills into 512-token slices and packing each slice into a step alongside decode tokens. Because prefill is compute-bound and decode is memory-bound, the fused op hits both bottlenecks simultaneously rather than waiting on one. For now: just know the simulator's "one token per step" is a decode-only abstraction.

Where continuous batching gives less

The math says η goes from mean(T) / max(T) to 1. The denominator and numerator are both Σ T_i in the continuous case only if there's always something in the queue. Two regimes erode the win:

Capacity-saturated. If your steady-state active set is already at max_active and the variance in T_i is low (e.g. you're benchmarking on a fixed-length workload), static and continuous batching are nearly identical. Continuous batching's headline number is a function of length variance.
Bursty light load. If the queue is often empty, continuous batching's active set drops below max_active and the GPU runs partial-batch steps. The kernel does less per cycle. Static batching has the same problem (worse, actually — it has to wait to fill a batch) but the comparison narrows.

The juicy regime is moderate load with high length variance — which is exactly what chat workloads are.

Caveat

Caveat: the 2–4× production gap quoted by vLLM benchmarks is closer to the lower bound when the active set is consistently full and request lengths are similar.

Interactive · feel the stragglers

Two panes, one workload. The top pane is static batching: each colored bar is a request, each ▪ gray wedge after a bar is the request sitting in its slot emitting pad tokens while the batch's straggler runs out the clock. The bottom pane is continuous batching on the same workload: bars are tighter because every step admits whatever's pending and evicts whatever's done. Pull the "% long" slider up and watch the gray wedges in the top pane balloon while the bottom barely changes.

Static vs continuous batching · the same bimodal workload

Top: static (batch fixed, wait for stragglers). Bottom: continuous (admit on arrival, evict on <eos>). Bars are request lifetimes; gray = wasted cycles (slot occupied, no useful output). KPIs at the bottom. Note: "fill rate" measures slot occupancy (mean_active / max_active), so it falls below 1 when the queue drains — not a sign of waste, just an under-full active set.

workload size: 32 max_active: 4

% long requests: 20 long length: 400

static cycles

—

continuous cycles

—

static η

—

continuous fill rate

—

mean latency (static)

—

mean latency (continuous)

—

speedup

—

show the scheduler logic this widget simulates

// static: batches of B fixed-membership, run for max(output_len) steps each.
while i < len(requests):
    batch = requests[i:i+B];  i += B
    cycles += max(r.output_len for r in batch)
    useful += sum(r.output_len for r in batch)

// continuous: per-step admit / evict.
while pending or active:
    while pending and len(active) < max_active: active.append(pending.pop())
    for r in active: r.tokens_done += 1
    useful += len(active);  cycles += 1
    active = [r for r in active if r.tokens_done < r.output_len]

Where this sits in the optimization stack

Lesson	What it changed about this picture
02 PagedAttention	Made the "admit a new request" step O(blocks_needed) instead of impossible.
04 (this)	Made the scheduler decide per-iteration, not per-batch.
10 Chunked prefill	Pack long prefills into the same fused step as decodes, so prefill doesn't stall the decode tail.
11 Preemption + swap	What happens when admit fails — the KV pool is full and you have to evict an in-flight sequence.

Takeaway

Static-batching utilization is mean(T) / max(T). On bimodal traffic that's ~0.37. Continuous batching makes the per-step admit/evict decision and pushes it to ~1, but only because paged KV makes admit cheap. The "win" is workload-dependent: it grows with length variance and shrinks under saturation.