all_lessons/gpu_kernel_serving/15 · scheduling & graphslesson 15 / 17

Scheduling and graph capture

Lessons 03–05 made each kernel cheaper. This lesson makes the gaps between kernels cheaper. Continuous batching folds new requests in without restarting; chunked prefill keeps a long prompt from freezing decoders; CUDA graphs erase the per-launch CPU dispatch cost.

The question this lesson answers

If kernels are now efficient, why does decode-step latency on small batches plateau around a few hundred microseconds — well above the roofline lower bound from lesson 02? Because the work between launches stops being free. The scheduler must build a batch every step, Python dispatches each kernel, the driver enqueues commands, and the GPU waits between blocks. At small batch each of those costs is comparable to the actual math. This lesson is about hiding them.

Static batching is the wrong baseline

Naive batched inference says: collect N requests, prefill them all, decode them all in lockstep, return when the slowest finishes. Three things go wrong:

  1. A short response sits idle waiting for a long response → wasted GPU.
  2. New requests can't join until the next batch boundary → high time-to-first-token under bursty traffic.
  3. Variable lengths force padding to the max → wasted compute and memory.

Continuous batching

The fix is to dissolve the "batch" boundary. At every decode step the scheduler:

  1. Examines the running batch and removes any sequence that finished (hit EOS, stop sequence, or max-tokens).
  2. Pulls eligible waiting requests from a queue and admits as many as fit in the KV pool and the per-step token budget.
  3. Combines admitted prefill tokens and existing decode tokens into one mixed batch.
  4. Launches one forward pass that handles the mix.
static batching req A · 9 tokens req B · 4 tok idle (B done, waits for A) req C · 8 tok queue: D, E, F wait for next batch continuous batching req A · 9 tokens req B · 4 tok req D joins immediately when B finishes req C · 8 tok req E joins when C finishes x-axis is decode steps. Each step is one forward pass containing whoever is currently active. No idle bubbles.

Chunked prefill

Continuous batching has a residual problem: a request with an 8K prompt processes all 8K tokens in one step. That step is dominated by a giant prefill GEMM, which can take 50–200 ms. Every decoding sequence in the batch waits for it. Time-to-first-token of the new request improves; latency of every other request gets worse.

Chunked prefill splits the prompt across several scheduling steps. Step k processes a chunk of N_prefill tokens of the long prompt, plus N_decode tokens from other sequences, all in one mixed kernel call.

unchunked: one giant prefill step blocks decoders prefill A · 8K tokens (one step, ~100 ms) decoders B, C, D stalled the entire time then everyone decodes again chunked: prefill split into 8 chunks, decoders continue every step prefill A chunks (1K each) decoders B, C, D — one token per step Each step has a fixed token budget = chunk_tokens + decode_tokens. Larger budget → fewer steps but bigger per-step latency. Smaller → more steps, more launch overhead.

The scheduler runs a token-budget allocator. Typical defaults pick a budget like 2048 or 4096 tokens per step; that is split between active decoders (one token each) and one or more in-flight prefill chunks. Each engine has knobs:

Variant you'll hear about: PD disaggregation
The extreme version of avoiding prefill–decode interference is to run them on separate GPUs entirely: a prefill pool produces KV for a request, transfers it to a decode pool, and only the decode pool serves tokens. Trade-off: the KV transfer becomes a system path (NVLink or network). It pays off mainly for very-long-prompt, short-output workloads. Lesson 08 returns to this as a serving-architecture decision.

CUDA graphs: erasing dispatch overhead

Even after the scheduler is perfect, the per-step Python dispatch is real. For Llama-3-70B decode at batch 32, one step is around 80–200 small kernel launches. At ~5 µs each that is 0.4–1.0 ms of pure CPU-side dispatch — comparable to the actual GPU math.

CUDA graphs solve this by capturing a stream of kernel launches once and replaying it many times with a single command. The CPU goes from "launch 200 kernels" to "replay graph." The GPU runs the same kernels in the same order; only the dispatch is bypassed.

eager dispatch (CPU is busy every kernel) red = CPU launch overhead; blue = actual GPU kernel time. At small batch the red blocks dominate. graph replay (one launch, no per-kernel CPU) one red block at launch; all kernels run back-to-back. The CPU thread is free to schedule the next step concurrently. savings ≈ (n_kernels − 1) × launch_overhead. At 100 kernels × 5 µs that is ~0.5 ms per token — large at small batch, less relevant at large batch where kernels are long.

The catch is that the captured graph runs exactly what it captured. If shapes change (different batch size, different sequence length, different number of pending sequences), you cannot use that graph. Production engines therefore capture multiple graphs at preset batch sizes (e.g., 1, 2, 4, 8, 16, 32, 64, 128, 256), and pad up to the nearest captured size when running.

Graph capture costs and constraints

ConstraintWhy it existsWorkaround
Fixed shapesEach kernel was launched with specific dims.Pad batch and KV table sizes to nearest captured shape.
Fixed memory addressesCaptured pointers must still be valid at replay.Persistent buffers; never reallocate between captures.
No CPU-dependent control flowBranches cannot depend on host state.Express everything as masked tensor ops or use conditional graph nodes.
Capture is slow (~seconds)Driver compiles a launch plan.Capture during warmup, never on the request path.
Replay still costs VRAMCaptured graphs hold buffers.Don't capture every conceivable shape; pick a handful and pad.
Coupling scheduler and kernel
This is where scheduling and kernel design become coupled. The scheduler must hit shapes the graph supports, so it pads/truncates batches; the kernel author has to make sure the captured shape is the fast shape. A "fast kernel that isn't capturable" can lose to a "slower kernel that runs in a graph."

Interactive · token budget & graph capture

Set per-step token budget (split between prefill chunks and decoders) and the number of small kernel launches per step. The widget shows the eager vs graph-replay latency and tells you when graph capture is worth the constraints.

When CUDA graphs matter

Eager: n × (launch + work). Graph: 1 × launch + n × work. At small batch and many launches, graph capture is the difference between bandwidth-bound and launch-bound.

Variant you'll hear about: speculative decoding
A draft model proposes k tokens; the target model verifies them in parallel. The verify step has k+1 Q rows per sequence instead of one, moving the step slightly toward prefill on the arithmetic-intensity axis. That makes the same kernels and graphs more efficient. The scheduler must accept that some steps reject every draft and revert to ordinary decode.

What this gives you for the rest of the track

With scheduling installed, the GPU is rarely idle and the CPU rarely waits. The remaining ground for improvement is inside the non-attention kernels: weight matrices, MoE routing, sampling, communication. That is lesson 07. Then lesson 08 stitches everything into the framework that turns HTTP requests into the structured batches we now know how to run.