Scheduling and graph capture
Lessons 03–05 made each kernel cheaper. This lesson makes the gaps between kernels cheaper. Continuous batching folds new requests in without restarting; chunked prefill keeps a long prompt from freezing decoders; CUDA graphs erase the per-launch CPU dispatch cost.
The question this lesson answers
If kernels are now efficient, why does decode-step latency on small batches plateau around a few hundred microseconds — well above the roofline lower bound from lesson 02? Because the work between launches stops being free. The scheduler must build a batch every step, Python dispatches each kernel, the driver enqueues commands, and the GPU waits between blocks. At small batch each of those costs is comparable to the actual math. This lesson is about hiding them.
Static batching is the wrong baseline
Naive batched inference says: collect N requests, prefill them all, decode them all in lockstep, return when the slowest finishes. Three things go wrong:
- A short response sits idle waiting for a long response → wasted GPU.
- New requests can't join until the next batch boundary → high time-to-first-token under bursty traffic.
- Variable lengths force padding to the max → wasted compute and memory.
Continuous batching
The fix is to dissolve the "batch" boundary. At every decode step the scheduler:
- Examines the running batch and removes any sequence that finished (hit EOS, stop sequence, or max-tokens).
- Pulls eligible waiting requests from a queue and admits as many as fit in the KV pool and the per-step token budget.
- Combines admitted prefill tokens and existing decode tokens into one mixed batch.
- Launches one forward pass that handles the mix.
Chunked prefill
Continuous batching has a residual problem: a request with an 8K prompt processes all 8K tokens in one step. That step is dominated by a giant prefill GEMM, which can take 50–200 ms. Every decoding sequence in the batch waits for it. Time-to-first-token of the new request improves; latency of every other request gets worse.
Chunked prefill splits the prompt across several scheduling steps. Step k processes a chunk of N_prefill tokens of the long prompt, plus N_decode tokens from other sequences, all in one mixed kernel call.
The scheduler runs a token-budget allocator. Typical defaults pick a budget like 2048 or 4096 tokens per step; that is split between active decoders (one token each) and one or more in-flight prefill chunks. Each engine has knobs:
--max-num-batched-tokens(vLLM): step token budget.--long-prefill-token-threshold/--enable-chunked-prefill(vLLM): when chunking kicks in.--chunked-prefill-size(SGLang): per-step prefill cap.
CUDA graphs: erasing dispatch overhead
Even after the scheduler is perfect, the per-step Python dispatch is real. For Llama-3-70B decode at batch 32, one step is around 80–200 small kernel launches. At ~5 µs each that is 0.4–1.0 ms of pure CPU-side dispatch — comparable to the actual GPU math.
CUDA graphs solve this by capturing a stream of kernel launches once and replaying it many times with a single command. The CPU goes from "launch 200 kernels" to "replay graph." The GPU runs the same kernels in the same order; only the dispatch is bypassed.
The catch is that the captured graph runs exactly what it captured. If shapes change (different batch size, different sequence length, different number of pending sequences), you cannot use that graph. Production engines therefore capture multiple graphs at preset batch sizes (e.g., 1, 2, 4, 8, 16, 32, 64, 128, 256), and pad up to the nearest captured size when running.
Graph capture costs and constraints
| Constraint | Why it exists | Workaround |
|---|---|---|
| Fixed shapes | Each kernel was launched with specific dims. | Pad batch and KV table sizes to nearest captured shape. |
| Fixed memory addresses | Captured pointers must still be valid at replay. | Persistent buffers; never reallocate between captures. |
| No CPU-dependent control flow | Branches cannot depend on host state. | Express everything as masked tensor ops or use conditional graph nodes. |
| Capture is slow (~seconds) | Driver compiles a launch plan. | Capture during warmup, never on the request path. |
| Replay still costs VRAM | Captured graphs hold buffers. | Don't capture every conceivable shape; pick a handful and pad. |
Interactive · token budget & graph capture
Set per-step token budget (split between prefill chunks and decoders) and the number of small kernel launches per step. The widget shows the eager vs graph-replay latency and tells you when graph capture is worth the constraints.
What this gives you for the rest of the track
With scheduling installed, the GPU is rarely idle and the CPU rarely waits. The remaining ground for improvement is inside the non-attention kernels: weight matrices, MoE routing, sampling, communication. That is lesson 07. Then lesson 08 stitches everything into the framework that turns HTTP requests into the structured batches we now know how to run.