CUDA Graphs & TensorRT — serve-time graph capture

Decode emits one token per forward; each forward is hundreds of kernel launches; each launch is ~1–2 μs of GPU-side overhead. Capture the launches once, replay forever — and at the limit, hand the whole model to TensorRT and let it pick a globally-optimised plan.

What launches cost

In lesson 13 we counted host-side Python overhead (~5–10 μs per op). There's a second cost we haven't isolated yet: the device-side launch overhead, the time the GPU spends accepting and starting each kernel. On H100 this is roughly 1–2 μs per launch.

This is invisible to most training code because backward kernels are large enough that the bubble is negligible. But for decode (~hundreds of launches per token, each kernel ~10–50 μs for a 7B model), the bubble is a meaningful fraction of step time. Even with torch.compile reducing host overhead, the device-side bubble between dozens of small kernels per attention block doesn't go away.
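To put rough numbers on that argument, here is a minimal sketch of the bubble as a share of each launch-plus-kernel slot. The 1.5 μs launch cost is the midpoint of the ~1–2 μs figure from the lesson intro; the 1000 μs training kernel is an assumed value chosen only for contrast, and `bubble_fraction` is a hypothetical helper, not part of any library.

```python
def bubble_fraction(kernel_us, launch_us):
    """Share of one launch-plus-kernel slot that is spent in launch overhead."""
    return launch_us / (launch_us + kernel_us)


LAUNCH_US = 1.5  # midpoint of the ~1-2 us device-side launch cost quoted in this lesson

# Decode kernels for a 7B model: ~10-50 us each (figures from the paragraph above).
decode_small = bubble_fraction(kernel_us=10.0, launch_us=LAUNCH_US)
decode_large = bubble_fraction(kernel_us=50.0, launch_us=LAUNCH_US)

# Training backward kernel: 1000 us is an ASSUMED size, used only for contrast.
training = bubble_fraction(kernel_us=1000.0, launch_us=LAUNCH_US)

for name, frac in [("decode, 10 us kernel", decode_small),
                   ("decode, 50 us kernel", decode_large),
                   ("training, 1000 us kernel (assumed)", training)]:
    print(f"{name}: {frac:.1%} of the slot is launch overhead")
```

This counts only the device-side cost; adding the 5–10 μs of host overhead per op makes the decode case look worse still whenever the host cannot run ahead of the GPU.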

CUDA Graphs — capture once, replay many times

A CUDA Graph is a recorded sequence of kernel launches (and memcpys, and events) that the driver hands to the GPU as a single submission. The GPU's command processor sees one launch, expands it into the captured sequence internally, and runs through them with no host involvement and minimal between-kernel bubble.

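Two ways to build one: stream capture, where you run the work once on a capturing stream and the driver records each launch instead of executing it, and explicit construction, where you add nodes and dependencies by hand through the graph API.

PyTorch exposes stream capture via torch.cuda.CUDAGraph: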

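import torch

# assumes `model` (eval mode, on the GPU), an example input `x`, `loader`, and `use` already exist
g = torch.cuda.CUDAGraph()
static_input = x.clone()  # fixed-address input buffer, holding valid data for warmup

# warmup on a side stream (build cuDNN/cuBLAS workspaces, autotune kernels)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# capture: the tensor returned here becomes the fixed-address output buffer
with torch.no_grad(), torch.cuda.graph(g):
    static_output = model(static_input)

# replay: fast
for batch in loader:
    static_input.copy_(batch)  # same shape and dtype as x
    g.replay()
    use(static_output)  # overwritten by the next replay; clone it if you need to keep it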

Replay is a single device-side submission. For a decoder model with batch=1, this typically cuts step time by 20–40% — entirely from launch overhead reduction.

Animated · launch-overhead timeline

Same workload, two timelines. Top: eager dispatch — each kernel launch is preceded by a Python/CUDA-driver bubble (~10 μs). Bottom: CUDA Graph replay — one submission, kernels run back-to-back with minimal gap. Hit play to see real-time progress; both lines run at the same kernel speed, only the bubbles differ.

Eager vs CUDA Graph · launch overhead timeline
Each cell is one kernel. Grey gaps are launch bubbles (host/driver + device-side). Same N kernels in both rows, same per-kernel work; difference is dispatch overhead.
Readouts: eager finish, graph finish, speedup, launch overhead.

The shape-dependence trap

A captured graph has every tensor's shape, stride, and address baked in. Two things follow:

  1. Inputs and outputs must be at fixed addresses. You can't pass a new tensor each call; you copy the new data into the same buffer ("static input"). For inference this matches naturally (static_input.copy_(batch)); for training the data loader has to write into a fixed device buffer.
  2. Shapes must be fixed. A different sequence length, batch size, or KV-cache length needs a different captured graph. Production serving stacks capture one graph per (batch, sequence) bucket, with batch sizes of 1, 2, 4, 8, 16, 32 being common.

This is also why mode="reduce-overhead" in torch.compile only kicks in when shapes are stable: the compiler wraps the compiled forward in a CUDA Graph, but if shapes change it has to give up and re-capture (or fall back to ungraphed execution).
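The bucketing idea can be sketched in a few lines: round each request batch up to the nearest captured size, pad to that size, and replay the matching graph. The bucket list is the one quoted above; `pick_bucket` and `padding_waste` are hypothetical helper names for illustration, not functions from any serving stack.

```python
import bisect

# Captured batch sizes, taken from the list in the text above.
BUCKETS = [1, 2, 4, 8, 16, 32]


def pick_bucket(batch, buckets=BUCKETS):
    """Return the smallest captured batch size >= batch, or None if none fits."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    i = bisect.bisect_left(buckets, batch)
    return buckets[i] if i < len(buckets) else None


def padding_waste(batch, bucket):
    """Fraction of the padded batch that is filler rows rather than real requests."""
    return (bucket - batch) / bucket


for batch in (1, 3, 5, 32, 33):
    bucket = pick_bucket(batch)
    if bucket is None:
        print(f"batch={batch}: no captured graph, fall back to ungraphed execution")
    else:
        print(f"batch={batch}: replay bucket {bucket}, "
              f"{padding_waste(batch, bucket):.0%} padded rows")
```

Denser buckets waste less compute on padding but cost more memory and capture time, since every bucket is a separately captured graph.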

The KV cache problem in graph capture
In decode, the KV cache grows by one token per step. A naïve capture bakes in "KV length = 137", but the next step needs "KV length = 138". Solutions: (a) capture one graph per length up to max_length and switch between them (memory-expensive), or (b) over-allocate KV to max_length, write into the next slot each step, and have the kernels read the current length from a device-side tensor rather than from a baked-in launch argument (one graph, used everywhere; vLLM does this with paged KV). Option (b) is why paged attention was the right architecture for graph capture.
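Option (b) can be illustrated without a GPU. The sketch below is plain Python, not vLLM's implementation: `StaticKVCache` is a hypothetical class whose buffers are allocated once at max_length and only ever written in place, with the current length kept as mutable state that stands in for a value stored in device memory.

```python
import math


class StaticKVCache:
    """Fixed-size KV buffer: allocated once at max_length, filled one slot per step."""

    def __init__(self, max_length, head_dim):
        self.max_length = max_length
        self.k = [[0.0] * head_dim for _ in range(max_length)]
        self.v = [[0.0] * head_dim for _ in range(max_length)]
        self.length = 0  # stands in for a length value living in device memory

    def append(self, k_new, v_new):
        if self.length >= self.max_length:
            raise RuntimeError("KV cache is full")
        # In-place writes: the buffers keep their identity ("address") on every step.
        self.k[self.length][:] = k_new
        self.v[self.length][:] = v_new
        self.length += 1

    def attend(self, q):
        """Softmax attention over the valid prefix; slots past `length` are ignored."""
        if self.length == 0:
            raise ValueError("cannot attend over an empty cache")
        scores = [sum(qi * ki for qi, ki in zip(q, self.k[t]))
                  for t in range(self.length)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [sum(weights[t] * self.v[t][d] for t in range(self.length)) / total
                for d in range(len(q))]


cache = StaticKVCache(max_length=4, head_dim=2)
cache.append([1.0, 0.0], [1.0, 0.0])
cache.append([0.0, 1.0], [0.0, 1.0])
print(cache.length, cache.attend([0.0, 0.0]))
```

Nothing about the buffers changes between steps except their contents and the length value, which is the property a captured graph needs. A real kernel would typically launch over the full allocation and mask the unused slots instead of looping over a Python range.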

2D · the shape-dependence trap

A graph was captured at batch size B_capture = 4. Slide the request batch size. If you land on the captured shape: instant replay (fast). If you don't: the graph invalidates and you pay re-capture (slow) or fall back to ungraphed execution. Production stacks capture one graph per bucket so the slider always lands on a marker.

Shape-dependence · captured graphs vs request shape
Yellow markers = captured shapes (bucket boundaries). Slide the request batch; the widget says replay / re-capture / fall-back, and shows the per-step cost.
Readouts: status, step time (ms), wasted compute, used bucket.

TensorRT — graph capture's professional cousin

TensorRT (NVIDIA's inference compiler) sits one level up from CUDA Graphs. The classical TRT path takes an ONNX model (or its own builder API) and outputs a plan file: a pre-compiled, layer-by-layer optimised binary. (TRT-LLM, the transformer-specific superstructure, doesn't go through ONNX; it has its own Python builder that consumes Hugging Face / Megatron checkpoints and emits the same plan-file format.) The plan file records, for each layer, the pre-tuned kernel the builder selected for each supported input shape.

TensorRT-LLM is a higher-level TRT-based stack specifically for transformer LLMs. It adds custom CUDA kernels, persistent kernels, and per-shape autotuning.

TRT-LLM is what most NVIDIA reference benchmarks (MLPerf, etc.) run. The serving-time speedup over a well-tuned vLLM is usually 10–30% on H100, with the gap closing as vLLM picks up similar techniques.

3D · TensorRT layer plan, isometric

A TRT plan file is a stack of layers, and for each layer the builder picked one pre-tuned kernel per supported input shape. Below, the stack is shown isometrically; each floor is one layer; the cards in the floor are the per-shape kernel choices. Click a layer to see its plan entries — the bottom strip shows which kernel runs for each shape bucket.

TensorRT plan · isometric layer stack
Click a layer. Tilt with rotation slider. Card color = chosen kernel family (cuBLAS, FlashAttn, custom). The bottom panel shows the per-shape kernel selection table for the selected layer.
Readouts: selected layer, kernels in plan, picked kernel, vs cuBLAS default.
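The "one pre-tuned kernel per layer per shape" idea reduces to a lookup table built by timing candidates and keeping the fastest. The sketch below shows only that selection step; it is not TensorRT's API or plan-file format, and the layer names, kernel names, and timings are made up for illustration.

```python
def build_plan(timings):
    """Pick the fastest candidate kernel for every (layer, shape) pair.

    timings: {layer: {shape: {kernel_name: measured_us}}}
    returns: {layer: {shape: kernel_name}}
    """
    plan = {}
    for layer, per_shape in timings.items():
        plan[layer] = {shape: min(candidates, key=candidates.get)
                       for shape, candidates in per_shape.items()}
    return plan


# Made-up timings in microseconds, keyed by batch-size bucket (illustrative only).
timings = {
    "attn_qkv": {
        1: {"cublas_gemv": 9.0, "custom_fused": 6.5},
        8: {"cublas_gemm": 11.0, "custom_fused": 12.5},
    },
    "mlp_up": {
        1: {"cublas_gemv": 14.0, "custom_fused": 13.0},
        8: {"cublas_gemm": 15.0, "custom_fused": 19.0},
    },
}

plan = build_plan(timings)
for layer, per_shape in plan.items():
    for shape, kernel in per_shape.items():
        print(f"{layer} @ batch {shape}: {kernel}")
```

The expensive part, timing every candidate, happens once at build time; at serve time each layer only reads its entry from the table.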

Where the stack ends — and where the wins are

Mode | What it removes | Typical decode win on a 7B model
Eager PyTorch | nothing (baseline) | 1.0×
torch.compile default | Python overhead, some kernel-launch fusion | ~1.3–1.6×
torch.compile reduce-overhead (CUDA Graphs) | device-side bubble, residual host overhead | ~1.6–2.0×
vLLM (paged attn + Triton/CUDA kernels + CUDA Graphs) | same + paged KV + continuous batching | ~2.5–4× (at scale)
TensorRT-LLM (custom CUDA kernels + persistent kernels + autotuned per shape) | everything torch.compile can't | ~3–5× (at scale)

The relative wins shrink at higher batch sizes — once you're at batch=32 decoding, the per-launch overhead matters less because each launch is doing more work. The biggest wins from graph capture are at the latency-sensitive batch=1 chat regime.

The asymmetry between training and inference compile

Training compile and inference compile look similar but have opposite priorities:

What we walked through — the whole stack, top to bottom

  1. Lessons 01–10: the cluster-level layout — how the model is sharded across thousands of GPUs.
  2. Lessons 11–12: the inference-time layer — how the model is reassembled to serve users.
  3. Lesson 13: the PyTorch framework that lets you write the model in Python.
  4. Lesson 14: the precisions the math runs in.
  5. Lesson 15: the allocator that recycles memory between ops.
  6. Lesson 16: the fusion principle — kernels' boundaries are HBM round-trips.
  7. Lesson 17: Triton, the DSL that lets one engineer write a fused kernel.
  8. Lesson 18: torch.compile, which writes the fused kernels for you.
  9. Lesson 19 (here): CUDA Graphs and TensorRT, which capture the whole forward into one replayable artifact.

Every layer of this stack exists because the layer above wanted something — fewer launches, less overhead, less memory, more fusion, faster reads. Reading top-down, that's why each layer exists. Reading bottom-up, that's why each layer is shaped the way it is.

Interactive · the launch-overhead simulator

Pick a model (ops per forward), a batch, and the per-launch cost. The widget compares eager, compile-default, compile-reduce-overhead, and TRT-LLM-style. The point is: as ops-per-forward grows and per-launch work shrinks (i.e. you're in the decode regime), the gap between modes widens.

Decode latency · stack-by-stack
Approximate model: ops launches per token. Each launch costs py_us of host work and device_us of GPU launch overhead. Kernel time k_us per op runs on the GPU regardless.
Readouts: eager (ms/tok), compile (ms/tok), reduce-overhead (ms/tok), TRT-LLM (ms/tok).
Takeaway
Eager → compile saves host overhead. Compile → reduce-overhead saves device launch overhead. Reduce-overhead → TRT-LLM saves the kernels themselves (pre-tuned per shape, persistent kernels eliminating remaining bubbles). Each step removes a different cost. For latency-sensitive serving the stack matters; for throughput-sensitive batched workloads it matters less.
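The takeaway can be written down as a toy cost model. The sketch below is our own simplification, not the widget's actual formula: it assumes compile removes all host overhead and fuses nothing, that graph replay costs one launch for the whole forward, and that TRT-LLM-style kernels run at an assumed 0.8× of the baseline kernel time (`trt_kernel_scale`). The parameter names follow the ops, py_us, device_us, and k_us description of the simulator.

```python
def decode_latency_ms(ops, py_us, device_us, k_us, trt_kernel_scale=0.8):
    """Per-token latency in ms for each mode under a deliberately simple model."""
    one_submission = py_us + device_us            # a single launch for the whole graph
    modes_us = {
        "eager": ops * (py_us + device_us + k_us),
        "compile": ops * (device_us + k_us),      # assumes host overhead fully removed
        "reduce-overhead": ops * k_us + one_submission,
        # trt_kernel_scale is an ASSUMED speedup of the kernels themselves
        "trt-llm": ops * k_us * trt_kernel_scale + one_submission,
    }
    return {mode: us / 1000.0 for mode, us in modes_us.items()}


# Decode-like regime: many small kernels (all input values are illustrative).
small = decode_latency_ms(ops=300, py_us=7.0, device_us=1.5, k_us=20.0)
# Same launch count with 10x more work per kernel, as at large batch.
large = decode_latency_ms(ops=300, py_us=7.0, device_us=1.5, k_us=200.0)

for label, result in (("k_us=20", small), ("k_us=200", large)):
    ratio = result["eager"] / result["reduce-overhead"]
    print(label, {m: round(v, 3) for m, v in result.items()},
          f"eager/graph = {ratio:.2f}x")
```

With small kernels the per-launch terms dominate and the modes spread apart; with 10× more work per kernel the same absolute savings shrink to a few percent, which is the latency-versus-throughput point made above.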
Where this connects
This whole stack sits underneath the parallelism of Parts I–III. The rules don't change: TP, FSDP, EP, PP all still apply — but now you know what's happening on each GPU in addition to between them. Part V (lessons 20–28) takes us one step further down: what a CUDA kernel actually is, the hardware execution model, and how to write one from first principles.