torch.compile — Dynamo + AOT Autograd + Inductor
Three components, each with its own job. The first captures a graph out of your Python. The second attaches a backward pass. The third generates kernels. Most of "torch.compile is slower than I expected" is one of them silently falling back.
The three-stage pipeline
Stage 1 — Dynamo, the Python-bytecode tracer
Dynamo intercepts a function call at the Python bytecode level via CPython's frame-evaluation API (PEP 523, available since CPython 3.6; later CPython releases have changed the details of the hook). It symbolically executes the function's bytecode, with "FakeTensor" placeholders standing in for real tensors, and records the tensor operations into an FX graph. The result is two things:
- An FX graph — a clean DAG of PyTorch ops, no Python control flow inside.
- A set of guards — runtime predicates that, if violated by future inputs, invalidate the captured graph (e.g. "argument 0 is a CUDA bf16 tensor of shape (4, 4096)").
This is genius and constraining at once. Guards re-check shapes, dtypes, and devices on every call, so a cached graph is reused only while the inputs still match. What the tracer cannot capture is dynamic Python whose outcome depends on tensor values (data-dependent control flow, some runtime-shape branches). There Dynamo emits a graph break: it falls back to eager mode for the unsupported chunk, then resumes tracing afterwards. A function with three graph breaks compiles into up to four small graphs interleaved with eager code.
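To make the graph-plus-guards idea concrete, here is a minimal sketch that passes a custom backend to torch.compile. The backend only records the FX graphs Dynamo hands it and then runs them unchanged. The names `recording_backend`, `captured_graphs`, and `f` are illustrative, not part of PyTorch.

```python
import torch
import torch.fx

captured_graphs = []  # every FX graph Dynamo hands to the backend

def recording_backend(gm: torch.fx.GraphModule, example_inputs):
    # A Dynamo backend receives the captured FX graph and example inputs
    # and returns a callable. Returning gm.forward runs the graph unchanged.
    captured_graphs.append(gm)
    return gm.forward

def f(x, y):
    return torch.relu(x @ y) + 1.0

compiled_f = torch.compile(f, backend=recording_backend)

x, y = torch.randn(4, 8), torch.randn(8, 4)
out = compiled_f(x, y)                  # first call: trace and compile
compiled_f(x, y)                        # same shapes and dtypes: guards pass, cache hit
graphs_after_float32 = len(captured_graphs)

compiled_f(x.double(), y.double())      # dtype guard fails: Dynamo recompiles
graphs_after_float64 = len(captured_graphs)

print(captured_graphs[0].graph)         # the ops recorded for the float32 call
print(graphs_after_float32, graphs_after_float64)
```

The second float32 call reuses the cached graph, while the float64 call violates a dtype guard and produces a second graph. Printing the graph shows plain op nodes with no Python control flow.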
Common causes of graph breaks:
- .item() or .tolist(): pulls a tensor value into Python, makes downstream control flow data-dependent.
- print(), logging, custom assertions on tensors.
- Python-level iteration over tensor data.
- tensor.shape[0] used as a Python int in an arithmetic chain (Dynamo's "dynamic shapes" mode handles many of these now, but not all).
- Calls into unsupported Python libraries.
Diagnose with TORCH_LOGS="graph_breaks". Often you can refactor a single .item() away and a 12-graph trace becomes a 1-graph trace.
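The .item() refactor can be sketched as follows. The example counts the graphs Dynamo produces for a function that branches on a tensor value in Python, then for a rewrite that keeps the branch inside a tensor op. `make_counting_backend`, `with_item`, and `with_where` are hypothetical names for this illustration, and the sketch assumes Dynamo's default settings, under which .item() is not captured into the graph.

```python
import torch

def make_counting_backend(counter):
    # Hypothetical helper: builds a backend that records each graph it receives.
    def backend(gm, example_inputs):
        counter.append(gm)
        return gm.forward
    return backend

def with_item(x):
    y = x * 2
    if y.sum().item() > 0:      # tensor value pulled into Python: graph break
        return y + 1
    return y - 1

def with_where(x):
    y = x * 2
    # The same decision expressed as a tensor op stays inside the graph.
    return torch.where(y.sum() > 0, y + 1, y - 1)

x = torch.ones(8)

item_graphs, where_graphs = [], []
out_item = torch.compile(with_item, backend=make_counting_backend(item_graphs))(x)
out_where = torch.compile(with_where, backend=make_counting_backend(where_graphs))(x)

print(len(item_graphs), len(where_graphs))
```

Both versions return the same values, but the torch.where version computes both branches on every call. That is usually cheap for pointwise work and worth checking for anything heavier.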
Animated · Dynamo graph capture, line by line
Watch Dynamo read Python code from left to right, extracting nodes into the FX graph on the right. When it hits a .item() or other unsupported op, the graph breaks: the partial graph is sealed, eager Python runs for a step, then a fresh graph starts. Pick a workload to see different break patterns.
Stage 2 — AOT Autograd
Dynamo's FX graph is forward only. To compile a training step we also need backward. AOT (Ahead-Of-Time) Autograd takes the forward graph, runs PyTorch's autograd machinery once at compile time to trace a joint forward-and-backward graph, partitions that into a forward graph and a backward graph, and hands both to Inductor.
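One way to look at the graphs AOT Autograd derives is to call it directly with compilers that only record what they receive. This is a sketch that assumes the `functorch.compile` helpers `aot_function` and `make_boxed_func` shipped with PyTorch 2.x. `make_compiler`, `graphs`, and `f` are illustrative names.

```python
import torch
from functorch.compile import aot_function, make_boxed_func

graphs = {}

def make_compiler(name):
    # Hypothetical helper: records the graph AOT Autograd passes in, then runs it as-is.
    def compiler(gm, example_inputs):
        graphs[name] = gm
        return make_boxed_func(gm.forward)
    return compiler

def f(x, w):
    return torch.tanh(x @ w).sum()

aot_f = aot_function(
    f,
    fw_compiler=make_compiler("forward"),
    bw_compiler=make_compiler("backward"),
)

x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 3, requires_grad=True)

aot_f(x, w).backward()          # the backward graph is handed over on first backward
grad_x_aot = x.grad.clone()

x.grad, w.grad = None, None
f(x, w).backward()              # eager reference for comparison
grad_x_eager = x.grad.clone()

print(graphs["forward"].code)
print(graphs["backward"].code)
```

The printed backward code takes the saved forward intermediates as inputs, which makes it easy to see which activations are kept alive between the two passes.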
AOT Autograd also decomposes high-level ops into a smaller set of "primitive" ops. For example, aten::softmax decomposes into max, sub, exp, sum, div. The decomposition exposes fusion opportunities that the original opaque op hides. This is why torch.compile often beats eager softmax: the decomposed primitives can be fused into one Triton kernel and chained with adjacent pointwise ops (dropout, scale, residual). They do not get fused into a matmul that runs in cuBLAS, so on that path the matmul output still makes a round-trip through HBM before the next pointwise op. Inductor's epilogue-fusion path can avoid that round-trip, but only when Inductor generates the matmul itself from a Triton template, as it may under max-autotune.
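The softmax decomposition can be written out by hand to check that the five primitives reproduce the fused op. This is a sketch of the arithmetic only, not the decomposition table PyTorch actually registers. `softmax_decomposed` is a hypothetical name.

```python
import torch

def softmax_decomposed(x, dim=-1):
    x_max = torch.amax(x, dim=dim, keepdim=True)   # max (for numerical stability)
    shifted = x - x_max                            # sub
    exps = torch.exp(shifted)                      # exp
    total = exps.sum(dim=dim, keepdim=True)        # sum
    return exps / total                            # div

x = torch.randn(4, 16)
reference = torch.softmax(x, dim=-1)
decomposed = softmax_decomposed(x)

print((reference - decomposed).abs().max().item())
```

Run in eager mode, each of the five lines is a separate kernel with its own trip through memory. That is the cost fusion removes once the ops are visible to the compiler as a chain.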
AOT Autograd also handles the functionalization step: in-place ops (add_, copy_) are rewritten as out-of-place versions, so the downstream pipeline can reason about a pure data-flow graph. This is one of the more invasive transformations and occasionally causes subtle bugs in code that depended on in-place semantics.
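Functionalization can be observed in isolation with `torch.func.functionalize` and `make_fx`, which trace a function to ATen ops with and without the rewrite. This is a minimal sketch. `f` and `op_names` are illustrative names, and the example mutates an intermediate rather than an input to keep the traced graph simple.

```python
import torch
from torch.func import functionalize
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    y = x.clone()
    y.add_(1)          # in-place update of an intermediate
    return y * 2

def op_names(gm):
    # Collect the ATen op names recorded in a traced graph.
    return [str(node.target) for node in gm.graph.nodes if node.op == "call_function"]

x = torch.randn(4)

plain_ops = op_names(make_fx(f)(x))                       # keeps the in-place add_
functional_ops = op_names(make_fx(functionalize(f))(x))   # add_ rewritten out of place

print(plain_ops)
print(functional_ops)
```

The functionalized trace computes the same values as the original function but contains no in-place op, which is what lets later stages treat the graph as pure data flow.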
2D · AOT autograd flow — click an op for its backward partner
The forward graph is on top, the backward graph mirrors it on the bottom (in reverse order). Click any forward op to see which backward op it generates and which intermediate tensors get saved. The "saved" tensors are the activations — they live in HBM between forward and backward and dominate the activation memory budget.
Stage 3 — Inductor, the codegen
Inductor takes a pure FX graph and outputs:
- Triton kernels for elementwise chains, reductions, pointwise+reduction fusions. These are autotuned at first call (per lesson 17).
- C++ kernels for CPU paths.
- Calls into cuBLAS / cuDNN for matmul and conv — Inductor doesn't try to outperform the vendor library.
- An orchestrator that ties them together with explicit allocation and buffer reuse.
Inductor's three big optimisations:
- Vertical fusion. Chains of elementwise / pointwise ops become a single Triton kernel — the lesson-16 fusion case.
- Horizontal fusion. Independent ops with the same shape can share a kernel launch.
- Memory planning. The compiler knows the lifetime of every tensor and reuses buffers, avoiding allocator churn for known intermediates. This is one of the larger wins in practice.
The four modes of torch.compile
| Mode | What it adds | Use case |
|---|---|---|
| "default" | Dynamo + AOT Autograd + Inductor | Most training. 1.3–2× typical speedup. |
| "reduce-overhead" | Default + CUDA Graphs wrapping (lesson 19) | Decode-heavy inference. Drops launch overhead. |
| "max-autotune" | Default + exhaustive autotuning of matmul shapes, plus CUDA Graphs on GPU | Long training runs that justify minutes of compile time. |
| "max-autotune-no-cudagraphs" | max-autotune but skip CUDA Graph capture | Dynamic-shape workloads. |
Three failure modes, in order of how often they bite
- Graph breaks. A 100-line model with one .item() deep in it gets cut into many small graphs. The Python tax you wanted to avoid is back. Diagnose with TORCH_LOGS="graph_breaks"; fix by removing data-dependent Python.
- Recompilation. A new shape, dtype, or device combination that fails the existing guards triggers a fresh compile. By default Dynamo has a recompile budget (typically 8 per call site); past that it gives up and falls back to eager. Diagnose with TORCH_LOGS="recompiles"; fix by using dynamic=True or by padding inputs to a fixed shape.
- Inductor fallbacks. Some ops (sparse, custom, weird-stride) aren't supported by Inductor and become a call into the op's eager kernel. Doesn't break correctness but kills the fusion win. Diagnose with TORCH_LOGS="output_code".
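The recompilation case can be reproduced in a few lines by counting how often the backend is invoked as the input length changes. This is a sketch with a recording backend instead of Inductor, so it runs without a GPU or compiler toolchain. `count_compiles` is a hypothetical helper, and `dynamic=False` is used to force shape specialisation on every call.

```python
import torch
import torch._dynamo

def count_compiles(dynamic):
    # Hypothetical helper: counts backend invocations across three input lengths.
    torch._dynamo.reset()            # clear graphs cached by earlier runs
    compiles = []

    def backend(gm, example_inputs):
        compiles.append(gm)
        return gm.forward

    def f(x):
        return torch.sin(x) * 2 + 1

    compiled_f = torch.compile(f, backend=backend, dynamic=dynamic)
    for n in (4, 8, 16):             # three different input lengths
        compiled_f(torch.randn(n))
    return len(compiles)

static_count = count_compiles(dynamic=False)   # one compile per distinct shape
dynamic_count = count_compiles(dynamic=True)   # one shape-generic compile

print(static_count, dynamic_count)
```

With a real backend each extra compile costs seconds rather than microseconds, which is why a varying sequence length in a training loop shows up as repeated step-time spikes.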
2D · failure-modes timeline
A simulated training loop. Step time is plotted across 200 steps; each failure mode can be toggled on. Watch the spikes: graph breaks add a small constant overhead per step; recompilations land a huge spike every time a new shape is seen; fallbacks add per-step overhead that doesn't recover.
Compile cost — when is it worth it?
First-call compile time is real: Dynamo trace + AOT decomposition + Inductor codegen + autotuning often adds 10–60 seconds to the first forward, then per-shape recompiles on top. The breakeven against eager is somewhere around 1000 steps for training and ~100 prompts for inference (since each saves a few ms).
If you're doing a long training run, compile pays off in the first hour. If you're doing one-off research code, the compile time can dominate. PyTorch's compile cache (on-disk by default since 2.5) makes the second run of the same model much faster.
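The breakeven arithmetic is simple enough to write down. The numbers below are assumptions chosen for illustration (a 30-second compile and hypothetical step times), not measurements. `breakeven_steps` is a hypothetical helper.

```python
import math

def breakeven_steps(compile_seconds, eager_step_ms, compiled_step_ms):
    # Steps needed before the time saved per step repays the one-time compile cost.
    saved_ms = eager_step_ms - compiled_step_ms
    if saved_ms <= 0:
        return math.inf              # compiled is no faster: never pays off
    return math.ceil(compile_seconds * 1000 / saved_ms)

# Assumed: 30 s compile, 100 ms eager step, 70 ms compiled step (30 ms saved).
large_saving = breakeven_steps(compile_seconds=30, eager_step_ms=100, compiled_step_ms=70)

# Assumed: same compile cost, but only 3 ms saved per step.
small_saving = breakeven_steps(compile_seconds=30, eager_step_ms=100, compiled_step_ms=97)

print(large_saving, small_saving)
```

Under these assumptions the breakeven moves from 1,000 steps to 10,000 steps when the per-step saving shrinks tenfold, so it is worth measuring the actual saving before deciding. Recompiles add to the compile side of the ledger in the same way.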
Where torch.compile beats hand-written CUDA — and where it doesn't
- Beats hand-written when the win is "fuse this long chain of pointwise ops." A 20-op decode-norm path collapsed to one Triton kernel is a thing humans rarely write but compilers reliably produce.
- Beats hand-written when memory planning is the bottleneck. Compilers can reuse buffers a human author would forget.
- Loses to hand-written when the fusion crosses a matmul (FlashAttention-class). Inductor doesn't yet generate fused matmul + custom-softmax kernels at FlashAttention quality.
- Loses to hand-written when the kernel needs hardware-specific tricks: warp specialisation on Hopper, persistent kernels, async copy double-buffering. Triton-via-Inductor can't reach those reliably.
Inference vs training compile
For training, the AOT Autograd path matters: backward is compiled too, and the activation-saving optimisation (the partitioner that splits forward from backward decides what to save and what to recompute based on cost) is a real lever. For inference, you usually want mode="reduce-overhead": the same compile path plus CUDA Graphs around the compiled forward. The graph-break problem is worse at inference (decoder loops have inherent Python in them), but the one-time compile cost is amortised over thousands of tokens.
Interactive · what does the compile pipeline keep, fuse, fall back?
Pick a forward pass shape and trace it through Dynamo's eyes: which calls become graph breaks (and why), which decompose, which fuse. The widget is a simulation of how a few common motifs land in the pipeline — illustrative, not exhaustive.
torch.compile is three stages: capture, derive backward, codegen. Each can fall back. Most of the speedup comes from Inductor's vertical fusion + memory planning; the rest from CUDA Graphs in reduce-overhead. The job of the engineer is to keep the graph clean — no .item(), no Python control flow on tensor values, no exotic ops — and let the compiler do its work.