From PyTorch op to kernel launch
A single Python statement traverses six or seven layers before a kernel actually runs. Most "PyTorch is slow" surprises live in that traversal, not in the kernel. This lesson walks the path end-to-end so you know exactly where time can hide.
The question this lesson answers
You write y = x @ w. Some time later, an H100 SM finishes a tensor-core mma.sync. What happened in between? And why is that path the reason eager mode is sometimes 5× slower than compiled — even though the same mma runs in the end?
The short answer: PyTorch is a thin Python facade over a large C++ dispatcher that picks the right kernel for the right device, dtype, layout, and autograd state. That dispatch is fast (microseconds) but it runs on every op. When ops are large, that overhead vanishes. When ops are small or many, it dominates.
The dispatch chain
What each layer actually does
| Layer | Job | Typical cost |
|---|---|---|
| Python bytecode | Resolve __matmul__, push args, return. | ~1 µs |
| pybind11 boundary (the Python ↔ C++ binding library used by torch) | Marshal Python objects into C++ tensors. | ~1 µs |
Autograd dispatch (the VariableType layer, named for the auto-generated C++ wrappers under torch/csrc/autograd/generated/) | If requires_grad, record an autograd node for backward; bump version counters. | ~1–3 µs |
| ATen dispatcher | Look up (op_name, dispatch_key) → function pointer. The dispatch key encodes device, dtype, layout, autograd state. | ~1–2 µs |
| Backend (cuBLAS / CUTLASS / custom) | Pick an algorithm by shape; allocate workspace. | ~1–5 µs (cached after first call) |
Driver (cudaLaunchKernel) | Enqueue grid+block+args onto a stream. | ~1–3 µs on CUDA 12+ drivers |
| GPU command queue | Dispatch blocks to SMs. | ~ns to µs |
| SM execution | What lessons 01–09 covered. | µs to ms, the actual math |
The crucial sum is the "before kernel" total: ~5–25 µs of CPU-side work per op. That cost is fixed regardless of how big the kernel is. CUDA graphs (lesson 15) erase most of it by replaying the entire chain in one launch.
Why eager mode is sometimes 5× slower
A transformer layer in eager mode might launch 80–200 separate kernels per token. At 15 µs of CPU-side dispatch per launch, that is 1.2–3.0 ms of pure Python+driver overhead per token — comparable to the GPU math itself at small batch. The compiler's job (lesson 23) is to fuse many ops into fewer kernels and capture the launch sequence into a graph.
# eager: every op dispatches separately
h = norm(x) # launch 1: layernorm
qkv = lin(h) # launch 2: QKV gemm
q, k, v = qkv.chunk(3, -1) # launches 3,4,5 (views — no kernels in this case)
attn = sdpa(q, k, v) # launch 6: attention (might be many internal launches)
o = lin_o(attn) # launch 7: out proj
y = x + o # launch 8: add
y = norm2(y) # launch 9: layernorm
...
# total: maybe 12 kernels for one block.
# compiled: torch.compile fuses adjacent pointwise ops, captures shapes,
# and may replay through CUDA graphs. Often 2–4 kernels for the same block.
Sync points — the silent latency multipliers
CPU and GPU run asynchronously. The CPU queues work; the GPU executes when it gets to it. A synchronization point forces the CPU to wait until the GPU catches up. Every sync point converts hidden latency into visible latency.
| Operation | Forces sync? | Why |
|---|---|---|
x.item() / float(x) / int(x) | Yes | Need value on CPU now. |
x.cpu() / x.numpy() | Yes | Data must move to host. |
print(x) | Yes | Implicit .cpu() to format. |
if x.any(): / data-dependent control | Yes | Branch depends on tensor value. |
x.size() / x.shape in pure code | No | Shapes are metadata, not data. |
x.nonzero() / x.unique() | Yes | Output shape depends on tensor values. |
x.tolist() / torch.save(x) | Yes | Need data on CPU. |
| Allocations of new tensors | No (usually) | Caching allocator returns cached blocks. |
torch.cuda.synchronize() | Yes (explicit) | Drains the stream. |
y = x + 1 where y depends on prior op | No | Just queues another op on the same stream. |
A common bug: a debug print(loss.item()) deep in a training loop forces a sync every iteration. The sync flushes the launch queue and serializes everything — turning a 50 ms step into a 200 ms step with no obvious cause.
What "fused" really means at this layer
Three different fusions happen at three different levels of the dispatch chain. Distinguishing them matters when you read flame graphs:
- Op-level fusion (Inductor / Triton): Several PyTorch ops compile into one kernel. One dispatch, one launch. Done by
torch.compileand hand-written Triton (lesson 22). - Kernel-internal fusion (e.g. fused epilogue): A GEMM's tail (bias add, activation) is folded into the GEMM kernel itself by cuBLAS / CUTLASS. One launch, one kernel — but the kernel does more work.
- Graph fusion (CUDA graphs): Many kernels are captured and replayed as a single launch (lesson 15). The kernels are unchanged; the dispatch chain is bypassed.
Tools to see this
Three commands take you from "I have no idea where time goes" to "I can name the kernel and its caller." Lesson 20 expands each one, but the headline distinction is:
torch.profiler— shows Python ops + their CUDA kernels, side-by-side. Best when the question is "which PyTorch line is slow."- NVIDIA
nsys(Nsight Systems) — system-wide timeline including CPU dispatch, CUDA streams, NVLink, NCCL. Best when the question is "where are the gaps." - NVIDIA
ncu(Nsight Compute) — per-kernel deep dive: occupancy, memory throughput, warp stalls. Best when the question is "this kernel is slow, why."
Interactive · CPU dispatch budget
Set ops-per-step, per-op dispatch cost, and per-op GPU work. The widget tells you whether the step is CPU-bound (dispatch overhead) or GPU-bound (kernel work). This is the back-of-envelope for "is torch.compile going to help me."
What this gives you for the next lesson
You now know that "GPU code runs slowly" can mean three completely different things — slow dispatch, slow kernel, or slow sync. The next lesson opens a real fused kernel and walks through what's inside, so you can read the things the dispatcher routes to. After that, lessons 20 and 21 cover the tools that tell you which of the three is actually your problem.