From PyTorch op to kernel launch

A single Python statement traverses six or seven layers before a kernel actually runs. Most "PyTorch is slow" surprises live in that traversal, not in the kernel. This lesson walks the path end-to-end so you know exactly where time can hide.

The question this lesson answers

You write y = x @ w. Some time later, an H100 SM finishes a tensor-core mma.sync. What happened in between? And why is that path the reason eager mode is sometimes 5× slower than compiled — even though the same mma runs in the end?

The short answer: PyTorch is a thin Python facade over a large C++ dispatcher that picks the right kernel for the right device, dtype, layout, and autograd state. That dispatch is fast (microseconds) but it runs on every op. When ops are large, that overhead vanishes. When ops are small or many, it dominates.

The dispatch chain

is the runtime.

What each layer actually does

Layer	Job	Typical cost
Python bytecode	Resolve `__matmul__`, push args, return.	~1 µs
pybind11 boundary (the Python ↔ C++ binding library used by torch)	Marshal Python objects into C++ tensors.	~1 µs
Autograd dispatch (the VariableType layer, named for the auto-generated C++ wrappers under `torch/csrc/autograd/generated/`)	If `requires_grad`, record an autograd node for backward; bump version counters.	~1–3 µs
ATen dispatcher	Look up `(op_name, dispatch_key)` → function pointer. The dispatch key encodes device, dtype, layout, autograd state.	~1–2 µs
Backend (cuBLAS / CUTLASS / custom)	Pick an algorithm by shape; allocate workspace.	~1–5 µs (cached after first call)
Driver (`cudaLaunchKernel`)	Enqueue grid+block+args onto a stream.	~1–3 µs on CUDA 12+ drivers
GPU command queue	Dispatch blocks to SMs.	~ns to µs
SM execution	What lessons 01–09 covered.	µs to ms, the actual math

The crucial sum is the "before kernel" total: ~5–25 µs of CPU-side work per op. That cost is fixed regardless of how big the kernel is. CUDA graphs (lesson 15) erase most of it by replaying the entire chain in one launch.

Principle

Every PyTorch op pays a fixed CPU-side dispatch tax of roughly 5–25 µs. Make your kernels larger (one big matmul beats ten small ones) or fewer (fuse pointwise chains, capture into a CUDA graph). The roofline from lesson 01 + this tax explains every "eager is slow" report.

Why eager mode is sometimes 5× slower

A transformer layer in eager mode might launch 80–200 separate kernels per token. At 15 µs of CPU-side dispatch per launch, that is 1.2–3.0 ms of pure Python+driver overhead per token — comparable to the GPU math itself at small batch. The compiler's job (lesson 23) is to fuse many ops into fewer kernels and capture the launch sequence into a graph.

# eager: every op dispatches separately
h = norm(x)              # launch 1: layernorm
qkv = lin(h)             # launch 2: QKV gemm
q, k, v = qkv.chunk(3, -1)   # launches 3,4,5 (views — no kernels in this case)
attn = sdpa(q, k, v)         # launch 6: attention (might be many internal launches)
o = lin_o(attn)              # launch 7: out proj
y = x + o                    # launch 8: add
y = norm2(y)                 # launch 9: layernorm
...
# total: maybe 12 kernels for one block.

# compiled: torch.compile fuses adjacent pointwise ops, captures shapes,
# and may replay through CUDA graphs. Often 2–4 kernels for the same block.

Sync points — the silent latency multipliers

CPU and GPU run asynchronously. The CPU queues work; the GPU executes when it gets to it. A synchronization point forces the CPU to wait until the GPU catches up. Every sync point converts hidden latency into visible latency.

Operation	Forces sync?	Why
`x.item()` / `float(x)` / `int(x)`	Yes	Need value on CPU now.
`x.cpu()` / `x.numpy()`	Yes	Data must move to host.
`print(x)`	Yes	Implicit `.cpu()` to format.
`if x.any():` / data-dependent control	Yes	Branch depends on tensor value.
`x.size()` / `x.shape` in pure code	No	Shapes are metadata, not data.
`x.nonzero()` / `x.unique()`	Yes	Output shape depends on tensor values.
`x.tolist()` / `torch.save(x)`	Yes	Need data on CPU.
Allocations of new tensors	No (usually)	Caching allocator returns cached blocks.
`torch.cuda.synchronize()`	Yes (explicit)	Drains the stream.
`y = x + 1` where `y` depends on prior op	No	Just queues another op on the same stream.

A common bug: a debug print(loss.item()) deep in a training loop forces a sync every iteration. The sync flushes the launch queue and serializes everything — turning a 50 ms step into a 200 ms step with no obvious cause.

What "fused" really means at this layer

Three different fusions happen at three different levels of the dispatch chain. Distinguishing them matters when you read flame graphs:

Op-level fusion (Inductor / Triton): Several PyTorch ops compile into one kernel. One dispatch, one launch. Done by torch.compile and hand-written Triton (lesson 22).
Kernel-internal fusion (e.g. fused epilogue): A GEMM's tail (bias add, activation) is folded into the GEMM kernel itself by cuBLAS / CUTLASS. One launch, one kernel — but the kernel does more work.
Graph fusion (CUDA graphs): Many kernels are captured and replayed as a single launch (lesson 15). The kernels are unchanged; the dispatch chain is bypassed.

Tools to see this

Three commands take you from "I have no idea where time goes" to "I can name the kernel and its caller." Lesson 20 expands each one, but the headline distinction is:

torch.profiler — shows Python ops + their CUDA kernels, side-by-side. Best when the question is "which PyTorch line is slow."
NVIDIA nsys (Nsight Systems) — system-wide timeline including CPU dispatch, CUDA streams, NVLink, NCCL. Best when the question is "where are the gaps."
NVIDIA ncu (Nsight Compute) — per-kernel deep dive: occupancy, memory throughput, warp stalls. Best when the question is "this kernel is slow, why."

Interactive · CPU dispatch budget

Set ops-per-step, per-op dispatch cost, and per-op GPU work. The widget tells you whether the step is CPU-bound (dispatch overhead) or GPU-bound (kernel work). This is the back-of-envelope for "is torch.compile going to help me."

What this gives you for the next lesson

You now know that "GPU code runs slowly" can mean three completely different things — slow dispatch, slow kernel, or slow sync. The next lesson opens a real fused kernel and walks through what's inside, so you can read the things the dispatcher routes to. After that, lessons 20 and 21 cover the tools that tell you which of the three is actually your problem.