all_lessons/gpu_kernel_serving/18 · pytorch → kernellesson 18 / 24

From PyTorch op to kernel launch

A single Python statement traverses six or seven layers before a kernel actually runs. Most "PyTorch is slow" surprises live in that traversal, not in the kernel. This lesson walks the path end-to-end so you know exactly where time can hide.

The question this lesson answers

You write y = x @ w. Some time later, an H100 SM finishes a tensor-core mma.sync. What happened in between? And why is that path the reason eager mode is sometimes 5× slower than compiled — even though the same mma runs in the end?

The short answer: PyTorch is a thin Python facade over a large C++ dispatcher that picks the right kernel for the right device, dtype, layout, and autograd state. That dispatch is fast (microseconds) but it runs on every op. When ops are large, that overhead vanishes. When ops are small or many, it dominates.

The dispatch chain

Python expressiony = x @ w1 µs · CPython bytecode Tensor __matmul__pybind11 boundaryPython → C++ Autograd dispatchif requires_grad: build nodeVariableType layer ATen dispatchkey: CUDA/CPU, dtype, layoutresolves to kernel fn ptr cuBLASbackendpicks algo cudaLaunchKerneldriver enqueues~1–3 µs CPU GPU command queueDMA, schedulerasync vs CPU grid of blocksscheduled onto SMslatency-hide via warps SM executiontile, mma, write-backyour part (lessons 01–09) Top row is CPU; bottom row is GPU; both rows read left-to-right. Asynchrony lets them overlap — until something forces a sync. Total CPU-side dispatch for one bf16 matmul on H100: ~5–25 µs. For a 4096×4096 GEMM that is <1 % of the runtime. For a tiny pointwise on B=1, that is the runtime.

What each layer actually does

LayerJobTypical cost
Python bytecodeResolve __matmul__, push args, return.~1 µs
pybind11 boundary (the Python ↔ C++ binding library used by torch)Marshal Python objects into C++ tensors.~1 µs
Autograd dispatch (the VariableType layer, named for the auto-generated C++ wrappers under torch/csrc/autograd/generated/)If requires_grad, record an autograd node for backward; bump version counters.~1–3 µs
ATen dispatcherLook up (op_name, dispatch_key) → function pointer. The dispatch key encodes device, dtype, layout, autograd state.~1–2 µs
Backend (cuBLAS / CUTLASS / custom)Pick an algorithm by shape; allocate workspace.~1–5 µs (cached after first call)
Driver (cudaLaunchKernel)Enqueue grid+block+args onto a stream.~1–3 µs on CUDA 12+ drivers
GPU command queueDispatch blocks to SMs.~ns to µs
SM executionWhat lessons 01–09 covered.µs to ms, the actual math

The crucial sum is the "before kernel" total: ~5–25 µs of CPU-side work per op. That cost is fixed regardless of how big the kernel is. CUDA graphs (lesson 15) erase most of it by replaying the entire chain in one launch.

Principle
Every PyTorch op pays a fixed CPU-side dispatch tax of roughly 5–25 µs. Make your kernels larger (one big matmul beats ten small ones) or fewer (fuse pointwise chains, capture into a CUDA graph). The roofline from lesson 01 + this tax explains every "eager is slow" report.

Why eager mode is sometimes 5× slower

A transformer layer in eager mode might launch 80–200 separate kernels per token. At 15 µs of CPU-side dispatch per launch, that is 1.2–3.0 ms of pure Python+driver overhead per token — comparable to the GPU math itself at small batch. The compiler's job (lesson 23) is to fuse many ops into fewer kernels and capture the launch sequence into a graph.

# eager: every op dispatches separately
h = norm(x)              # launch 1: layernorm
qkv = lin(h)             # launch 2: QKV gemm
q, k, v = qkv.chunk(3, -1)   # launches 3,4,5 (views — no kernels in this case)
attn = sdpa(q, k, v)         # launch 6: attention (might be many internal launches)
o = lin_o(attn)              # launch 7: out proj
y = x + o                    # launch 8: add
y = norm2(y)                 # launch 9: layernorm
...
# total: maybe 12 kernels for one block.

# compiled: torch.compile fuses adjacent pointwise ops, captures shapes,
# and may replay through CUDA graphs. Often 2–4 kernels for the same block.

Sync points — the silent latency multipliers

CPU and GPU run asynchronously. The CPU queues work; the GPU executes when it gets to it. A synchronization point forces the CPU to wait until the GPU catches up. Every sync point converts hidden latency into visible latency.

OperationForces sync?Why
x.item() / float(x) / int(x)YesNeed value on CPU now.
x.cpu() / x.numpy()YesData must move to host.
print(x)YesImplicit .cpu() to format.
if x.any(): / data-dependent controlYesBranch depends on tensor value.
x.size() / x.shape in pure codeNoShapes are metadata, not data.
x.nonzero() / x.unique()YesOutput shape depends on tensor values.
x.tolist() / torch.save(x)YesNeed data on CPU.
Allocations of new tensorsNo (usually)Caching allocator returns cached blocks.
torch.cuda.synchronize()Yes (explicit)Drains the stream.
y = x + 1 where y depends on prior opNoJust queues another op on the same stream.

A common bug: a debug print(loss.item()) deep in a training loop forces a sync every iteration. The sync flushes the launch queue and serializes everything — turning a 50 ms step into a 200 ms step with no obvious cause.

What "fused" really means at this layer

Three different fusions happen at three different levels of the dispatch chain. Distinguishing them matters when you read flame graphs:

Tools to see this

Three commands take you from "I have no idea where time goes" to "I can name the kernel and its caller." Lesson 20 expands each one, but the headline distinction is:

Interactive · CPU dispatch budget

Set ops-per-step, per-op dispatch cost, and per-op GPU work. The widget tells you whether the step is CPU-bound (dispatch overhead) or GPU-bound (kernel work). This is the back-of-envelope for "is torch.compile going to help me."

Where does the step actually spend its CPU time?

Eager mode pays per-op dispatch; compiled mode fuses ops down to fewer kernels. A B200 + tiny model is launch-bound; a 70B + H100 + large batch is GPU-bound.

What this gives you for the next lesson

You now know that "GPU code runs slowly" can mean three completely different things — slow dispatch, slow kernel, or slow sync. The next lesson opens a real fused kernel and walks through what's inside, so you can read the things the dispatcher routes to. After that, lessons 20 and 21 cover the tools that tell you which of the three is actually your problem.