Performant PyTorch patterns & synthesis

Compile, Triton, and profilers fix the kernels. This lesson fixes everything around the kernels: memory format, allocator pressure, async data movement, mixed precision, and the half-dozen tiny patterns that decide whether a healthy kernel actually runs healthy in your program.

The question this lesson answers

Your kernels are fast. The compiler is on. The profile says GPU utilization is 65 %. Where is the other 35 % going? Usually it's PyTorch-level: a host stall in the dataloader, an allocator hiccup, a needless contiguous(), an autograd graph held longer than it should be. This lesson is a checklist for the patterns that move that 35 %.

Mental model for the whole track
Every performance fix in this track is exactly one of three things: fewer bytes moved, more useful bytes on chip, or less work between kernels. The seven habits below are the third category, applied at the PyTorch layer.

The seven habits

1 · async H2D / D2H: pin_memory + non_blocking
2 · allocator: caching, fragmentation, OOM
3 · contiguous discipline: view vs copy, strides
4 · mixed precision: bf16 / fp8, master copy
5 · autograd hygiene: grad clears, retain_graph, hooks
6 · dataloader overlap: workers, prefetch, sharding
7 · sync discipline: .item(), print, logging
+ streams (advanced): compute/comm overlap

The bridge
All seven are about keeping the GPU fed without surprising it. Lessons 18–23 made each kernel fast. This lesson keeps the spaces between kernels fast, so the timeline you saw in lesson 20 actually has no gaps.

1 · Async host↔device transfer

Copying a CPU tensor to GPU is two steps: pin it (so DMA can grab it without an extra copy), then issue an async transfer. Both have to be set up correctly or you get a sync.

from torch.utils.data import DataLoader

# dataloader side: pin host memory
loader = DataLoader(dataset, batch_size=B, num_workers=4, pin_memory=True)

# transfer side: non-blocking copy
for batch in loader:
    x = batch['x'].to('cuda', non_blocking=True)   # async
    y = batch['y'].to('cuda', non_blocking=True)
    # do GPU work concurrently with the next H2D
    out = model(x)
    loss = loss_fn(out, y)
    loss.backward()

Without pin_memory=True the runtime makes a hidden synchronous copy through a pinned staging buffer. Without non_blocking=True the call waits for completion. The two together overlap H2D with compute.
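
For tensors that do not come from a DataLoader, the same two steps can be done by hand. The following is a minimal sketch, not a definitive implementation: the helper name `to_device_async` is hypothetical, and the code falls back to a plain copy when no CUDA device is present.

```python
import torch

def to_device_async(t: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Copy a CPU tensor to `device`, letting the host run ahead when possible."""
    if device.type != "cuda":
        return t.to(device)
    if not t.is_pinned():
        # Page-locking has a cost of its own; in a real loop, pin once
        # and reuse the buffer rather than pinning every batch.
        t = t.pin_memory()
    # Enqueued on the current stream; the host does not wait for completion.
    return t.to(device, non_blocking=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = to_device_async(torch.randn(256, 256), device)
y = x @ x  # launched on the same stream as the copy, so it runs after the copy
```

Kernels issued on the same stream are ordered after the copy, so no explicit synchronization is needed before using `x` on the GPU.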

2 · The caching allocator, and how to not anger it

PyTorch does not return freed memory to the driver between operations. It caches freed blocks and reuses them. The failure modes to watch for are fragmentation and OOM; these tools show what the allocator is doing:

Tool | What it tells you
torch.cuda.memory_allocated() | Bytes the program is using.
torch.cuda.memory_reserved() | Bytes the allocator currently holds from the driver, in use or cached. For the high-water mark, use torch.cuda.max_memory_reserved().
torch.cuda.memory_summary() | Per-pool stats, allocations, frees.
torch.cuda.memory._record_memory_history() + _dump_snapshot() | Replayable timeline of every alloc/free with stack traces. Open in pytorch.org/memory_viz.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | Allocator strategy that reduces fragmentation for variable shapes.

Practical rules: avoid creating long-lived lists of tensors you no longer need; del them or rebind. Use torch.no_grad() in inference paths so activations aren't held for backward. Beware giant model-init transients (some checkpoint loaders briefly hold 2× the model weight in memory).
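
One way to watch the gap between what the program uses and what the allocator holds is to log both counters together. This is a rough sketch under stated assumptions: the helper name `allocator_report` and the `unused_fraction` field are illustrative, and a large unused fraction can mean fragmentation or simply a warm cache.

```python
import torch

def allocator_report() -> dict:
    """Return allocated/reserved bytes and the share of reserved memory not in use."""
    if not torch.cuda.is_available():
        return {"allocated": 0, "reserved": 0, "unused_fraction": 0.0}
    allocated = torch.cuda.memory_allocated()   # bytes held by live tensors
    reserved = torch.cuda.memory_reserved()     # bytes the allocator holds from the driver
    unused = (reserved - allocated) / reserved if reserved else 0.0
    return {"allocated": allocated, "reserved": reserved, "unused_fraction": unused}

report = allocator_report()
print(report)
```

Calling this at the same point in every iteration makes a steadily growing `reserved` value easy to spot before it turns into an OOM.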

3 · Contiguous discipline (view vs copy)

Many ops return views: a new tensor that points into the same storage with different strides. Free in compute, but downstream ops that demand contiguity (conv, matmul in some paths, reshape across non-contiguous strides) call .contiguous() which is a copy. The copy can be ~30 % of an op's time, silently.

Op | Returns view?
x.reshape(...) | View if possible, else copy.
x.view(...) | View; errors if strides don't allow it.
x.transpose(...) / x.permute(...) | View (non-contiguous strides).
x.contiguous() | Copy unless already contiguous.
x.expand(...) | View (zero-stride). Free broadcasting.
x.repeat(...) | Copy. Often the wrong choice; use expand if you can.
Indexing with a list/tensor | Copy (gather).
Slicing with : | View.
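
The table can be checked directly by comparing the storage behind each result. This sketch runs on CPU; the helper name `shares_storage` is made up for illustration.

```python
import torch

def shares_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    """True if both tensors are backed by the same underlying buffer."""
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

x = torch.arange(12.0).reshape(3, 4)

t = x.transpose(0, 1)       # view: same buffer, swapped strides
flat = t.reshape(12)        # strides don't allow a view here, so reshape copies
c = t.contiguous()          # copy into a fresh row-major buffer
e = x[:1].expand(3, 4)      # view: stride 0 along the broadcast dimension
r = x[:1].repeat(3, 1)      # copy: materializes all three rows
g = x[[0, 2]]               # list indexing: copy (gather)
s = x[:, 1:3]               # slicing: view

for name, y in [("transpose", t), ("reshape", flat), ("contiguous", c),
                ("expand", e), ("repeat", r), ("index", g), ("slice", s)]:
    kind = "view" if shares_storage(x, y) else "copy"
    print(f"{name:10s} {kind}  contiguous={y.is_contiguous()}")
```

The same check is a quick way to find out whether a suspicious line in a hot path is allocating a new buffer.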

Channels-last is the same idea applied to convolution memory format. For 4D conv on Hopper, x = x.to(memory_format=torch.channels_last) on inputs + weights routes to faster cuDNN kernels. For LLM serving (no conv) this rarely matters.
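
To see what the channels-last conversion changes, it helps to look at the strides: the shape stays NCHW, only the memory layout moves. This is a small CPU-runnable sketch; the layer sizes are arbitrary, and any speedup depends on the GPU and the cuDNN kernels available.

```python
import torch

x = torch.randn(8, 3, 32, 32)                        # NCHW shape, default strides
x_cl = x.to(memory_format=torch.channels_last)       # same shape, NHWC order in memory

conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv = conv.to(memory_format=torch.channels_last)    # converts the 4D weight as well

out = conv(x_cl)

print(x.stride())      # (3072, 1024, 32, 1)
print(x_cl.stride())   # (3072, 1, 96, 3)
```

Converting both inputs and weights once, outside the loop, avoids paying for a layout change on every forward pass.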

4 · Mixed precision, done correctly

Two patterns coexist in modern training:

import torch
from torch.amp import autocast, GradScaler

# Only needed for fp16, not bf16; the bf16 loop below does not use it.
scaler = GradScaler('cuda')

for x, y in loader:
    with autocast('cuda', dtype=torch.bfloat16):
        out = model(x)
        loss = loss_fn(out, y)
    loss.backward()           # GradScaler-free path; bf16 doesn't underflow like fp16
    opt.step()
    opt.zero_grad()

Three gotchas: (1) cast inputs before autocast for predictable behavior; (2) loss functions that include large reductions stay fp32 automatically — don't fight it; (3) save weights as fp32 master copies; the autocast layer downcasts at compute time.

5 · Autograd hygiene
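
The habit list names grad clears, retain_graph, and hooks under this heading. Below is a minimal sketch of the first two, assuming a toy model on CPU; hooks are not covered, and the specific choices (`set_to_none=True`, storing detached losses) are common practice rather than something this lesson prescribes.

```python
import torch

model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
history = []

for _ in range(3):
    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                  # no retain_graph: the graph is freed here
    opt.step()
    opt.zero_grad(set_to_none=True)  # drop grad tensors instead of zero-filling them
    history.append(loss.detach())    # keep the value, not the autograd graph

# Evaluation: no graph is recorded, so activations are not kept for backward.
with torch.no_grad():
    eval_out = model(torch.randn(8, 16))
```

Appending `loss` itself instead of `loss.detach()` would keep a reference to each iteration's graph and whatever it still holds, which is the kind of graph "held longer than it should be" that the introduction mentions.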

6 · Dataloader overlap

If your GPU has idle gaps at the start of each iteration, you're dataloader-bound, not model-bound. Three settings dominate:

Diagnostics: in nsys, look at the CPU row at the start of each iteration. If it's busy and the GPU is idle, the loader is the bottleneck.
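
As a starting point for tuning, here is a sketch of a loader factory. It is an assumption that the relevant knobs are `num_workers`, `prefetch_factor`, and `persistent_workers`; the values are placeholders, and the helper name `make_loader` is hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset, batch_size: int, num_workers: int) -> DataLoader:
    """Build a DataLoader whose workers prepare batches ahead of the GPU."""
    kwargs = dict(
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),  # page-locked batches for async H2D
    )
    if num_workers > 0:
        # These two options are only valid with worker processes.
        kwargs["prefetch_factor"] = 4          # batches queued ahead per worker
        kwargs["persistent_workers"] = True    # keep workers alive across epochs
    return DataLoader(dataset, **kwargs)

dataset = TensorDataset(torch.randn(64, 16), torch.randn(64, 4))
loader = make_loader(dataset, batch_size=8, num_workers=0)  # raise num_workers in real runs
xb, yb = next(iter(loader))
```

Raise `num_workers` until the idle gap at the start of each iteration disappears in the profile; beyond that point extra workers mostly cost host memory.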

7 · Sync discipline (the silent killer)

Re-stating from lesson 18 because it appears in every "why is my training slow" thread:

Code | Effect
print(loss) | Implicit .cpu(). Sync.
logger.info(f"loss={loss}") | Same. The f-string formats the tensor, which copies it to the host.
writer.add_scalar('loss', loss.item(), step) | .item() → sync. Aggregate & log every N steps.
if loss > threshold: | Sync to evaluate the branch.
tensor.cpu().numpy() | Sync.
torch.cuda.synchronize() | Explicit. Use in benchmarks, not in hot loops.

Pattern: keep all logging tensors on GPU until end of step, then batch-sync. A single torch.cuda.synchronize() at the iteration boundary is fine; ten throughout the body is not.
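
The pattern can be sketched as a running sum that lives on the device and is read back only every N steps. The model, data, and the `LOG_EVERY` interval are placeholders chosen for illustration.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

LOG_EVERY = 5                                # hypothetical logging interval
running = torch.zeros((), device=device)     # accumulator stays on the device
logged = []

for step in range(1, 21):
    x = torch.randn(8, 16, device=device)
    y = torch.randn(8, 4, device=device)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)

    running += loss.detach()                 # enqueued on the GPU, no host sync
    if step % LOG_EVERY == 0:
        # The only point in the loop where the host waits for the GPU.
        logged.append((step, running.item() / LOG_EVERY))
        running.zero_()
```

With this shape the loop syncs once per `LOG_EVERY` steps instead of once per step, and the logged value is a mean over the interval rather than a single noisy sample.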

Advanced · streams for compute/comm overlap

By default every op goes on the default CUDA stream. For overlap (e.g., next layer's compute while previous layer's all-reduce is in flight), use multiple streams:

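import torch
import torch.distributed as dist

comp_stream = torch.cuda.Stream()
comm_stream = torch.cuda.Stream()

# IMPORTANT: comm_stream must wait for grad to be produced on the default stream
# before starting the all_reduce. Without this, the all_reduce reads garbage.
comm_stream.wait_stream(torch.cuda.current_stream())
# comp_stream likewise waits for the next layer's inputs, produced on the default stream.
comp_stream.wait_stream(torch.cuda.current_stream())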

with torch.cuda.stream(comm_stream):
    work = dist.all_reduce(grad, async_op=True)   # comm on its own stream

with torch.cuda.stream(comp_stream):
    next_layer_forward(...)                        # compute concurrently

# at sync barriers, the current stream waits for both producers
torch.cuda.current_stream().wait_stream(comp_stream)
torch.cuda.current_stream().wait_stream(comm_stream)
work.wait()

This is the foundation of "comm/compute overlap" in distributed training. Most users get it via framework features (FSDP's backward_prefetch, DDP's bucket overlap) rather than writing it directly — but knowing it exists explains why DDP wires things the way it does.

The synthesis: one decision flow

profile (20): torch.profiler / nsys / ncu

CPU gaps? (dispatch · sync · dataloader) → fix at PyTorch layer (24): sync discipline · dataloader · streams
launches dominate? (small kernels + many launches) → compile / graph (23, 15): fuse + CUDA graph
specific kernel slow? (below its roofline) → kernel work (22, 21): Triton / library / read ncu
memory pressure? (OOM / fragmentation) → allocator / quant (24, 16): expandable_segments · bf16/fp8

Every fix in this track maps to one of these four boxes. Profile first; never guess.

Top-10 checklist for "make my PyTorch model faster"

  1. Wrap the model: model = torch.compile(model). Re-measure after warmup.
  2. Confirm num_workers > 0 and pin_memory=True on the DataLoader, and non_blocking=True on H2D copies.
  3. Remove debug prints / .item() from the hot loop. Log every N steps if needed.
  4. Use autocast(dtype=bfloat16) for the forward; keep optimizer in fp32.
  5. Replace x.repeat(...) with x.expand(...) where possible.
  6. Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for variable shapes.
  7. Profile with torch.profiler — sort by CUDA time, look at top 5.
  8. If the decode step is launch-bound, add mode="reduce-overhead" to torch.compile.
  9. If a specific kernel is slow, take it to ncu (lesson 21) before rewriting.
  10. Benchmark median of 50 steps after a warmup of 10. Don't trust single runs.
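
Item 10 can be sketched as a small timing helper. The function name `benchmark` and the matmul workload are illustrative; the warmup and step counts are the ones from the checklist.

```python
import statistics
import time
import torch

def benchmark(step_fn, warmup: int = 10, steps: int = 50) -> float:
    """Return the median wall-clock seconds per call of `step_fn`."""
    def sync():
        if torch.cuda.is_available():
            torch.cuda.synchronize()   # wait for queued GPU work before reading the clock

    for _ in range(warmup):            # absorbs compilation, autotuning, cache warmup
        step_fn()
    sync()

    times = []
    for _ in range(steps):
        start = time.perf_counter()
        step_fn()
        sync()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
median_s = benchmark(lambda: a @ a)
print(f"median step: {median_s * 1e3:.3f} ms")
```

The synchronize calls belong here because this is a benchmark; without them the timer would measure only how long it takes to enqueue the work.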

Closing the track

Twenty-four lessons. From "what is a thread" to "what to try first when my model is slow." One unifying object runs through all of them: the roofline from lesson 01. Every other concept is a way of moving a kernel toward, along, or off its roof.

The arc, retraced:

Final mental model
Every performance fix is one of three things: fewer bytes moved (fusion, quantization, KV reuse), more useful bytes on chip (tile size, occupancy, async copy), or less work between kernels (compile, graphs, sync discipline). When you encounter "X is slow," name which of the three it is — that names the lesson, which names the lever.

If you remember nothing else: profile before guessing, count bytes before naming a kernel, and place the bar on the roofline. The rest is variations on those three habits.