Performant PyTorch patterns & synthesis

Compile, Triton, and profilers fix the kernels. This lesson fixes everything around the kernels: memory format, allocator pressure, async data movement, mixed precision, and the half-dozen tiny patterns that decide whether a healthy kernel actually runs healthy in your program.

The question this lesson answers

Your kernels are fast. The compiler is on. The profile says GPU utilization is 65 %. Where is the other 35 % going? Usually it's PyTorch-level: a host stall in the dataloader, an allocator hiccup, a needless contiguous(), an autograd graph held longer than it should be. This lesson is a checklist for the patterns that move that 35 %.

Mental model for the whole track

Every performance fix in this track is exactly one of three things:

Fewer bytes moved — fusion (19, 22, 23), prefix reuse (14), quantization (16), KV layout (13).
More useful bytes on chip — tiling (05), tensor cores (09), better tile size, async copy.
Less work between kernels — CUDA graphs (15), torch.compile (23), sync discipline (this lesson), allocator hygiene (this lesson).

The seven habits below are the third category, applied at the PyTorch layer.

The seven habits

Each habit keeps the work between kernels fast, so the timeline you saw in lesson 20 actually has no gaps.

1 · Async host↔device transfer

Copying a CPU tensor to GPU is two steps: pin it (so DMA can grab it without an extra copy), then issue an async transfer. Both have to be set up correctly or you get a sync.

from torch.utils.data import DataLoader

# dataloader side: pin host memory so the H2D copy can run as an async DMA
loader = DataLoader(dataset, batch_size=B, num_workers=4, pin_memory=True)

# transfer side: non-blocking copy
for batch in loader:
    x = batch['x'].to('cuda', non_blocking=True)   # async
    y = batch['y'].to('cuda', non_blocking=True)
    # do GPU work concurrently with the next H2D
    out = model(x)
    loss = loss_fn(out, y)
    loss.backward()

Without pin_memory=True the runtime makes a hidden synchronous copy through a pinned staging buffer. Without non_blocking=True the call waits for completion. The two together overlap H2D with compute.

2 · The caching allocator, and how to not anger it

PyTorch never returns memory to the driver between operations. It caches freed blocks and reuses them. Two failure modes:

Fragmentation. Many small allocations of varied sizes leave gaps the allocator can't reuse for big requests → OOM at a memory utilization well below the GPU's capacity.
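Excess high-water. Once the allocator reserves N GB for a peak, it keeps that memory cached until the process ends or you call torch.cuda.empty_cache(), which returns unused cached blocks to the driver.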

Tool	What it tells you
`torch.cuda.memory_allocated()`	Bytes the program is using.
`torch.cuda.memory_reserved()`	Bytes the allocator has from the driver. The "high-water."
`torch.cuda.memory_summary()`	Per-pool stats, allocations, frees.
`torch.cuda.memory._record_memory_history()` + `_dump_snapshot()`	Replayable timeline of every alloc/free with stack traces. Open in pytorch.org/memory_viz.
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`	Allocator strategy that defragments better for variable shapes.

Practical rules: avoid creating long-lived lists of tensors you no longer need; del them or rebind. Use torch.no_grad() in inference paths so activations aren't held for backward. Beware giant model-init transients (some checkpoint loaders briefly hold 2× the model weight in memory).
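
The gap between allocated and reserved memory is the quickest signal of caching or fragmentation. The sketch below reads the counters from the table above; `memory_report` is a hypothetical helper name, and it assumes you are happy to get zeros back on a machine without CUDA.

```python
import torch

def memory_report(device=None):
    """Return allocator counters in MiB; all zeros when CUDA is unavailable."""
    if not torch.cuda.is_available():
        return {"allocated_mib": 0.0, "reserved_mib": 0.0, "peak_allocated_mib": 0.0}
    mib = 1024 ** 2
    return {
        # bytes currently held by live tensors
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        # bytes the caching allocator has taken from the driver
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
        # largest value memory_allocated() has reached so far
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device) / mib,
    }

report = memory_report()
# Reserved minus allocated is memory the allocator holds but no tensor uses.
slack = report["reserved_mib"] - report["allocated_mib"]
print(report, f"slack={slack:.1f} MiB")
```

Calling this at the end of a step, every few hundred steps, is usually enough to see whether slack is growing over time.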

3 · Contiguous discipline (view vs copy)

Many ops return views: a new tensor that points into the same storage with different strides. Views are free in compute, but downstream ops that demand contiguity (conv, matmul in some paths, reshape across non-contiguous strides) call .contiguous(), which is a copy. That copy can reach ~30 % of an op's time, and nothing in the output tells you it happened.

Op	Returns view?
`x.reshape(...)`	View if possible, else copy.
`x.view(...)`	View; errors if strides don't allow it.
`x.transpose / permute`	View (non-contiguous strides).
`x.contiguous()`	Copy unless already contiguous.
`x.expand(...)`	View (zero-stride). Free broadcasting.
`x.repeat(...)`	Copy. Often the wrong choice — use `expand` if you can.
Indexing with a list/tensor	Copy (gather).
Slicing with `:`	View.
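
The table can be checked directly by asking whether two tensors share storage. This is a small CPU-only sketch; `shares_storage` is a hypothetical helper, and `untyped_storage()` assumes PyTorch 2.x.

```python
import torch

def shares_storage(a, b):
    """True when both tensors are backed by the same underlying storage."""
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

x = torch.arange(12, dtype=torch.float32).reshape(3, 4)

t = x.t()                  # transpose: view with swapped strides
col = x[:, :1]             # slicing: view of the first column, shape (3, 1)
e = col.expand(3, 4)       # expand: view with a zero stride on the broadcast dim
r = col.repeat(1, 4)       # repeat: materializes a new (3, 4) tensor
g = x[[0, 2]]              # list indexing: gather into new storage
flat_view = x.reshape(12)  # contiguous input: reshape returns a view
flat_copy = t.reshape(12)  # non-contiguous input: reshape has to copy

for name, y in [("t", t), ("col", col), ("e", e), ("r", r), ("g", g),
                ("flat_view", flat_view), ("flat_copy", flat_copy)]:
    print(f"{name:10s} view={shares_storage(x, y)} "
          f"contiguous={y.is_contiguous()} stride={y.stride()}")
```

`e` and `r` hold the same values, but only `r` paid for a copy.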

Channels-last is the same idea applied to convolution memory format. For 4D conv on tensor-core GPUs (Hopper included), calling x = x.to(memory_format=torch.channels_last) on the inputs and model = model.to(memory_format=torch.channels_last) on the weights routes to faster cuDNN kernels. For LLM serving (no conv) this rarely matters.
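
Channels-last changes strides, not shapes, which is why it is easy to forget and easy to verify. A minimal sketch on CPU, with arbitrary example sizes (2×3×4×5 input, 8 output channels) chosen only for illustration:

```python
import torch

x = torch.randn(2, 3, 4, 5)                      # NCHW, default contiguous layout
x_cl = x.to(memory_format=torch.channels_last)   # same shape, NHWC-ordered strides

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
conv = conv.to(memory_format=torch.channels_last)  # converts the 4D weight as well

out = conv(x_cl)

print("default strides:      ", x.stride())
print("channels_last strides:", x_cl.stride())
print("weight is channels_last:",
      conv.weight.is_contiguous(memory_format=torch.channels_last))
```

The values are identical in both layouts; only the order in which they sit in memory differs.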

4 · Mixed precision, done correctly

Two patterns coexist in modern training:

AMP / autocast (bf16): matmuls and convolutions run in bf16; reductions and norms stay fp32. Cheap, robust, almost always a win.
fp8 (e4m3/e5m2): tensor-core fp8 GEMM on Hopper. Needs scale management and per-tensor calibration. Up to ~2× the throughput of bf16 when it works.

import torch
from torch.amp import autocast

# An fp16 loop would also need scaler = torch.amp.GradScaler('cuda');
# bf16 has the same exponent range as fp32, so no loss scaling is needed here.

for x, y in loader:
    with autocast('cuda', dtype=torch.bfloat16):
        out = model(x)            # matmuls run in bf16
        loss = loss_fn(out, y)
    loss.backward()               # GradScaler-free path; bf16 doesn't underflow like fp16
    opt.step()
    opt.zero_grad(set_to_none=True)

Three gotchas: (1) cast inputs before autocast for predictable behavior; (2) loss functions that include large reductions stay fp32 automatically — don't fight it; (3) save weights as fp32 master copies; the autocast layer downcasts at compute time.
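
Gotcha (3) is easy to confirm: under autocast the parameters and their gradients stay fp32, and only the op outputs are bf16. The sketch below falls back to CPU autocast when no GPU is present; the layer sizes are arbitrary, and the explicit `.float()` on the output is an assumption made here so the loss dtype does not depend on the autocast op list.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 16, device=device)
y = torch.randn(8, 4, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)                                       # linear runs in bf16
    loss = torch.nn.functional.mse_loss(out.float(), y)  # loss math in fp32

loss.backward()
grad_dtype = model.weight.grad.dtype   # read before the grads are cleared
opt.step()
opt.zero_grad(set_to_none=True)

print("output:", out.dtype)
print("weight:", model.weight.dtype)
print("grad:  ", grad_dtype)
```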

5 · Autograd hygiene

opt.zero_grad(set_to_none=True): sets .grad to None instead of filling it with zeros. Saves an elementwise write per parameter per step. Default since PyTorch 2.0.
Don't pass retain_graph=True unless you need a second backward through the same graph. It keeps every saved activation alive after backward, which raises peak memory.
Hooks (forward/backward) leak. A closure registered with register_forward_hook can keep the entire module, and any tensors it captures, alive. Use weak refs, or keep the returned handle and call handle.remove().
torch.no_grad() on eval paths: turns off graph recording, so activations are freed as soon as they are consumed instead of being saved for backward.
checkpoint(...) trades recomputation for activation memory. Useful at long context lengths.
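
Two of these habits fit in a short sketch: removing a hook through its handle, and checking that checkpointing produces the same gradients as the plain path. The tiny `block` module and the `save_output` hook are hypothetical, and `use_reentrant=False` assumes a reasonably recent PyTorch.

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = torch.nn.Sequential(
    torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 8)
)
x = torch.randn(4, 8, requires_grad=True)

# 1) Hooks: keep the handle and remove it once you are done.
captured = []

def save_output(module, inputs, output):
    captured.append(output.detach())   # detach so the hook does not hold the graph

handle = block[0].register_forward_hook(save_output)
try:
    block(x)                           # this forward is captured
finally:
    handle.remove()
block(x)                               # this one is not

# 2) Checkpointing: same gradients, activations recomputed during backward.
block(x).sum().backward()
grad_plain = x.grad.clone()
x.grad = None
checkpoint(block, x, use_reentrant=False).sum().backward()
grad_ckpt = x.grad.clone()

print(len(captured), torch.allclose(grad_plain, grad_ckpt))
```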

6 · Dataloader overlap

If your GPU has idle gaps at the start of each iteration, you're dataloader-bound, not model-bound. Three settings dominate:

num_workers: usually 2 × number of GPUs or min(8, cpu_count // num_gpus). More workers ≠ better; each one is a CPU process.
prefetch_factor (default 2): batches per worker to keep queued.
persistent_workers=True: keeps workers alive across epochs instead of re-forking them, which saves seconds per epoch when epochs are short.

Diagnostics: in nsys, look at the CPU row at the start of each iteration. If it's busy and the GPU is idle, the loader is the bottleneck.
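
As a starting point, the heuristic above can be turned into a small helper that builds DataLoader keyword arguments. `loader_kwargs` is a hypothetical name and the cap of 8 is the lesson's rule of thumb, not a measured optimum. The helper only sets prefetch_factor and persistent_workers when there is at least one worker, since both options apply only to worker processes.

```python
import os

def loader_kwargs(num_gpus=1, cpu_count=None, cap=8):
    """Suggest DataLoader settings from the min(cap, cpu_count // num_gpus) rule."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    workers = min(cap, cpus // max(num_gpus, 1))
    kwargs = {"num_workers": workers, "pin_memory": True}
    if workers > 0:
        # Both options are only meaningful with worker processes.
        kwargs["prefetch_factor"] = 2
        kwargs["persistent_workers"] = True
    return kwargs

print(loader_kwargs(num_gpus=4, cpu_count=32))
print(loader_kwargs(num_gpus=4, cpu_count=2))
```

The result is meant to be passed as DataLoader(dataset, batch_size=B, **loader_kwargs(num_gpus)), then adjusted after looking at the nsys timeline.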

7 · Sync discipline (the silent killer)

Re-stating from lesson 18 because it appears in every "why is my training slow" thread:

Code	Effect
`print(loss)`	Implicit `.cpu()`. Sync.
`logger.info(f"loss={loss}")`	Same. f-string formats the tensor → `.item()`.
`writer.add_scalar('loss', loss.item(), step)`	`.item()` → sync. Aggregate & log every N steps.
`if loss > threshold:`	Sync to evaluate the branch.
`tensor.cpu().numpy()`	D2H copy, then sync. (Calling `.numpy()` directly on a CUDA tensor raises an error.)
`torch.cuda.synchronize()`	Explicit. Use in benchmarks, not in hot loops.

Pattern: keep all logging tensors on GPU until end of step, then batch-sync. A single torch.cuda.synchronize() at the iteration boundary is fine; ten throughout the body is not.
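
One way to follow this pattern is to accumulate the loss in a device tensor and call .item() only every N steps. `LossMeter` is a hypothetical helper, and it assumes every loss passed in is a scalar on the same device as the meter.

```python
import torch

class LossMeter:
    """Accumulates scalar losses on-device and syncs once every `every` updates."""

    def __init__(self, every, device="cpu"):
        self.every = every
        self.total = torch.zeros((), device=device)
        self.count = 0

    def update(self, loss):
        self.total += loss.detach()    # stays on device, no host sync
        self.count += 1
        if self.count < self.every:
            return None
        mean = (self.total / self.count).item()   # the only host sync
        self.total.zero_()
        self.count = 0
        return mean

meter = LossMeter(every=2)
for step, value in enumerate([1.0, 3.0, 5.0, 7.0]):
    mean = meter.update(torch.tensor(value))
    if mean is not None:
        print(f"step {step}: mean loss {mean}")
```

In a training loop the meter would be created with device='cuda', and the returned mean handed to the logger.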

Advanced · streams for compute/comm overlap

By default every op goes on the default CUDA stream. For overlap (e.g., next layer's compute while previous layer's all-reduce is in flight), use multiple streams:

import torch
import torch.distributed as dist

comp_stream = torch.cuda.Stream()
comm_stream = torch.cuda.Stream()

# IMPORTANT: both side streams must wait for work already queued on the current
# stream, which produced grad and the next layer's inputs. Without this, the
# all_reduce (or the forward) can read garbage.
current = torch.cuda.current_stream()
comm_stream.wait_stream(current)
comp_stream.wait_stream(current)

with torch.cuda.stream(comm_stream):
    work = dist.all_reduce(grad, async_op=True)   # comm on its own stream

with torch.cuda.stream(comp_stream):
    next_layer_forward(...)                        # compute concurrently

# at sync barriers, the current stream waits for both producers
current.wait_stream(comp_stream)
current.wait_stream(comm_stream)
work.wait()                                        # grad is safe to read after this

This is the foundation of "comm/compute overlap" in distributed training. Most users get it via framework features (FSDP's backward_prefetch, DDP's bucket overlap) rather than writing it directly — but knowing it exists explains why DDP wires things the way it does.

The synthesis: one decision flow

Top-10 checklist for "make my PyTorch model faster"

Wrap model = torch.compile(model). Re-measure after warmup.
Confirm num_workers > 0 and pin_memory=True on the DataLoader, and non_blocking=True on H2D copies.
Remove debug prints / .item() from the hot loop. Log every N steps if needed.
Use autocast(dtype=bfloat16) for the forward; keep optimizer in fp32.
Replace x.repeat(...) with x.expand(...) where possible.
Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for variable shapes.
Profile with torch.profiler — sort by CUDA time, look at top 5.
If decode-step launch-bound, add mode="reduce-overhead" to torch.compile.
If a specific kernel is slow, take it to ncu (lesson 21) before rewriting.
Benchmark median of 50 steps after a warmup of 10. Don't trust single runs.
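
The last item is small enough to keep as a helper. This is a minimal sketch: `benchmark` and its `sync` parameter are hypothetical names, and on a GPU you would pass sync=torch.cuda.synchronize so each timing includes the kernels the step launched.

```python
import statistics
import time

def benchmark(step, warmup=10, iters=50, sync=lambda: None):
    """Median wall-clock seconds of `step` over `iters` runs after `warmup` runs."""
    for _ in range(warmup):
        step()                 # warmup: compilation, autotuning, allocator growth
    sync()                     # drain pending work before timing starts
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        sync()                 # wait for async work so it counts toward this run
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Stand-in workload so the sketch runs anywhere.
median_s = benchmark(lambda: sum(range(10_000)))
print(f"median step time: {median_s * 1e6:.1f} µs")
```

The median is used rather than the mean so that a single slow outlier, such as a garbage-collection pause, does not move the result.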

Closing the track

Twenty-four lessons. From "what is a thread" to "what to try first when my model is slow." One unifying object runs through all of them: the roofline from lesson 01. Every other concept is a way of moving a kernel toward, along, or off its roof.

The arc, retraced:

Part I (01–09) built the hardware vocabulary — threads, warps, SMs, memory tiers, tensor cores. Each lesson named one resource and showed how kernels can saturate or waste it.
Part II (10–17) applied those primitives to the specific question of LLM serving. PagedAttention is "tile + indirection." FlashAttention is "tile + online reduction." Continuous batching is "make warps eligible." Cache-aware routing is "amortize HBM across requests." Every serving abstraction reduces to a hardware lever from Part I.
Part III (18–24) gave you the loop — translate a PyTorch line to the kernels it launches, profile, classify the bottleneck on the roofline, fix at the right layer, re-measure. Triton when you need a new kernel, torch.compile when you need many, PyTorch patterns when the gap is between kernels.

Final mental model

Every performance fix is one of three things: fewer bytes moved (fusion, quantization, KV reuse), more useful bytes on chip (tile size, occupancy, async copy), or less work between kernels (compile, graphs, sync discipline). When you encounter "X is slow," name which of the three it is — that names the lesson, which names the lever.

If you remember nothing else: profile before guessing, count bytes before naming a kernel, and place the bar on the roofline. The rest is variations on those three habits.