Performant PyTorch patterns & synthesis
Compile, Triton, and profilers fix the kernels. This lesson fixes everything around the kernels: memory format, allocator pressure, async data movement, mixed precision, and the half-dozen tiny patterns that decide whether a healthy kernel actually runs healthy in your program.
The question this lesson answers
Your kernels are fast. The compiler is on. The profile says GPU utilization is 65 %. Where is the other 35 % going? Usually it's PyTorch-level: a host stall in the dataloader, an allocator hiccup, a needless contiguous(), an autograd graph held longer than it should be. This lesson is a checklist for the patterns that move that 35 %.
- Fewer bytes moved: fusion (19, 22, 23), prefix reuse (14), quantization (16), KV layout (13).
- More useful bytes on chip: tiling (05), tensor cores (09), better tile size, async copy.
- Less work between kernels: CUDA graphs (15), torch.compile (23), sync discipline (this lesson), allocator hygiene (this lesson).
The seven habits
1 · Async host↔device transfer
Copying a CPU tensor to GPU is two steps: pin it (so DMA can grab it without an extra copy), then issue an async transfer. Both have to be set up correctly or you get a sync.
from torch.utils.data import DataLoader

# dataloader side: pin host memory
loader = DataLoader(dataset, batch_size=B, num_workers=4, pin_memory=True)

# transfer side: non-blocking copy
for batch in loader:
    x = batch['x'].to('cuda', non_blocking=True)  # async
    y = batch['y'].to('cuda', non_blocking=True)
    # the host does not wait for the copies; it enqueues the GPU work right away
    out = model(x)
    loss = loss_fn(out, y)
    loss.backward()
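Without pin_memory=True the source is pageable memory, so the runtime stages it through an internal pinned buffer and the copy blocks the host. Without non_blocking=True the call waits for completion even when the source is pinned. With both, the host returns immediately and keeps enqueuing work. To overlap the copy with kernels that are already running on the GPU, issue it on a separate CUDA stream, because work queued on one stream executes in order.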
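A minimal sketch of the transfer side, for batches that arrive as a dict of CPU tensors. The helper name move_batch is hypothetical, and the fallback pin inside it is an assumption for batches that did not come from a pinned loader: it costs an extra host copy, which is exactly the work pin_memory=True moves into the loader. On a machine without CUDA the helper degrades to a plain copy so the example still runs.

```python
import torch


def move_batch(batch, device):
    """Copy every tensor in a dict batch to `device` without blocking the host.

    `move_batch` is a hypothetical helper, not a PyTorch API.
    """
    use_cuda = torch.device(device).type == "cuda"
    moved = {}
    for name, tensor in batch.items():
        if use_cuda and not tensor.is_pinned():
            # Pageable memory forces a staged, blocking copy, so pin it first.
            # A DataLoader with pin_memory=True makes this branch unnecessary.
            tensor = tensor.pin_memory()
        # non_blocking only has an effect for pinned host memory -> CUDA copies.
        moved[name] = tensor.to(device, non_blocking=use_cuda)
    return moved


device = "cuda" if torch.cuda.is_available() else "cpu"
batch = {"x": torch.randn(4, 8), "y": torch.arange(4)}
on_device = move_batch(batch, device)
```

Reading a value back (for example with .cpu() or .item()) synchronizes, so keep such reads out of the loop body if the goal is to hide the copy.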
2 · The caching allocator, and how to not anger it
PyTorch never returns memory to the driver between operations. It caches freed blocks and reuses them. Two failure modes:
- Fragmentation. Many small allocations of varied sizes leave gaps the allocator can't reuse for big requests → OOM at a memory utilization well below the GPU's capacity.
- Excess high-water. Once the allocator reserves N GB for a peak, it holds it until the process ends or you call torch.cuda.empty_cache().
| Tool | What it tells you |
|---|---|
| torch.cuda.memory_allocated() | Bytes the program is using. |
| torch.cuda.memory_reserved() | Bytes the allocator has from the driver. The "high-water." |
| torch.cuda.memory_summary() | Per-pool stats, allocations, frees. |
| torch.cuda.memory._record_memory_history() + _dump_snapshot() | Replayable timeline of every alloc/free with stack traces. Open in pytorch.org/memory_viz. |
| PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | Allocator strategy that defragments better for variable shapes. |
Practical rules: avoid creating long-lived lists of tensors you no longer need; del them or rebind. Use torch.no_grad() in inference paths so activations aren't held for backward. Beware giant model-init transients (some checkpoint loaders briefly hold 2× the model weight in memory).
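The gap between "allocated" and "reserved" is easiest to see by measuring it around a throwaway allocation. The sketch below assumes a hypothetical helper, allocator_report, and a 64 MiB scratch buffer chosen only for illustration; without CUDA it reports zeros so it still runs.

```python
import torch


def allocator_report(device=None):
    """Allocated vs reserved bytes for the caching allocator.

    `allocator_report` is a hypothetical helper; it returns zeros without CUDA.
    """
    if not torch.cuda.is_available():
        return {"allocated": 0, "reserved": 0, "cached_but_unused": 0}
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    return {
        "allocated": allocated,
        "reserved": reserved,
        "cached_but_unused": reserved - allocated,
    }


before = allocator_report()
if torch.cuda.is_available():
    scratch = torch.empty(64 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    del scratch                      # block goes back to the cache, not the driver
    after_del = allocator_report()   # reserved stays high, allocated drops
    torch.cuda.empty_cache()         # hands unused cached blocks back to the driver
after_empty = allocator_report()
```

Calling empty_cache() in a hot loop usually hurts, since the next allocation has to go back to the driver. It is more useful once, after a known transient such as checkpoint loading.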
3 · Contiguous discipline (view vs copy)
Many ops return views: a new tensor that points into the same storage with different strides. Free in compute, but downstream ops that demand contiguity (conv, matmul in some paths, reshape across non-contiguous strides) call .contiguous() which is a copy. The copy can be ~30 % of an op's time, silently.
| Op | Returns view? |
|---|---|
| x.reshape(...) | View if possible, else copy. |
| x.view(...) | View; errors if strides don't allow it. |
| x.transpose / permute | View (non-contiguous strides). |
| x.contiguous() | Copy unless already contiguous. |
| x.expand(...) | View (zero-stride). Free broadcasting. |
| x.repeat(...) | Copy. Often the wrong choice; use expand if you can. |
| Indexing with a list/tensor | Copy (gather). |
| Slicing with : | View. |
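The table can be checked directly by asking whether two tensors share the same underlying storage. This is a small CPU sketch; shares_storage is a hypothetical helper built on Tensor.untyped_storage(), which assumes PyTorch 2.x.

```python
import torch


def shares_storage(a, b):
    """True if both tensors point into the same storage (hypothetical helper)."""
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()


x = torch.arange(12).reshape(3, 4)
row = torch.arange(4).unsqueeze(0)   # shape (1, 4)

t = x.transpose(0, 1)                # view with non-contiguous strides
s = x[:, 1:3]                        # slicing: view
e = row.expand(3, 4)                 # view; stride 0 on the broadcast dim
r = row.repeat(3, 1)                 # copy: materializes all 3 rows
g = x[[0, 2]]                        # list indexing: copy (gather)
c = t.contiguous()                   # copy, because t is not contiguous
flat = t.reshape(12)                 # reshape silently falls back to a copy

try:
    t.view(12)                       # view refuses instead of copying
    view_failed = False
except RuntimeError:
    view_failed = True
```

A quick audit habit: when an op is slower than expected, print .is_contiguous() and .stride() on its inputs before blaming the kernel.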
Channels-last is the same idea applied to convolution memory format. For 4D conv on Hopper, x = x.to(memory_format=torch.channels_last) on inputs + weights routes to faster cuDNN kernels. For LLM serving (no conv) this rarely matters.
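Channels-last changes strides, not shape, which is why it is easy to apply and easy to lose. A minimal CPU sketch with illustrative sizes (the tensor shape and the Conv2d arguments are assumptions chosen for the example):

```python
import torch

x = torch.randn(2, 3, 4, 5)                          # N, C, H, W
x_cl = x.to(memory_format=torch.channels_last)       # same shape, NHWC in memory

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
conv = conv.to(memory_format=torch.channels_last)    # convert the weights too

out = conv(x_cl)

nchw_strides = x.stride()        # C is the second-slowest dimension
nhwc_strides = x_cl.stride()     # C becomes the fastest-moving dimension
```

Any op in the model that calls .contiguous() without a memory_format argument converts back to the default layout, so it is worth checking is_contiguous(memory_format=torch.channels_last) at a few points in the forward pass.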
4 · Mixed precision, done correctly
Two patterns coexist in modern training:
- AMP / autocast (bf16): matmuls and convolutions run in bf16; reductions and norms stay fp32. Cheap, robust, almost always a win.
- fp8 (e4m3/e5m2): tensor-core fp8 GEMM on Hopper. Needs scale management and per-tensor calibration. ~2× over bf16 in throughput when it works.
import torch
from torch.amp import autocast, GradScaler

scaler = GradScaler('cuda')  # only needed for fp16, not bf16

for x, y in loader:
    with autocast('cuda', dtype=torch.bfloat16):
        out = model(x)
        loss = loss_fn(out, y)
    # backward runs outside the autocast region
    loss.backward()  # GradScaler-free path; bf16 doesn't underflow like fp16
    opt.step()
    opt.zero_grad()
Three gotchas: (1) cast inputs before autocast for predictable behavior; (2) loss functions that include large reductions stay fp32 automatically — don't fight it; (3) save weights as fp32 master copies; the autocast layer downcasts at compute time.
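The dtype bookkeeping behind those gotchas can be inspected on a toy model. This is a sketch under assumptions: a single nn.Linear, SGD, and random data stand in for a real model, and it falls back to CPU autocast when no GPU is present.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

model = nn.Linear(16, 4).to(device)           # parameters stay fp32 (master copy)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(8, 16, device=device)
y = torch.randn(8, 4, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)                            # matmul runs in bf16
    # Explicit upcast so the reduction is fp32 regardless of backend.
    loss = nn.functional.mse_loss(out.float(), y)

loss.backward()                               # outside the autocast region
grad_dtype = model.weight.grad.dtype          # gradients match the fp32 weights
opt.step()
opt.zero_grad(set_to_none=True)
```

The weights never change dtype; only the activations produced inside the autocast region do. That is why checkpoints saved from this loop are fp32.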
5 · Autograd hygiene
- opt.zero_grad(set_to_none=True): sets .grad to None instead of zero-tensoring. Saves an entire elementwise op per parameter per step. Default in modern PyTorch.
- Don't retain_graph=True unless you need it. Holds onto activations and turns one step into many.
- Hooks (forward/backward) leak. Returning a closure from a register_forward_hook can hold the entire module. Use weak refs or remove the hook explicitly.
- torch.no_grad() on eval paths: turns off graph tape, halves activation memory.
- checkpoint(...) trades recomputation for activation memory. Useful at long context lengths.
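Four of these habits fit in one short CPU script. The model, sizes, and variable names below are illustrative assumptions; the calls themselves (zero_grad, register_forward_hook, no_grad, torch.utils.checkpoint.checkpoint) are standard PyTorch.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 16)

# 1) set_to_none=True drops the .grad tensors instead of filling them with zeros.
model(x).sum().backward()
opt.zero_grad(set_to_none=True)
grads_cleared = all(p.grad is None for p in model.parameters())

# 2) A hook returns a handle; removing it releases the closure it captured.
seen_shapes = []
handle = model[0].register_forward_hook(
    lambda mod, inp, out: seen_shapes.append(tuple(out.shape))
)
model(x)           # hook fires once
handle.remove()
model(x)           # hook no longer fires

# 3) no_grad: no graph is recorded, so activations are freed immediately.
with torch.no_grad():
    eval_out = model(x)

# 4) checkpoint: activations are recomputed in backward; gradients are unchanged.
model(x).sum().backward()
plain_grad = model[0].weight.grad.clone()
opt.zero_grad(set_to_none=True)
checkpoint(model, x, use_reentrant=False).sum().backward()
ckpt_grad = model[0].weight.grad.clone()
```

In a real model you would checkpoint individual blocks rather than the whole network, so that only one block's activations are live during recomputation.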
6 · Dataloader overlap
If your GPU has idle gaps at the start of each iteration, you're dataloader-bound, not model-bound. Three settings dominate:
- num_workers: usually 2 × number of GPUs or min(8, cpu_count // num_gpus). More workers ≠ better; each one is a CPU process.
- prefetch_factor (default 2): batches per worker to keep queued.
- persistent_workers=True: avoids re-forking workers each epoch, saving seconds per epoch when epochs are short.
Diagnostics: in nsys, look at the CPU row at the start of each iteration. If it's busy and the GPU is idle, the loader is the bottleneck.
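The worker heuristic above can be written down as a starting point, assuming the min(8, cpu_count // num_gpus) variant. Both function names are hypothetical. The returned dict is meant to be splatted into DataLoader(dataset, batch_size=..., **kwargs); prefetch_factor and persistent_workers are only included when there is at least one worker, because DataLoader rejects them when num_workers is 0.

```python
import os


def suggest_num_workers(num_gpus, cpu_count=None):
    """Starting guess for num_workers per process (hypothetical helper)."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(0, min(8, cpu_count // max(1, num_gpus)))


def loader_kwargs(num_gpus, cpu_count=None, prefetch_factor=2):
    """Keyword arguments for torch.utils.data.DataLoader (hypothetical helper)."""
    workers = suggest_num_workers(num_gpus, cpu_count)
    kwargs = {"num_workers": workers, "pin_memory": True}
    if workers > 0:
        # Both options are only valid with multiprocessing workers.
        kwargs["prefetch_factor"] = prefetch_factor
        kwargs["persistent_workers"] = True
    return kwargs


example = loader_kwargs(num_gpus=2, cpu_count=32)
```

Treat the result as a first guess and confirm it in the profile: if the CPU row is still saturated while the GPU idles, the per-sample preprocessing is the problem, not the worker count.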
7 · Sync discipline (the silent killer)
Re-stating from lesson 18 because it appears in every "why is my training slow" thread:
| Code | Effect |
|---|---|
| print(loss) | Implicit .cpu(). Sync. |
| logger.info(f"loss={loss}") | Same. f-string formats the tensor → .item(). |
| writer.add_scalar('loss', loss.item(), step) | .item() → sync. Aggregate & log every N steps. |
| if loss > threshold: | Sync to evaluate the branch. |
| tensor.numpy() | Sync. |
| torch.cuda.synchronize() | Explicit. Use in benchmarks, not in hot loops. |
Pattern: keep all logging tensors on GPU until end of step, then batch-sync. A single torch.cuda.synchronize() at the iteration boundary is fine; ten throughout the body is not.
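One way to apply the pattern is to accumulate the loss on the device and read it back only every N steps. A minimal sketch: mean_loss_every is a hypothetical helper, and the list of scalar tensors stands in for the losses a training loop would produce.

```python
import torch


def mean_loss_every(losses, log_every):
    """Average loss per window of `log_every` steps, with one sync per window.

    `mean_loss_every` is a hypothetical helper; `losses` is any iterable of
    scalar tensors living on one device.
    """
    losses = list(losses)
    running = torch.zeros((), device=losses[0].device)
    logged = []
    for step, loss in enumerate(losses, start=1):
        running += loss.detach()          # stays on the device, no sync
        if step % log_every == 0:
            # The only host read: one .item() per window instead of per step.
            logged.append((step, (running / log_every).item()))
            running.zero_()
    return logged


fake_losses = [torch.tensor(float(i)) for i in range(1, 7)]
history = mean_loss_every(fake_losses, log_every=3)
```

The detach() matters as much as the deferred .item(): accumulating the raw loss would keep every step's autograd graph alive until the window is flushed.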
Advanced · streams for compute/comm overlap
By default every op goes on the default CUDA stream. For overlap (e.g., next layer's compute while previous layer's all-reduce is in flight), use multiple streams:
import torch
import torch.distributed as dist

comp_stream = torch.cuda.Stream()
comm_stream = torch.cuda.Stream()

# IMPORTANT: comm_stream must wait for grad to be produced on the default stream
# before starting the all_reduce. Without this, the all_reduce reads garbage.
comm_stream.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(comm_stream):
    work = dist.all_reduce(grad, async_op=True)  # comm on its own stream

with torch.cuda.stream(comp_stream):
    next_layer_forward(...)  # compute concurrently

# at sync barriers, the current stream waits for both producers
torch.cuda.current_stream().wait_stream(comp_stream)
torch.cuda.current_stream().wait_stream(comm_stream)
work.wait()
This is the foundation of "comm/compute overlap" in distributed training. Most users get it via framework features (FSDP's backward_prefetch, DDP's bucket overlap) rather than writing it directly — but knowing it exists explains why DDP wires things the way it does.
The synthesis: one decision flow
Top-10 checklist for "make my PyTorch model faster"
- Wrap model = torch.compile(model). Re-measure after warmup.
- Confirm num_workers > 0 and pin_memory=True, non_blocking=True on H2D copies.
- Remove debug prints / .item() from the hot loop. Log every N steps if needed.
- Use autocast(dtype=bfloat16) for the forward; keep optimizer in fp32.
- Replace x.repeat(...) with x.expand(...) where possible.
- Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for variable shapes.
- Profile with torch.profiler: sort by CUDA time, look at top 5.
- If decode-step launch-bound, add mode="reduce-overhead" to torch.compile.
- If a specific kernel is slow, take it to ncu (lesson 21) before rewriting.
- Benchmark median of 50 steps after a warmup of 10. Don't trust single runs.
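The last item is worth having as a reusable function. The sketch below uses only the standard library; benchmark, step_fn, and sync_fn are hypothetical names. For GPU work, pass torch.cuda.synchronize as sync_fn so each sample measures finished kernels rather than launch time.

```python
import statistics
import time


def benchmark(step_fn, warmup=10, iters=50, sync_fn=None):
    """Median wall time of `step_fn` over `iters` runs after `warmup` runs.

    `benchmark` is a hypothetical helper. `sync_fn`, if given, is called
    before and after each timed step (e.g. torch.cuda.synchronize).
    """
    def timed_call():
        if sync_fn is not None:
            sync_fn()                      # drain pending work before timing
        start = time.perf_counter()
        step_fn()
        if sync_fn is not None:
            sync_fn()                      # wait for the step to really finish
        return time.perf_counter() - start

    for _ in range(warmup):                # compile, autotune, fill caches
        step_fn()
    samples = [timed_call() for _ in range(iters)]
    return statistics.median(samples), samples


calls = []
median_s, samples = benchmark(lambda: calls.append(sum(range(1000))))
```

The median is preferred over the mean here because a single allocator or scheduler hiccup would otherwise dominate the result. Keeping the raw samples lets you look at the spread when two variants are close.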
Closing the track
Twenty-four lessons. From "what is a thread" to "what to try first when my model is slow." One unifying object runs through all of them: the roofline from lesson 01. Every other concept is a way of moving a kernel toward, along, or off its roof.
The arc, retraced:
- Part I (01–09) built the hardware vocabulary — threads, warps, SMs, memory tiers, tensor cores. Each lesson named one resource and showed how kernels can saturate or waste it.
- Part II (10–17) applied those primitives to the specific question of LLM serving. PagedAttention is "tile + indirection." FlashAttention is "tile + online reduction." Continuous batching is "make warps eligible." Cache-aware routing is "amortize HBM across requests." Every serving abstraction reduces to a hardware lever from Part I.
- Part III (18–24) gave you the loop: translate a PyTorch line to the kernels it launches, profile, classify the bottleneck on the roofline, fix at the right layer, re-measure. Triton when you need a new kernel, torch.compile when you need many, PyTorch patterns when the gap is between kernels.
If you remember nothing else: profile before guessing, count bytes before naming a kernel, and place the bar on the roofline. The rest is variations on those three habits.