Profiling toolkit: three views
Three tools answer three different questions. torch.profiler asks "which line is slow." nsys asks "where are the gaps." ncu asks "why is this kernel slow." Picking the wrong one wastes hours; picking the right one cuts the loop.
The question this lesson answers
"My training step is slow" is not a question a profiler can answer — it is the symptom you walk back from. The actionable question is one of: which CPU/GPU work is on the critical path, where the gaps and syncs hide, and why a specific kernel runs below its roofline. Each tool exists for one of those questions.
The three-tool map
What each tool surfaces
| Tool | Captures | Best for | Misses |
|---|---|---|---|
| torch.profiler | Python op stack, CUDA kernels per op, memory allocations. | "This F.gelu call is the slow one." Op-level attribution. Memory snapshots. | Sub-kernel detail; cross-process activity; sync sources outside autograd. |
| nsys | CPU threads, CUDA streams, kernel launches, NCCL, NVLink, memory copies. Across all processes. | Idle gaps, sync chains, multi-GPU overlap, dataloader stalls. | Why a specific kernel is slow (just shows duration, not internal metrics). |
| ncu | Per-kernel: roofline, achieved occupancy, memory throughput, warp stall reasons, register pressure. | "This GEMM hits 40 % of peak — why?" The kernel-author's microscope. | Anything outside the captured kernel; cross-process timing. |
How each one is invoked (cheat sheet)
# torch.profiler — wrap a few steps in your training loop
import torch
from torch.profiler import profile, ProfilerActivity, schedule
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./tb'),
    record_shapes=True, with_stack=True, profile_memory=True,
) as prof:
    for step in range(6):
        train_step()
        prof.step()
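# Nsight Systems: launch the run from the shell. With --capture-range=cudaProfilerApi,
# collection starts only when the script calls torch.cuda.profiler.start();
# drop that flag to capture the whole run.
nsys profile -t cuda,nvtx,osrt -o run --capture-range=cudaProfilerApi python train.py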
# Nsight Compute — replay one specific kernel
ncu --set full --target-processes all --kernel-name "rmsnorm_fwd" \
--launch-skip 50 --launch-count 1 python bench.py
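To see which of the six loop iterations in the torch.profiler snippet actually get recorded, the schedule arguments can be traced by hand. The sketch below is a plain-Python model of the documented wait/warmup/active/repeat cycle, not the library's own code; phase_for_step is a hypothetical helper name, and the model assumes skip_first=0.

```python
def phase_for_step(step, wait, warmup, active, repeat=0):
    """Return the profiler phase for a zero-based step index.

    Models the wait -> warmup -> active cycle of torch.profiler.schedule,
    assuming skip_first=0. repeat=0 means the cycle repeats indefinitely.
    """
    cycle = wait + warmup + active
    if repeat > 0 and step >= cycle * repeat:
        return "none"      # all requested cycles are finished
    pos = step % cycle
    if pos < wait:
        return "none"      # profiler idle
    if pos < wait + warmup:
        return "warmup"    # tracing on, results discarded
    return "record"        # results kept in the trace


# Same arguments as the cheat sheet: schedule(wait=1, warmup=1, active=3, repeat=1)
phases = [phase_for_step(s, wait=1, warmup=1, active=3, repeat=1) for s in range(6)]
print(phases)
```

Under this model only three of the six steps land in the trace, which is why the loop has to run longer than the active count.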
The headline metrics, by tool
| Metric | Where to find it | What it tells you |
|---|---|---|
| CUDA time / op | torch.profiler table | Which PyTorch line consumes the most kernel time. |
| CPU time / op | torch.profiler table | If CPU time matches or exceeds CUDA time, you're likely launch-bound (back to lesson 18). |
| GPU active % | nsys timeline summary | <80 % → gaps. Look for sync calls, host work, NCCL waits. |
| NCCL kernel duration | nsys CUDA HW row | Distinguishes "comm is slow" from "comm overlaps badly." |
| DRAM throughput | ncu "Memory Workload Analysis" | % of HBM peak this kernel achieves. <50 % → memory-side opportunity. |
| SM throughput | ncu "Compute Workload Analysis" | % of peak FLOPs. Combined with DRAM, places you on the roofline (lesson 21). |
| Achieved Occupancy | ncu | % of warps the SM is keeping in flight. Low → register/SMEM pressure or tiny grids. |
| Warp Stall Reasons | ncu "Warp State Statistics" (plus "Scheduler Statistics" for No Eligible) | Names the dominant stall: MIO Throttle (memory-instruction queue), Long Scoreboard (global-memory latency), No Eligible (warp underflow), etc. |
| Register / Thread | ncu launch stats | >128 limits occupancy; actual spills happen above the 255 per-thread cap (or below if compiled with -maxrregcount). |
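The register row is simple arithmetic once the hardware limits are fixed. The sketch below assumes 65,536 registers and 2,048 resident threads per SM (figures that match recent data-center NVIDIA GPUs; check your own part), and it ignores allocation granularity and shared-memory limits, so treat the result as an upper bound. The function name is hypothetical.

```python
REGS_PER_SM = 65_536        # assumption: register file size per SM
MAX_THREADS_PER_SM = 2_048  # assumption: max resident threads per SM


def register_limited_occupancy(regs_per_thread):
    """Upper bound on occupancy imposed by register usage alone."""
    if not 1 <= regs_per_thread <= 255:
        raise ValueError("regs_per_thread must be in 1..255")
    threads = min(MAX_THREADS_PER_SM, REGS_PER_SM // regs_per_thread)
    return threads / MAX_THREADS_PER_SM


for regs in (32, 64, 128, 255):
    print(regs, f"{register_limited_occupancy(regs):.0%}")
```

Under these assumptions, anything above 32 registers per thread already caps occupancy below 100 %, and 128 registers per thread caps it at 25 %.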
NVTX: making profiles readable
By default, profiles label work as "aten::matmul"-shaped strings. That is fine for one op but useless across a layer. NVTX ranges let you annotate code regions; both nsys and torch.profiler render them as named bars on the timeline. Use them whenever you have logical phases the tool can't infer.
import torch.cuda.nvtx as nvtx
nvtx.range_push("transformer_block")
out = self.norm1(x)
nvtx.range_push("attention")
out = self.attn(out)
nvtx.range_pop()
nvtx.range_push("mlp")
out = self.mlp(out)
nvtx.range_pop()
nvtx.range_pop()
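Manual push/pop pairs are easy to unbalance when a block raises or returns early. One way to guard against that is a small context manager; the sketch below is an assumption-laden convenience wrapper (nvtx_range is a hypothetical name), and it degrades to a no-op when PyTorch or CUDA is unavailable so it can run anywhere. PyTorch also ships torch.cuda.nvtx.range, which serves the same purpose.

```python
from contextlib import contextmanager

try:
    import torch
    _HAVE_NVTX = torch.cuda.is_available()
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None
    _HAVE_NVTX = False


@contextmanager
def nvtx_range(name):
    """Push an NVTX range on entry and always pop it on exit."""
    if _HAVE_NVTX:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _HAVE_NVTX:
            torch.cuda.nvtx.range_pop()


# Usage: nesting mirrors the push/pop pairs above.
with nvtx_range("transformer_block"):
    with nvtx_range("attention"):
        attn_out = sum(range(4))   # stand-in for self.attn(...)
    with nvtx_range("mlp"):
        mlp_out = attn_out * 2     # stand-in for self.mlp(...)
```

The finally clause is the point of the design: the pop runs even if the wrapped code raises, so the ranges on the timeline stay properly nested.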
Reading a torch.profiler table
After capture, this gives the most useful first look:
print(prof.key_averages().table(
sort_by="cuda_time_total", row_limit=20
))
What to scan for:
- Top 3 rows by CUDA time: usually attention + GEMMs. If something unexpected is here (a copy, a sort, a cast), that's your easy win.
- CPU total ≫ CUDA total: launch-bound. Compile it or capture it in a CUDA graph.
- # of Calls: a tiny op called 10⁵ times can dominate. Often a Python loop you didn't notice.
- Input Shapes (with record_shapes=True and key_averages(group_by_input_shape=True)): the same op with two different shapes shows as two rows. Useful for catching shape-driven recompiles.
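The scan above can be mechanized once the table is reduced to a few numbers per op. The sketch below works on hand-built rows rather than the profiler's own objects; OpRow, scan, the 2× CPU-to-CUDA ratio, and the 10,000-call threshold are all assumptions chosen for illustration, and the timings are made up.

```python
from dataclasses import dataclass


@dataclass
class OpRow:
    name: str
    calls: int
    cpu_us: float   # total CPU time, microseconds
    cuda_us: float  # total CUDA time, microseconds


def scan(rows, top_n=3, many_calls=10_000):
    """Return (top ops by CUDA time, launch-bound suspects, high-call-count ops)."""
    by_cuda = sorted(rows, key=lambda r: r.cuda_us, reverse=True)
    top = [r.name for r in by_cuda[:top_n]]
    # Assumption: CPU time more than 2x CUDA time marks a launch-bound suspect.
    launch_bound = [r.name for r in rows if r.cpu_us > 2 * r.cuda_us]
    chatty = [r.name for r in rows if r.calls >= many_calls]
    return top, launch_bound, chatty


# Made-up numbers, for illustration only.
rows = [
    OpRow("aten::mm", 480, 9_000, 410_000),
    OpRow("aten::scaled_dot_product_attention", 96, 4_000, 260_000),
    OpRow("aten::copy_", 2_000, 30_000, 120_000),
    OpRow("aten::add", 100_000, 900_000, 60_000),
]
top, launch_bound, chatty = scan(rows)
print(top, launch_bound, chatty)
```

In this invented example the copy shows up in the top three (the unexpected row worth chasing), while the add is both launch-bound and called often enough to suggest a hidden Python loop.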
Reading an nsys timeline
You're looking at three rows that matter:
- OS runtime / CPU: Python stack + libc. Big white spans here = the CPU is busy on host code (dataloader, tokenizer, etc.).
- CUDA API: when the CPU is calling into CUDA (cudaLaunchKernel, cudaMemcpy, etc.). Dense here = lots of dispatch.
- CUDA HW (per stream): the actual GPU work. Gaps here = the GPU is waiting on the CPU or on sync. The fundamental question is "are the gaps long enough to matter."
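Whether the gaps are long enough to matter is a question about intervals. The sketch below assumes kernel start and end times have been exported from the timeline as (start, end) pairs; gpu_activity is a hypothetical helper and the sample spans are made up. It merges overlapping kernels, then reports the active fraction and any idle span longer than a threshold.

```python
def gpu_activity(intervals, min_gap=0.0):
    """Merge (start, end) kernel intervals; return (active_fraction, gaps).

    Gaps are idle spans between the first start and the last end that are
    longer than min_gap. Units are whatever the inputs use.
    """
    if not intervals:
        return 0.0, []
    ordered = sorted(intervals)
    merged = [list(ordered[0])]
    for start, end in ordered[1:]:
        if start <= merged[-1][1]:          # overlaps or touches the previous span
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    window = merged[-1][1] - merged[0][0]
    if window == 0:
        return 0.0, []
    busy = sum(end - start for start, end in merged)
    gaps = [
        (a_end, b_start)
        for (_, a_end), (b_start, _) in zip(merged, merged[1:])
        if b_start - a_end > min_gap
    ]
    return busy / window, gaps


# Made-up kernel spans in milliseconds, for illustration only.
kernels = [(0, 4), (3, 6), (10, 12), (12, 15), (15.5, 20)]
active, gaps = gpu_activity(kernels, min_gap=1.0)
print(f"{active:.1%}", gaps)
```

Filtering by min_gap keeps the short bubbles between back-to-back launches out of the list, so what remains are the spans worth tracing back to a sync call or host work.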
Reading an ncu report
Open the kernel in question. The top section ("GPU Speed Of Light") gives the headline: percent of peak compute and percent of peak memory throughput. Together they place the kernel on the roofline (lesson 21).
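If both are low (under ~30 %), the kernel is stalled rather than saturating either resource. "Warp State Statistics" names the stall reason, and "Scheduler Statistics" shows how often no warp was eligible to issue:
- Long Scoreboard: waiting on global or local memory (L1TEX, and behind it L2 and HBM). Fix with better locality, tiling, or fewer reads.
- MIO Throttle: the memory-instruction queue is full, typically from dense shared-memory or special-function instructions.
- Short Scoreboard: waiting on SMEM or another MIO operation. Often a sign of bank conflicts.
- No Eligible: not enough warps in flight. Lower register pressure or raise num_warps.
- Barrier: warps waiting at __syncthreads(), usually because work before the barrier is imbalanced.
- Selected: healthy; the warp issued an instruction.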
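The Speed Of Light reading described above can be written down as a first-pass triage. The sketch below takes the two headline percentages and applies the ~30 % cutoff from the text; classify_kernel is a hypothetical helper, and labelling the larger of the two percentages as the bound is a simplifying assumption, not something the report states for you.

```python
def classify_kernel(compute_pct, memory_pct, stalled_below=30.0):
    """First-pass label from the two Speed Of Light percentages."""
    if compute_pct < stalled_below and memory_pct < stalled_below:
        return "stalled: check stall reasons"
    if memory_pct >= compute_pct:
        return "memory-bound"
    return "compute-bound"


print(classify_kernel(20, 25))
print(classify_kernel(35, 80))
print(classify_kernel(85, 40))
```

The label only tells you which section of the report to open next; the stall reasons and the roofline placement carry the actual diagnosis.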
Cost & overhead, honestly
| Tool | Overhead | When to keep it on |
|---|---|---|
| torch.profiler w/o memory | ~5–10 % | Always fine for a few warmup+active steps. |
| torch.profiler w/ profile_memory & with_stack | ~30–80 % | Once, to find a specific allocation. Don't ship. |
| nsys | ~2–10 % | Production-like reproductions. Capture <30 s of run. |
| ncu --set full | ~5–50× per kernel | One kernel at a time, --launch-count 1. Never for "profile the whole training step." |
Interactive · pick the right tool
Describe your symptom; the widget names the first tool to reach for. This isn't a real classifier; it's a memorization aid.
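In the same spirit, the symptom-to-tool mapping from this lesson fits in a keyword lookup. The sketch below is a memorization aid only; the keyword lists, the rule order, and the torch.profiler default are assumptions drawn loosely from the "Best for" column, and first_tool is a hypothetical name.

```python
RULES = [
    # (keywords that suggest the tool, tool name); earlier rules win ties
    (("gap", "idle", "sync", "nccl", "dataloader", "overlap", "multi-gpu"), "nsys"),
    (("kernel", "occupancy", "roofline", "register", "stall"), "ncu"),
]


def first_tool(symptom):
    """Name the first tool to reach for, given a free-text symptom."""
    text = symptom.lower()
    for keywords, tool in RULES:
        if any(k in text for k in keywords):
            return tool
    # Assumption: with no stronger signal, start with op-level attribution.
    return "torch.profiler"


for s in (
    "GPU sits idle between steps",
    "this GEMM kernel hits 40% of peak",
    "which line of my model is slow",
):
    print(s, "->", first_tool(s))
```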
What this gives you for the next lesson
You can collect data; lesson 21 teaches you what the data means. We'll take a few realistic kernel snapshots and walk them onto the roofline (lesson 01) to decide whether the lever is bandwidth, compute, or launch overhead — and what to do in each case.