Profiling toolkit: three views

Three tools answer three different questions. torch.profiler asks "which line is slow." nsys asks "where are the gaps." ncu asks "why is this kernel slow." Picking the wrong one wastes hours; picking the right one cuts the loop.

The question this lesson answers

"My training step is slow" is not a question a profiler can answer — it is the symptom you walk back from. The actionable question is one of: which CPU/GPU work is on the critical path, where the gaps and syncs hide, and why a specific kernel runs below its roofline. Each tool exists for one of those questions.

The three-tool map

What each tool surfaces

Tool	Captures	Best for	Misses
`torch.profiler`	Python op stack, CUDA kernels per op, memory allocations.	"This `F.gelu` call is the slow one." Op-level attribution. Memory snapshots.	Sub-kernel detail; cross-process activity; sync sources outside autograd.
`nsys`	CPU threads, CUDA streams, kernel launches, NCCL, NVLink, memory copies. Across all processes.	Idle gaps, sync chains, multi-GPU overlap, dataloader stalls.	Why a specific kernel is slow (just shows duration, not internal metrics).
`ncu`	Per-kernel: roofline, achieved occupancy, memory throughput, warp stall reasons, register pressure.	"This GEMM hits 40 % of peak — why?" The kernel-author's microscope.	Anything outside the captured kernel; cross-process timing.

How each one is invoked (cheat sheet)

# torch.profiler — wrap a few steps in your training loop
import torch
from torch.profiler import profile, ProfilerActivity, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./tb'),
    record_shapes=True, with_stack=True, profile_memory=True,
) as prof:
    for step in range(6):
        train_step()
        prof.step()

# Nsight Systems — wrap the whole run from the shell
nsys profile -t cuda,nvtx,osrt -o run --capture-range=cudaProfilerApi python train.py

# Nsight Compute — replay one specific kernel
ncu --set full --target-processes all --kernel-name "rmsnorm_fwd" \
    --launch-skip 50 --launch-count 1 python bench.py

The headline metrics, by tool

Metric	Where to find it	What it tells you
CUDA time / op	torch.profiler table	Which PyTorch line consumes the most kernel time.
CPU time / op	torch.profiler table	If CPU time ≈ CUDA time, you're launch-bound (back to lesson 18).
GPU active %	nsys timeline summary	<80 % → gaps. Look for sync calls, host work, NCCL waits.
NCCL kernel duration	nsys CUDA HW row	Distinguishes "comm is slow" from "comm overlaps badly."
DRAM throughput	ncu "Memory Workload Analysis"	% of HBM peak this kernel achieves. <50 % → memory-side opportunity.
SM throughput	ncu "Compute Workload Analysis"	% of peak FLOPs. Combined with DRAM, places you on the roofline (lesson 21).
Achieved Occupancy	ncu	% of warps the SM is keeping in flight. Low → register/SMEM pressure or tiny grids.
Warp Stall Reasons	ncu "Scheduler Statistics"	Names the dominant stall: MIO Throttle (memory queue), Long Scoreboard (HBM latency), No Eligible (warp underflow), etc.
Register / Thread	ncu launch stats	>128 limits occupancy; actual spills happen above the 255 per-thread cap (or below if compiled with `-maxrregcount`).

NVTX: making profiles readable

By default, profiles label work as "aten::matmul"-shaped strings. That is fine for one op but useless across a layer. NVTX ranges let you annotate code regions; both nsys and torch.profiler render them as named bars on the timeline. Use them whenever you have logical phases the tool can't infer.

import torch.cuda.nvtx as nvtx

nvtx.range_push("transformer_block")
out = self.norm1(x)
nvtx.range_push("attention")
out = self.attn(out)
nvtx.range_pop()
nvtx.range_push("mlp")
out = self.mlp(out)
nvtx.range_pop()
nvtx.range_pop()

Reading a torch.profiler table

After capture, this gives the most useful first look:

print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=20
))

What to scan for:

Top 3 rows by CUDA time: usually attention + GEMMs. If something unexpected is here (a copy, a sort, a cast), that's your easy win.
CPU total ≫ CUDA total: launch-bound. Compile or graph it.
# of Calls: a tiny op called 10⁵ times can dominate. Often a Python loop you didn't notice.
Input Shapes (with record_shapes=True): the same op with two different shapes shows as two rows. Useful for catching shape-driven recompiles.

Reading an nsys timeline

You're looking at three rows that matter:

OS runtime / CPU: Python stack + libc. Big white spans here = the CPU is busy on host code (dataloader, tokenizer, etc.).
CUDA API: when the CPU is calling into CUDA (cudaLaunchKernel, cudaMemcpy, etc.). Dense here = lots of dispatch.
CUDA HW (per stream): the actual GPU work. Gaps here = the GPU is waiting on the CPU or on sync. The fundamental question is "are the gaps long enough to matter."

Read it like a music score

Find the longest stretch of GPU-idle time. Move your eye up to the CPU row at the same instant. The CPU activity at that moment is what's blocking the GPU.

Reading an ncu report

Open the kernel in question. The top section ("GPU Speed Of Light") gives the headline: percent of peak compute and percent of peak memory throughput. Together they place the kernel on the roofline (lesson 21).

If both are low (under ~30 %), the kernel is stalled. Then "Scheduler Statistics" names the stall:

MIO Throttle / Long Scoreboard: waiting on HBM. Fix with better locality, tiling, or fewer reads.
Short Scoreboard: waiting on SMEM or constant memory. Often poor bank-conflict-free access.
No Eligible: not enough warps in flight. Lower register pressure or raise num_warps.
Wait / Sync: __syncthreads() imbalance.
Selected: healthy — the warp ran.

Cost & overhead, honestly

Tool	Overhead	When to keep it on
`torch.profiler` w/o memory	~5–10 %	Always fine for a few warmup+active steps.
`torch.profiler` w/ `profile_memory` & `with_stack`	~30–80 %	Once, to find a specific allocation. Don't ship.
`nsys`	~2–10 %	Production-like reproductions. Capture <30 s of run.
`ncu --set full`	~5–50× per kernel	One kernel at a time, `--launch-count 1`. Never for "profile the whole training step."

Interactive · pick the right tool

Describe your symptom; the widget names the first tool to reach for. This isn't a real classifier; it's a memorization aid.

What this gives you for the next lesson

You can collect data; lesson 21 teaches you what the data means. We'll take a few realistic kernel snapshots and walk them onto the roofline (lesson 01) to decide whether the lever is bandwidth, compute, or launch overhead — and what to do in each case.