Profiling toolkit: three views

Three tools answer three different questions. torch.profiler asks "which line is slow." nsys asks "where are the gaps." ncu asks "why is this kernel slow." Picking the wrong one wastes hours; picking the right one shortens the debugging loop.

The question this lesson answers

"My training step is slow" is not a question a profiler can answer — it is the symptom you walk back from. The actionable question is one of: which CPU/GPU work is on the critical path, where the gaps and syncs hide, and why a specific kernel runs below its roofline. Each tool exists for one of those questions.

The three-tool map

| Tool | Question | View | Scope |
| --- | --- | --- | --- |
| torch.profiler | "Which line is slow?" | Python ops ↔ CUDA kernels; flame graph, op-level table | Single Python process; low overhead; aggregate and per-op stats; runs in CI/notebooks |
| Nsight Systems (nsys) | "Where are the gaps?" | Timeline of CPU threads, CUDA streams, NVLink; NCCL, NVTX ranges, gaps | System-wide timeline; multi-GPU, multi-process; microsecond resolution; offline GUI / .qdrep file (.nsys-rep in newer releases) |
| Nsight Compute (ncu) | "Why is this kernel slow?" | Per-kernel deep dive: occupancy, stalls, throughput | One kernel at a time; replays the kernel; ~5–50× slowdown during capture; hundreds of counters |

What each tool surfaces

| Tool | Captures | Best for | Misses |
| --- | --- | --- | --- |
| torch.profiler | Python op stack, CUDA kernels per op, memory allocations. | "This F.gelu call is the slow one." Op-level attribution. Memory snapshots. | Sub-kernel detail; cross-process activity; sync sources outside autograd. |
| nsys | CPU threads, CUDA streams, kernel launches, NCCL, NVLink, memory copies. Across all processes. | Idle gaps, sync chains, multi-GPU overlap, dataloader stalls. | Why a specific kernel is slow (it shows duration only, not internal metrics). |
| ncu | Per-kernel: roofline, achieved occupancy, memory throughput, warp stall reasons, register pressure. | "This GEMM hits 40 % of peak. Why?" The kernel-author's microscope. | Anything outside the captured kernel; cross-process timing. |

How each one is invoked (cheat sheet)

# torch.profiler — wrap a few steps in your training loop
import torch
from torch.profiler import profile, ProfilerActivity, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./tb'),
    record_shapes=True, with_stack=True, profile_memory=True,
) as prof:
    for step in range(6):
        train_step()
        prof.step()

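# Nsight Systems: launch the run under nsys from the shell.
# With --capture-range=cudaProfilerApi, recording only starts once the script
# calls torch.cuda.profiler.start(); drop the flag to capture the whole run.
nsys profile -t cuda,nvtx,osrt -o run --capture-range=cudaProfilerApi python train.py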

# Nsight Compute — replay one specific kernel
ncu --set full --target-processes all --kernel-name "rmsnorm_fwd" \
    --launch-skip 50 --launch-count 1 python bench.py
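
To see which iterations the torch.profiler schedule in the cheat sheet actually records, the sketch below re-implements the schedule arithmetic in plain Python. It is an illustration of the documented wait/warmup/active/repeat semantics, not the library's code; the function name `profiler_phase` is made up for this example.

```python
def profiler_phase(step, wait, warmup, active, repeat=0, skip_first=0):
    """Return 'none', 'warmup', or 'record' for a zero-based step index.

    Mirrors the cycle described in the torch.profiler.schedule docs:
    skip `skip_first` steps, then repeat (wait, warmup, active) cycles.
    repeat=0 means the cycle continues for the whole run.
    """
    if step < skip_first:
        return "none"
    step -= skip_first
    cycle = wait + warmup + active
    if repeat > 0 and step >= repeat * cycle:
        return "none"  # all requested cycles are finished
    pos = step % cycle
    if pos < wait:
        return "none"
    if pos < wait + warmup:
        return "warmup"
    return "record"


# Same settings as the cheat sheet: 6 loop iterations, one cycle of 1 + 1 + 3.
phases = [profiler_phase(s, wait=1, warmup=1, active=3, repeat=1) for s in range(6)]
print(phases)
# ['none', 'warmup', 'record', 'record', 'record', 'none']
```

With these settings only iterations 2 to 4 are traced; the sixth iteration runs after the single cycle has ended and is not recorded.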

The headline metrics, by tool

| Metric | Where to find it | What it tells you |
| --- | --- | --- |
| CUDA time / op | torch.profiler table | Which PyTorch line consumes the most kernel time. |
| CPU time / op | torch.profiler table | If CPU time is close to or above CUDA time, you're likely launch-bound (back to lesson 18). |
| GPU active % | nsys timeline summary | <80 % → gaps. Look for sync calls, host work, NCCL waits. |
| NCCL kernel duration | nsys CUDA HW row | Distinguishes "comm is slow" from "comm overlaps badly." |
| DRAM throughput | ncu "Memory Workload Analysis" | % of HBM peak this kernel achieves. <50 % → memory-side opportunity. |
| SM throughput | ncu "Compute Workload Analysis" | % of peak compute throughput. Combined with DRAM, places you on the roofline (lesson 21). |
| Achieved Occupancy | ncu "Occupancy" | Active warps per SM as a % of the hardware maximum. Low → register/SMEM pressure or tiny grids. |
| Warp Stall Reasons | ncu "Warp State Statistics" | Names the dominant stall: MIO Throttle (memory I/O instruction queue is full), Long Scoreboard (waiting on global/local memory data, often HBM latency), etc. "Scheduler Statistics" adds No Eligible (no warp was ready to issue). |
| Registers / Thread | ncu "Launch Statistics" | >128 limits occupancy; spills to local memory happen when the kernel needs more than the 255 per-thread cap (or a lower cap set with -maxrregcount). |
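
The registers-per-thread row can be turned into a rough occupancy ceiling. The sketch below assumes an SM with 65,536 registers and a 64-warp residency limit (true of some recent data-center GPUs, but check your architecture), and it ignores register allocation granularity, block size, and shared-memory limits. The function name is illustrative.

```python
def register_limited_occupancy(regs_per_thread,
                               regs_per_sm=65536,    # assumption: 64K registers per SM
                               max_warps_per_sm=64,  # assumption: 64 resident warps max
                               warp_size=32):
    """Upper bound on occupancy when registers are the only limiter."""
    warps_that_fit = regs_per_sm // (regs_per_thread * warp_size)
    resident_warps = min(warps_that_fit, max_warps_per_sm)
    return resident_warps / max_warps_per_sm


for regs in (32, 64, 128, 255):
    print(regs, register_limited_occupancy(regs))
# 32 1.0
# 64 0.5
# 128 0.25
# 255 0.125
```

Under these assumptions, a kernel at 128 registers per thread can keep at most a quarter of the SM's warp slots occupied, which is why that value is a useful warning line.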

NVTX: making profiles readable

By default, profiles label work with op-level names such as "aten::matmul". That is fine for one op but useless across a layer. NVTX ranges let you annotate code regions; nsys renders them as named bars on the timeline (inside torch.profiler, torch.profiler.record_function gives the equivalent named regions). Use them whenever you have logical phases the tool can't infer.

import torch.cuda.nvtx as nvtx

nvtx.range_push("transformer_block")
out = self.norm1(x)
nvtx.range_push("attention")
out = self.attn(out)
nvtx.range_pop()
nvtx.range_push("mlp")
out = self.mlp(out)
nvtx.range_pop()
nvtx.range_pop()
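
Manual push/pop pairs are easy to unbalance when an exception or early return skips a pop. One way to guard against that is a small context manager, sketched below. The helper name `nvtx_range` is made up for this example, and the fallback to a no-op when PyTorch or NVTX support is missing is an assumption about how you want the code to behave off-GPU.

```python
from contextlib import contextmanager

try:
    import torch.cuda.nvtx as _nvtx  # available in CUDA-enabled PyTorch installs
except ImportError:
    _nvtx = None  # no PyTorch: ranges become no-ops


@contextmanager
def nvtx_range(name):
    """Open an NVTX range for the duration of a `with` block."""
    pushed = False
    if _nvtx is not None:
        try:
            _nvtx.range_push(name)
            pushed = True
        except RuntimeError:
            pass  # PyTorch build without NVTX support
    try:
        yield
    finally:
        if pushed:
            _nvtx.range_pop()  # runs even if the body raised


# Nesting mirrors the push/pop example above.
with nvtx_range("transformer_block"):
    with nvtx_range("attention"):
        pass  # out = self.attn(out)
    with nvtx_range("mlp"):
        pass  # out = self.mlp(out)
```

Recent PyTorch releases also provide torch.cuda.nvtx.range as a built-in context manager; the wrapper here mainly adds the no-op fallback.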

Reading a torch.profiler table

After capture, this gives the most useful first look:

print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=20
))

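What to scan for: the ops with the largest total CUDA time, and ops whose CPU time is close to or above their CUDA time, which points at launch overhead (lesson 18).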

Reading an nsys timeline

You're looking at three rows that matter:

Read it like a music score
Find the longest stretch of GPU-idle time. Move your eye up to the CPU row at the same instant. The CPU activity at that moment is what's blocking the GPU.
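
The same reading can be done programmatically once kernel and CPU intervals are exported from a trace. The sketch below is a minimal illustration in plain Python: the interval format, the function names, and the sample timestamps are all invented for the example and do not correspond to any nsys export schema.

```python
def longest_gpu_gap(kernels):
    """kernels: list of (start_us, end_us). Returns the longest idle (start, end) or None."""
    best = None
    busy_until = None
    for start, end in sorted(kernels):
        if busy_until is not None and start > busy_until:
            if best is None or start - busy_until > best[1] - best[0]:
                best = (busy_until, start)
        busy_until = end if busy_until is None else max(busy_until, end)
    return best


def cpu_work_during(gap, cpu_ranges):
    """cpu_ranges: list of (name, start_us, end_us). Returns (name, overlap_us), longest first."""
    g0, g1 = gap
    hits = []
    for name, start, end in cpu_ranges:
        overlap = min(end, g1) - max(start, g0)
        if overlap > 0:
            hits.append((name, overlap))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Made-up timestamps, for illustration only.
kernels = [(0, 100), (110, 200), (650, 800)]
cpu = [("cudaLaunchKernel", 0, 20),
       ("dataloader_next", 190, 640),
       ("cudaLaunchKernel", 640, 650)]

gap = longest_gpu_gap(kernels)
print(gap)                        # (200, 650)
print(cpu_work_during(gap, cpu))  # [('dataloader_next', 440), ('cudaLaunchKernel', 10)]
```

In this toy trace the GPU sits idle for 450 µs, and the CPU range covering most of that window is the dataloader call, which is the kind of conclusion the timeline reading is meant to produce.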

Reading an ncu report

Open the kernel in question. The top section ("GPU Speed Of Light") gives the headline: percent of peak compute and percent of peak memory throughput. Together they place the kernel on the roofline (lesson 21).

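If both are low (under ~30 %), the kernel is stalled rather than limited by compute or bandwidth. The "Warp State Statistics" section then names the dominant stall, for example MIO Throttle or Long Scoreboard, and "Scheduler Statistics" shows how often no warp was eligible to issue.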
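
As a memory aid, the Speed Of Light reading can be written as a small decision function. This is a sketch, not an ncu feature: the ~30 % threshold comes from the text above, while the rule "whichever percentage is higher names the bound" is a simplifying assumption, and the function name is invented.

```python
def classify_kernel(compute_pct, memory_pct, low=30.0):
    """Rough first-pass label from the two Speed Of Light percentages."""
    if compute_pct < low and memory_pct < low:
        # Neither unit is busy: the kernel is waiting on something.
        return "latency-bound: check warp stall reasons"
    if memory_pct >= compute_pct:
        return "memory-bound"
    return "compute-bound"


print(classify_kernel(20, 25))  # latency-bound: check warp stall reasons
print(classify_kernel(40, 85))  # memory-bound
print(classify_kernel(80, 30))  # compute-bound
```

Treat the label as a pointer to the next ncu section to open, not as a diagnosis.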

Cost & overhead, honestly

| Tool | Overhead | When to keep it on |
| --- | --- | --- |
| torch.profiler w/o memory | ~5–10 % | Always fine for a few warmup+active steps. |
| torch.profiler w/ profile_memory & with_stack | ~30–80 % | Once, to find a specific allocation. Don't ship. |
| nsys | ~2–10 % | Production-like reproductions. Capture <30 s of run. |
| ncu --set full | ~5–50× per kernel | One kernel at a time, --launch-count 1. Never for "profile the whole training step." |

What this gives you for the next lesson

You can collect data; lesson 21 teaches you what the data means. We'll take a few realistic kernel snapshots and walk them onto the roofline (lesson 01) to decide whether the lever is bandwidth, compute, or launch overhead — and what to do in each case.