Reading a profile: roofline in practice

Profilers produce hundreds of numbers. Three of them — DRAM throughput, SM throughput, and achieved occupancy — place a kernel on the roofline from lesson 01 and tell you which lever to pull. This lesson is the decision tree you run on every kernel you don't recognize.

The question this lesson answers

An ncu report opens. There are six tabs and 200+ counters. You have 60 seconds before the room moves on. Which numbers do you read, in what order, to name the bottleneck class? And once named, what fix does each class call for?

The 3-number heuristic

For most kernels, three percentages decide everything. Read them in this order:

#	Counter	What it means	Decision rule
1	DRAM Throughput (% peak)	How saturated is HBM during this kernel.	≥70 % → memory-bound (and that's fine for many decode kernels).
2	SM Throughput (% peak)	How saturated tensor cores + math units are.	≥70 % → compute-bound (healthy if expected).
3	Achieved Occupancy	Active warps per SM as a percentage of the hardware maximum, averaged over the kernel.	If (1) and (2) are both low and occupancy is low → latency-bound (too few warps in flight to hide stalls).
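The decision rules in the table can be sketched as a small classifier. This is a minimal sketch, not a profiler feature: the function name, the `low_occupancy` cutoff of 25 %, and the "balanced" and "unclear" labels are assumptions added for illustration. Only the 70 % threshold comes from the table.

```python
def classify_bottleneck(dram_pct, sm_pct, occupancy_pct,
                        busy_threshold=70.0, low_occupancy=25.0):
    """Name the bottleneck class from the three headline percentages.

    busy_threshold is the 70 % rule from the table; low_occupancy is an
    assumed cutoff chosen for illustration only.
    """
    dram_busy = dram_pct >= busy_threshold
    sm_busy = sm_pct >= busy_threshold
    if dram_busy and sm_busy:
        # Not covered by the table; treated here as sitting near the ridge.
        return "balanced"
    if dram_busy:
        return "memory-bound"
    if sm_busy:
        return "compute-bound"
    if occupancy_pct < low_occupancy:
        return "latency-bound"
    # Both throughputs are low but occupancy looks fine: the three numbers
    # alone do not decide, so go read the stall reasons.
    return "unclear"


# The three worked examples from this lesson (DRAM %, SM %, occupancy %).
print(classify_bottleneck(78, 9, 62))
print(classify_bottleneck(22, 83, 40))
print(classify_bottleneck(18, 14, 12))
```

The order of the checks mirrors the reading order: throughput first, occupancy only when both throughputs are low.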

The four-quadrant decision tree

Three worked examples

Example A — RMSNorm (lesson 19 kernel) at N=1024, D=8192

Counter	Value	Reading
DRAM Throughput	78 %	HBM is saturated.
SM Throughput	9 %	Almost no compute — math is one rsqrt and a multiply.
Achieved Occupancy	62 %	Healthy enough to hide HBM latency.
Arithmetic intensity	~0.5 FLOP/byte	Deep in the memory-bound region (break-even ≈ 295 FLOP/byte).

Verdict: memory-bound, performing as expected. Further gains require reducing HBM bytes — only fusion with the upstream/downstream op (e.g. add the residual into the same kernel) or quantization will help. Don't reach for tile-size tuning; the kernel is on its roof.
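To see where numbers like ~0.5 and ≈ 295 come from, here is a back-of-envelope sketch. Everything in it is an assumption for illustration: the peak figures (roughly those of an H100 SXM-class part, chosen because their ratio reproduces the ≈ 295 break-even), the count of 4 FLOPs per element, and the traffic model of one read plus one write per element. Substitute your own GPU's datasheet values.

```python
def ridge_point(peak_flops_per_s, peak_bytes_per_s):
    """Break-even arithmetic intensity in FLOP/byte."""
    return peak_flops_per_s / peak_bytes_per_s


def rmsnorm_intensity(n_rows, d_model, bytes_per_elem, flops_per_elem=4):
    """Rough FLOP/byte for RMSNorm.

    Assumes each element is read once and written once, plus one read of
    the weight vector. flops_per_elem=4 counts square, accumulate,
    scale by rsqrt, and scale by weight; this is an assumed count.
    """
    elems = n_rows * d_model
    flops = flops_per_elem * elems
    bytes_moved = 2 * elems * bytes_per_elem + d_model * bytes_per_elem
    return flops / bytes_moved


PEAK_FLOPS = 989e12   # assumed dense bf16 tensor-core peak, FLOP/s
PEAK_BW = 3.35e12     # assumed HBM bandwidth, bytes/s

ridge = ridge_point(PEAK_FLOPS, PEAK_BW)
ai_fp32 = rmsnorm_intensity(1024, 8192, bytes_per_elem=4)
ai_bf16 = rmsnorm_intensity(1024, 8192, bytes_per_elem=2)
print(f"ridge ~{ridge:.0f} FLOP/byte, fp32 I/O ~{ai_fp32:.2f}, bf16 I/O ~{ai_bf16:.2f}")
```

Under these assumptions fp32 input and output land near 0.5 FLOP/byte and bf16 near 1.0. Either way the kernel sits more than two orders of magnitude left of the ridge, which is why the verdict does not depend on the exact FLOP count.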

Example B — bf16 GEMM 4096×4096×4096

Counter	Value	Reading
DRAM Throughput	22 %	Most data lives in L2 / SMEM across the K-loop.
SM Throughput	83 %	Tensor cores are busy.
Achieved Occupancy	40 %	Low, but that's typical for big-tile GEMMs.
Arithmetic intensity	~340 FLOP/byte	Above the break-even — correctly compute-bound.

Verdict: compute-bound, healthy. Big-tile GEMMs achieve low occupancy by design (each warp does a lot). Don't add warps; that hurts. Levers: fp8 (lesson 16), warp-specialized epilogue, async copies.
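As a sanity check on the intensity row, the best-case figure for a GEMM can be worked out by hand. This is a sketch under a stated assumption: each of A, B, and C crosses the DRAM boundary exactly once, which is the minimum possible traffic. The function name is hypothetical.

```python
def gemm_ideal_intensity(m, n, k, bytes_per_elem=2):
    """Upper bound on FLOP/byte for C[m,n] = A[m,k] @ B[k,n].

    Assumes A and B are each read once and C is written once
    (compulsory traffic only); bytes_per_elem=2 corresponds to bf16.
    """
    flops = 2 * m * n * k                      # one multiply + one add per MAC
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved


ideal = gemm_ideal_intensity(4096, 4096, 4096)
print(f"ideal intensity ~{ideal:.0f} FLOP/byte")
```

For a square bf16 GEMM this reduces to N/3. Because it assumes minimum traffic it is a ceiling: a profiler-derived figure, which the ~340 in the table may well be, comes out lower whenever tiles of A and B are re-fetched from DRAM. Both sit to the right of the ≈ 295 break-even quoted in Example A.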

Example C — a custom Triton kernel with high register pressure

Counter	Value	Reading
DRAM Throughput	18 %	HBM has slack.
SM Throughput	14 %	Tensor cores idle.
Achieved Occupancy	12 %	Almost no warps in flight.
Top Stall Reason	"No Eligible"	Scheduler runs out of ready warps.
Registers / Thread	178	Heavy. Well below the 255-register per-thread cap (past which the compiler spills), but enough to limit how many warps fit per SM (hence the low occupancy above).

Verdict: latency-bound from register pressure. Each warp uses so many registers that the SM can host very few warps, so when one stalls there's nothing else to run. Fixes: smaller tile, fewer accumulators, split the kernel, retune num_warps per program (in Triton, more warps per program spreads a tile's registers over more threads), or compile with a register cap.
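The link between registers per thread and occupancy can be estimated with a few lines of arithmetic. This is a simplified sketch, not the full occupancy calculation: it ignores block size and shared-memory limits, and the hardware constants (65,536 registers per SM, 64 warps per SM, allocation in units of 8 registers per thread, warps granted in groups of 4) are assumed values that you should check against your GPU.

```python
def warps_per_sm_from_registers(regs_per_thread,
                                regs_per_sm=65536,
                                max_warps_per_sm=64,
                                reg_alloc_unit=8,
                                warp_alloc_granularity=4,
                                threads_per_warp=32):
    """Register-limited warps per SM (all hardware constants are assumptions)."""
    # Round the per-thread count up to the allocation unit.
    rounded = -(-regs_per_thread // reg_alloc_unit) * reg_alloc_unit
    regs_per_warp = rounded * threads_per_warp
    # How many whole warps fit in the register file.
    fit = regs_per_sm // regs_per_warp
    # Round down to the warp allocation granularity.
    fit -= fit % warp_alloc_granularity
    return min(fit, max_warps_per_sm)


for regs in (178, 128, 64):
    warps = warps_per_sm_from_registers(regs)
    print(f"{regs} regs/thread -> {warps} warps/SM ({warps / 64:.1%} occupancy)")
```

Under these assumptions 178 registers per thread allows 8 warps, or 12.5 % of 64, in line with the 12 % achieved occupancy in the table. Capping at 128 registers doubles that to 16 warps, which is the effect a register cap is meant to buy, at the risk of spills.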

Stall reasons — the named cures

Stall reason	What's happening	Fix to try
`Long Scoreboard`	Waiting on a long-latency HBM load.	Reduce loads (fusion, locality), use async copies / TMA, prefetch the next tile.
`MIO Throttle`	Memory-instruction queue is full.	Coarsen loads (vectorize 4×fp16 into one 64-bit transaction), reduce store rate.
`Short Scoreboard`	Waiting on SMEM or constant memory.	Check for bank conflicts; consider padding strides.
`No Eligible`	No warps are ready to run.	Increase occupancy: smaller tiles, lower register pressure, more `num_warps`.
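`Wait` / `Barrier`	`Barrier`: warps are parked at `__syncthreads()` waiting for slower warps in the block. `Wait`: a fixed-latency instruction dependency has not resolved yet.	For `Barrier`, move work above the barrier and check for divergent paths reaching the barrier separately. For `Wait`, give the scheduler independent work (more ILP or more active warps).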
`Dispatch Stall`	Instruction issue is starved.	Look for ILP problems — unroll loops, avoid serial dependency chains.
`Selected`	Warp ran. Healthy.	This is the goal.
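The "padding strides" cure for bank conflicts is easiest to see with a tiny model. This sketch assumes 32 shared-memory banks of 4-byte words and a warp of 32 threads, where thread t reads word t × stride (a column walk through a row-major tile). The function name is hypothetical.

```python
from collections import Counter


def conflict_degree(row_stride_words, num_threads=32, num_banks=32):
    """Worst-case number of threads in a warp that hit the same bank.

    Thread t reads the 4-byte word at index t * row_stride_words.
    A result of 1 means conflict-free; 32 means fully serialized.
    """
    banks = Counter((t * row_stride_words) % num_banks
                    for t in range(num_threads))
    return max(banks.values())


print("stride 32:", conflict_degree(32))   # unpadded 32-wide fp32 tile
print("stride 33:", conflict_degree(33))   # same tile padded by one word
```

With a stride of 32 words every thread lands in bank 0; padding each row by one word shifts successive rows to successive banks, so the same column read spreads across all 32.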

The questions you ask in order

1. Where is the kernel on the roofline? (DRAM% vs SM%)
2. Is it where it should be? (Cross-reference with the per-kernel arithmetic intensity from lesson 11.)
3. If yes, stop. The kernel is healthy; look elsewhere in the chain.
4. If no, what's holding it back? Check occupancy and the top stall reason. That names the lever.

Common trap

Don't optimize a kernel that is already at its roofline. A 78 % DRAM-bound RMSNorm is not "slow." It is reading the bytes the math requires at the rate the hardware delivers. Making it faster means moving fewer bytes, for example by fusing it with a neighboring op.

How to walk a real ncu report (60-second tour)

1. "GPU Speed Of Light" panel: read DRAM % and SM %. Place on the roofline.
2. If DRAM is the dominant side: open "Memory Workload Analysis." Look for bank conflicts, achieved load width vs requested, L1/L2 hit rate. Decide if you can move bytes closer to the core (SMEM, L2 reuse) or eliminate them (fusion, quantization).
3. If SM is the dominant side: open "Compute Workload Analysis." Check tensor-core utilization and pipeline busy. If tensor-core util is low while SM% is high, you're spending time on integer/issue work, possibly addressing arithmetic. Larger tiles or fp8 are typical fixes.
4. If both are low: open "Scheduler Statistics." The top stall reason names the cure.
5. Always glance at: register count (spills?), achieved occupancy, theoretical occupancy.
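If you collect the report from the command line, you can restrict collection to the panels each step of the tour needs. The sketch below only builds the argument list. The section identifiers are written as I recall them and should be treated as assumptions (confirm them with `ncu --list-sections`); `bench.py` and the report name are hypothetical.

```python
import shlex

# Assumed ncu section identifiers for each step of the tour.
SECTIONS_BY_STEP = {
    "roofline": ["SpeedOfLight"],
    "memory": ["MemoryWorkloadAnalysis"],
    "compute": ["ComputeWorkloadAnalysis"],
    "latency": ["SchedulerStats", "WarpStateStats"],
    "always": ["LaunchStats", "Occupancy"],
}


def ncu_command(app_argv, steps=("roofline", "always"), report="profile"):
    """Build an ncu argv that collects only the sections for the given steps."""
    cmd = ["ncu", "-o", report]
    for step in steps:
        for section in SECTIONS_BY_STEP[step]:
            cmd += ["--section", section]
    return cmd + list(app_argv)


# First pass: headline numbers only. "bench.py" is a hypothetical script.
print(shlex.join(ncu_command(["python", "bench.py"])))
# Second pass once the first one says "both low".
print(shlex.join(ncu_command(["python", "bench.py"], steps=("latency", "always"))))
```

Collecting fewer sections keeps replay overhead down, which matters when the first pass is only meant to place the kernel on the roofline.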

Roofline view inside ncu

ncu also draws a literal roofline plot. The dot is your kernel and the lines are the hardware limits. A dot to the right of the ridge point is compute-bound; a dot to the left is memory-bound. The vertical gap between the dot and the roof above it is the headroom. Most reports include both fp32 and tensor-core roofs; pick the one matching your math precision.
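The roof itself is one line of arithmetic: attainable throughput is the smaller of the compute peak and intensity times bandwidth. The sketch below uses assumed peak values (989 TFLOP/s and 3.35 TB/s, picked only as an example) and hypothetical function names; it is the textbook roofline model, not ncu's internal computation.

```python
def attainable_flops(intensity, peak_flops_per_s, peak_bytes_per_s):
    """Roofline ceiling in FLOP/s at a given arithmetic intensity (FLOP/byte)."""
    return min(peak_flops_per_s, intensity * peak_bytes_per_s)


def headroom(achieved_flops_per_s, intensity, peak_flops_per_s, peak_bytes_per_s):
    """Fraction of the roof still unused at this intensity (0.0 means on the roof)."""
    roof = attainable_flops(intensity, peak_flops_per_s, peak_bytes_per_s)
    return 1.0 - achieved_flops_per_s / roof


PEAK_FLOPS = 989e12   # assumed compute peak, FLOP/s
PEAK_BW = 3.35e12     # assumed memory bandwidth, bytes/s

# Left of the ridge the roof is the sloped bandwidth line.
print(attainable_flops(0.5, PEAK_FLOPS, PEAK_BW))
# Right of the ridge the roof is the flat compute line.
print(attainable_flops(340, PEAK_FLOPS, PEAK_BW))
# A kernel at intensity 0.5 that achieves 1.34e12 FLOP/s.
print(headroom(1.34e12, 0.5, PEAK_FLOPS, PEAK_BW))
```

Reading headroom against the wrong roof (say, fp32 for a tensor-core kernel) gives a misleadingly small gap, which is why the choice of roof matters.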

What this gives you for the next lesson

You can collect data (lesson 20) and read it (this lesson). The natural next move is to write a kernel that fixes what the data told you to fix. Lesson 22 covers Triton — the workhorse for fast custom kernels in PyTorch — and walks through the design loop: profile → hypothesis → kernel → re-profile.