Reading a profile: roofline in practice
Profilers produce hundreds of numbers. Three of them — DRAM throughput, SM throughput, and achieved occupancy — place a kernel on the roofline from lesson 01 and tell you which lever to pull. This lesson is the decision tree you run on every kernel you don't recognize.
The question this lesson answers
An ncu report opens. There are six tabs and 200+ counters. You have 60 seconds before the room moves on. Which numbers do you read, in what order, to name the bottleneck class? And once named, what fix does each class call for?
The 3-number heuristic
For most kernels, three percentages decide everything. Read them in this order:
| # | Counter | What it means | Decision rule |
|---|---|---|---|
| 1 | DRAM Throughput (% peak) | How saturated HBM is during this kernel. | ≥70 % → memory-bound (and that's fine for many decode kernels). |
| 2 | SM Throughput (% peak) | How saturated the tensor cores and math units are. | ≥70 % → compute-bound (healthy if expected). |
| 3 | Achieved Occupancy | Average warps in flight per SM, as a % of the hardware maximum. | If (1) and (2) are both low and occupancy is low → latency-bound (too few warps to hide latency). |
The four-quadrant decision tree
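A minimal Python sketch of the tree, using the 70 % thresholds from the heuristic table. The function name, the 50 % occupancy cutoff, and the labels for the "both high" and "both low with healthy occupancy" branches are assumptions made for illustration; they are not ncu output or fixed rules.

```python
def classify(dram_pct: float, sm_pct: float, occupancy_pct: float,
             busy: float = 70.0, low_occupancy: float = 50.0) -> str:
    """Name the bottleneck class from the three headline percentages.

    busy: threshold from the heuristic table (>= 70 % of peak).
    low_occupancy: assumed cutoff for "occupancy is low"; tune per GPU.
    """
    dram_busy = dram_pct >= busy
    sm_busy = sm_pct >= busy

    if dram_busy and sm_busy:
        # Assumed label: both roofs are close, the kernel sits near the ridge.
        return "balanced"
    if dram_busy:
        return "memory-bound"
    if sm_busy:
        return "compute-bound"
    if occupancy_pct < low_occupancy:
        # Neither unit is saturated and there are few warps to hide latency.
        return "latency-bound"
    # Assumed label: both low but occupancy looks fine; read the stall reasons.
    return "underutilized"


if __name__ == "__main__":
    # The (DRAM %, SM %, occupancy %) triples from the worked examples.
    for name, triple in [("RMSNorm", (78, 9, 62)),
                         ("bf16 GEMM", (22, 83, 40)),
                         ("Triton kernel", (18, 14, 12))]:
        print(name, "->", classify(*triple))
```

Occupancy is only consulted when neither throughput is high, which is why a big-tile GEMM with 40 % occupancy still classifies as compute-bound.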
Three worked examples
Example A — RMSNorm (lesson 19 kernel) at N=1024, D=8192
| Counter | Value | Reading |
|---|---|---|
| DRAM Throughput | 78 % | HBM is saturated. |
| SM Throughput | 9 % | Almost no compute — math is one rsqrt and a multiply. |
| Achieved Occupancy | 62 % | Healthy enough to hide HBM latency. |
| Arithmetic intensity | ~0.5 FLOP/byte | Well into the memory-bound region (break-even ≈ 295 FLOP/byte). |
Verdict: memory-bound, performing as expected. Further gains require reducing HBM bytes — only fusion with the upstream/downstream op (e.g. add the residual into the same kernel) or quantization will help. Don't reach for tile-size tuning; the kernel is on its roof.
Example B — bf16 GEMM 4096×4096×4096
| Counter | Value | Reading |
|---|---|---|
| DRAM Throughput | 22 % | Most data lives in L2 / SMEM across the K-loop. |
| SM Throughput | 83 % | Tensor cores are busy. |
| Achieved Occupancy | 40 % | Low, but that's typical for big-tile GEMMs. |
| Arithmetic intensity | ~340 FLOP/byte | Above the break-even — correctly compute-bound. |
Verdict: compute-bound, healthy. Big-tile GEMMs achieve low occupancy by design (each warp does a lot). Don't add warps; that hurts. Levers: fp8 (lesson 16), warp-specialized epilogue, async copies.
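Both verdicts so far can be checked against the roofline formula, attainable FLOP/s = min(peak FLOP/s, intensity × peak bandwidth). The sketch below assumes peaks of about 989 TFLOP/s dense bf16 and 3.35 TB/s HBM. These figures are not stated in this lesson; they were picked because they reproduce the ≈295 FLOP/byte break-even quoted in Example A. Substitute the numbers for your own GPU.

```python
# Assumed hardware peaks (illustrative only, not taken from the lesson).
PEAK_FLOPS = 989e12   # dense bf16 tensor-core FLOP/s
PEAK_BW = 3.35e12     # HBM bytes/s


def ridge_point(peak_flops: float = PEAK_FLOPS, peak_bw: float = PEAK_BW) -> float:
    """Break-even arithmetic intensity in FLOP/byte."""
    return peak_flops / peak_bw


def attainable_flops(intensity: float, peak_flops: float = PEAK_FLOPS,
                     peak_bw: float = PEAK_BW) -> float:
    """Roofline ceiling for a kernel with the given FLOP/byte intensity."""
    return min(peak_flops, intensity * peak_bw)


def roofline_class(intensity: float, peak_flops: float = PEAK_FLOPS,
                   peak_bw: float = PEAK_BW) -> str:
    """Left of the ridge is memory-bound, right of it is compute-bound."""
    if intensity >= ridge_point(peak_flops, peak_bw):
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.0f} FLOP/byte")
    # Intensities taken from the two example tables.
    for name, ai in [("RMSNorm", 0.5), ("bf16 GEMM", 340.0)]:
        ceiling = attainable_flops(ai) / 1e12
        print(f"{name}: {roofline_class(ai)}, ceiling {ceiling:.1f} TFLOP/s")
```

Under these assumed peaks the RMSNorm kernel's ceiling is set entirely by bandwidth, which is why only removing bytes can speed it up.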
Example C — a custom Triton kernel with high register pressure
| Counter | Value | Reading |
|---|---|---|
| DRAM Throughput | 18 % | HBM has slack. |
| SM Throughput | 14 % | Tensor cores idle. |
| Achieved Occupancy | 12 % | Almost no warps in flight. |
| Top Stall Reason | "No Eligible" | Scheduler runs out of ready warps. |
| Registers / Thread | 178 | Heavy: well below the 255-register per-thread limit, but enough to limit how many warps fit per SM (hence the low occupancy above). |
Verdict: latency-bound from register pressure. Each warp uses so many registers that the SM can host very few warps, so when one stalls there's nothing else to run. Fixes: smaller tile, fewer accumulators, split the kernel, retune num_warps per program (in Triton, more warps spread the tile over more threads, which usually lowers registers per thread), or compile with a register cap.
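To see why 178 registers per thread caps occupancy, here is a rough upper-bound calculation. It assumes 65,536 32-bit registers per SM, at most 64 resident warps per SM, and registers allocated in steps of 8 per thread. These values are typical of recent NVIDIA datacenter GPUs but are assumptions here; check them for your part. The sketch ignores shared-memory limits and block-size granularity, so the real theoretical occupancy can be lower.

```python
def register_limited_warps(regs_per_thread: int,
                           regs_per_sm: int = 65536,     # assumed register file size
                           max_warps_per_sm: int = 64,   # assumed hardware warp limit
                           warp_size: int = 32,
                           alloc_granularity: int = 8) -> int:
    """Upper bound on resident warps per SM imposed by register usage alone."""
    # Round the per-thread count up to the assumed allocation granularity.
    rounded = -(-regs_per_thread // alloc_granularity) * alloc_granularity
    regs_per_warp = rounded * warp_size
    return min(max_warps_per_sm, regs_per_sm // regs_per_warp)


def register_limited_occupancy(regs_per_thread: int,
                               max_warps_per_sm: int = 64) -> float:
    """Same bound expressed as a fraction of the hardware maximum."""
    warps = register_limited_warps(regs_per_thread,
                                   max_warps_per_sm=max_warps_per_sm)
    return warps / max_warps_per_sm


if __name__ == "__main__":
    for regs in (32, 64, 128, 178, 255):
        warps = register_limited_warps(regs)
        pct = 100 * register_limited_occupancy(regs)
        print(f"{regs:3d} regs/thread -> at most {warps:2d} warps/SM ({pct:.0f} %)")
```

Under these assumptions, 178 registers per thread allows at most 11 of 64 warps, about 17 %, so the measured 12 % sits just under the register-imposed ceiling.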
Stall reasons — the named cures
| Stall reason | What's happening | Fix to try |
|---|---|---|
| Long Scoreboard | Waiting on a long-latency HBM load. | Reduce loads (fusion, locality), use async copies / TMA, prefetch the next tile. |
| MIO Throttle | Memory-instruction queue is full. | Coarsen loads (vectorize 4×fp16 into one 64-bit transaction), reduce store rate. |
| Short Scoreboard | Waiting on SMEM or constant memory. | Check for bank conflicts; consider padding strides. |
| No Eligible | No warps are ready to run. | Increase occupancy: smaller tiles, lower register pressure, more num_warps. |
| Wait / Barrier | __syncthreads() imbalance. | Move work above the barrier; check for divergent paths reaching the barrier separately. |
| Dispatch Stall | Instruction issue is starved. | Look for ILP problems — unroll loops, avoid serial dependency chains. |
| Selected | Warp ran. Healthy. | This is the goal. |
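The table can be used as a lookup: find the stall reason with the most samples, skip "Selected" (the warp actually issued), and read off the first fix. A small sketch follows. The sample counts are invented for illustration, and the dictionary keys follow this table's labels, which may not match the exact metric names in an ncu export.

```python
# First fix per stall reason, paraphrased from the table above.
FIRST_FIX = {
    "Long Scoreboard": "reduce HBM loads, use async copies / TMA, prefetch the next tile",
    "MIO Throttle": "vectorize loads, reduce store rate",
    "Short Scoreboard": "check SMEM bank conflicts, pad strides",
    "No Eligible": "increase occupancy: smaller tiles, lower register pressure",
    "Wait / Barrier": "move work above the barrier, check divergent paths",
    "Dispatch Stall": "improve ILP: unroll loops, break dependency chains",
}


def top_stall(samples: dict) -> str:
    """Return the stall reason with the most samples, ignoring 'Selected'."""
    stalls = {reason: n for reason, n in samples.items() if reason != "Selected"}
    if not stalls:
        raise ValueError("no stall samples to rank")
    return max(stalls, key=stalls.get)


def first_fix(samples: dict) -> tuple:
    """Pair the top stall reason with the first fix to try."""
    reason = top_stall(samples)
    return reason, FIRST_FIX.get(reason, "not in the table; consult the profiler docs")


if __name__ == "__main__":
    # Hypothetical per-reason sample counts, for illustration only.
    samples = {"Selected": 120, "Long Scoreboard": 540,
               "Short Scoreboard": 60, "Wait / Barrier": 35}
    print(first_fix(samples))
```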
The questions you ask in order
- Where is the kernel on the roofline? (DRAM% vs SM%)
- Is it where it should be? (cross-reference with the per-kernel arithmetic intensity from lesson 11)
- If yes — stop. The kernel is healthy; look elsewhere in the chain.
- If no — what's holding it back? Check occupancy and the top stall reason. That names the lever.
How to walk a real ncu report (60-second tour)
- "GPU Speed Of Light" panel: read DRAM % and SM %. Place on the roofline.
- If DRAM is the dominant side: open "Memory Workload Analysis." Look for bank conflicts, achieved load width vs requested, L1/L2 hit rate. Decide if you can move bytes closer to the core (SMEM, L2 reuse) or eliminate them (fusion, quantization).
- If SM is the dominant side: open "Compute Workload Analysis." Check tensor-core utilization and pipeline busy. If tensor-core util is low while SM% is high, you're spending time on integer/issue work — possibly addressing arithmetic. Larger tiles or fp8 are typical fixes.
- If both are low: open "Scheduler Statistics." The top stall reason names the cure.
- Always glance at: register count (spills?), achieved occupancy, theoretical occupancy.
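The first branch of the tour, choosing which section to open after "GPU Speed Of Light", can be sketched as a small function. Reusing the 70 % threshold to decide when "both are low" is an assumption borrowed from the heuristic table, and the function name is hypothetical.

```python
def next_panel(dram_pct: float, sm_pct: float, busy: float = 70.0) -> str:
    """Pick the ncu section to open after reading the Speed Of Light panel."""
    if dram_pct < busy and sm_pct < busy:
        # Neither unit is saturated: the stall reasons name the cure.
        return "Scheduler Statistics"
    if dram_pct >= sm_pct:
        # DRAM is the dominant side.
        return "Memory Workload Analysis"
    # SM is the dominant side.
    return "Compute Workload Analysis"


if __name__ == "__main__":
    # (DRAM %, SM %) pairs from the three worked examples.
    for name, (dram, sm) in [("RMSNorm", (78, 9)),
                             ("bf16 GEMM", (22, 83)),
                             ("Triton kernel", (18, 14))]:
        print(f"{name}: open '{next_panel(dram, sm)}'")
```

Whatever the function returns, the last step of the tour still applies: check register count, achieved occupancy, and theoretical occupancy.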
Roofline view inside ncu
ncu includes a literal roofline plot. The dot is your kernel; the lines are the hardware limits. A dot to the right of the ridge point is compute-bound; a dot to the left is memory-bound. The vertical distance from the dot up to the roof is the headroom. Most reports include both fp32 and tensor-core roofs; pick the one matching your math precision.
What this gives you for the next lesson
You can collect data (lesson 20) and read it (this lesson). The natural next move is to write a kernel that fixes what the data told you to fix. Lesson 22 covers Triton — the workhorse for fast custom kernels in PyTorch — and walks through the design loop: profile → hypothesis → kernel → re-profile.