Occupancy — registers, warps, latency hiding

An SM holds many warps and switches between them every cycle. More warps → better latency hiding → higher throughput, but each warp costs registers and shared memory. The non-obvious truth: maximum occupancy isn't always fastest.

The definition

Occupancy is the ratio of resident warps on an SM to the architectural maximum. On H100, that maximum is 64 warps per SM (2048 threads). If your kernel runs 32 resident warps per SM, that's 50% occupancy.

"Resident" means the warp is on the SM with its register state and SMEM allocated, ready to run. The hardware switches between resident warps every cycle to hide load latency (lesson 20).

The three limits on how many warps an SM can hold

For your kernel to be resident, the SM must have enough of three resources:

Resource	Per SM (H100)	Per warp	Cap on warps
Registers (32-bit)	65,536	32 · registers_per_thread	2048 / registers_per_thread
Shared memory	228 KB (configurable)	smem_per_block / warps_per_block	varies
Architectural max	—	—	64

The actual resident count is the minimum of these three. Whichever resource binds first sets your occupancy.

Register pressure

If your kernel uses 64 registers per thread, then 2048 / 64 = 32 warps fit — 50% occupancy.

If 128 registers per thread → 16 warps = 25% occupancy.

If 32 registers per thread → 64 warps = 100% occupancy.

The compiler decides the register count per thread based on the kernel code. Heavily-fused or arithmetic-rich kernels tend to use more registers. You see the count with nvcc -Xptxas -v:

ptxas info: Function _Z6kernelPf, 0 bytes smem, 4096 bytes lmem
ptxas info: Used 128 registers, 320 bytes cmem[0]

You can pressure the compiler with __launch_bounds__:

__launch_bounds__(256, 4)   // 256 threads/block, target ≥ 4 blocks/SM
__global__ void my_kernel(...) { ... }

The second argument is a register-budgeting target — the compiler tries to use few enough registers per thread that 4 blocks of 256 threads (32 warps) can fit. It's a hint, not a hard guarantee (the runtime still applies the usual occupancy caps). The compiler obeys by spilling registers to local memory if needed. Spilling is bad (local memory is backed by HBM, though small spills can hit L1/L2 first) so the constraint can backfire — only set it if you know your kernel benefits from more warps.

SMEM pressure

Each block uses some SMEM (lesson 24's tile, plus communication buffers). Total SMEM across resident blocks can't exceed 228 KB. If each block uses 32 KB of SMEM, only 7 blocks fit. With 256 threads/block = 8 warps, that's 56 warps = 87.5% occupancy.

Hard cap

64 warps. Even if registers and SMEM allow more, you can't have more than 64 resident warps on H100.

2D · the SM as a resource budget

Picture one SM. It has three "buckets" that all compete to limit how many warps can be resident: a register file (64K × 32-bit), a shared-memory pool (228 KB), and a hard cap on warp slots (64). Move the sliders; watch each bucket fill and see which one binds first.

The latency-hiding payoff

An HBM load takes ~600 cycles. If only 1 warp is resident, the SM stalls for those 600 cycles waiting for the data. If 16 warps are resident, the scheduler issues an instruction from a different warp every cycle; 16 cycles cover one warp's whole load latency.

So more warps = better latency hiding. The roofline-with-occupancy interpretation: at low occupancy, even compute-bound kernels can be latency-bound because the SM lacks ready work.

Asymptotically, you need enough warps to cover the longest pipeline. For an SM with 4 warp schedulers (one per partition), and typical ALU pipeline depth ~4–6 cycles, you need roughly that many warps per scheduler to cover ordinary math latency — well below full occupancy. HBM latency is the long pole (hundreds of cycles); covering it asks for many warps × many outstanding loads, which is why memory-bound kernels actually benefit from high occupancy.

Animated · latency hiding, low vs high occupancy

The cleanest demonstration of why occupancy matters. Each row is one warp's timeline. Black gaps are stalls waiting on HBM; coloured stretches are useful issue. With one warp, the gap is fully exposed. With many warps, the scheduler fills it from another warp — the SM is always busy. Toggle the slider to compare directly.

The "high occupancy isn't always fastest" paradox

Volkov's famous 2010 paper "Better Performance at Lower Occupancy" demonstrated that some kernels run faster with fewer warps. Why?

More registers per thread = more work per thread. Each thread can keep more values in registers, do more work between HBM accesses, increase per-thread arithmetic intensity.
Instruction-level parallelism within one warp. A single warp can have multiple instructions in flight simultaneously (different pipeline stages). 16 warps × deep ILP can match 64 warps × shallow ILP.
Spilling avoided. Forcing high occupancy → fewer registers per thread → spills to local memory → HBM traffic → slower.

The right phrase isn't "more occupancy is better" but "enough occupancy is necessary." Get to ~50% on H100 and additional occupancy yields diminishing returns; below ~25% you're usually latency-bound and should investigate.

Animated · registers ↔ warps ↔ throughput curve

The slider moves "registers per thread" continuously. As registers go up, fewer warps fit (occupancy drops), but each thread carries more work. The throughput curve is a hump: low end is latency-bound (too few warps), high end is register-spill-bound. The sweet spot is in the middle.

Per-thread work — register tiling

From lesson 24, the tiled matmul has one thread per output element. With T=16, that's 256 threads/block. To increase per-thread work, you can have one thread compute multiple output elements:

// each thread computes a 4×4 sub-tile of C in registers
float acc[4][4] = {0};
for (int kBlock = 0; kBlock < K; kBlock += T) {
    // load A and B tiles cooperatively into SMEM
    __syncthreads();
    for (int k = 0; k < T; k++) {
        float a_col[4] = { As[r*4+0][k], As[r*4+1][k], As[r*4+2][k], As[r*4+3][k] };
        float b_row[4] = { Bs[k][c*4+0], Bs[k][c*4+1], Bs[k][c*4+2], Bs[k][c*4+3] };
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                acc[i][j] += a_col[i] * b_row[j];
    }
    __syncthreads();
}
// write out 16 elements per thread

16 output elements per thread → 16× FLOPs per SMEM load → much higher arithmetic intensity. The cost: acc takes 16 fp32 registers per thread, plus the a_col/b_row staging arrays, loop counters, and address temps. A realistic register footprint here is around ~64 registers/thread → on H100 (65,536 regs/SM, 2048 thread cap) that gives 65536/64 = 1024 threads/SM = 32 warps vs the 64-warp cap → ~50% occupancy. Real CUTLASS GEMM kernels push per-thread tiles of 8×8 or 16×16 in registers and routinely use 80–128 registers/thread, dropping occupancy below 50% — but the increase in compute-per-load dominates the occupancy loss, so wall-clock still improves (the Volkov regime).

Diagnosing — Nsight Compute is your friend

Two metrics to look at:

sm__warps_active — average resident warps. Divide by 64 for H100 to get occupancy.
sm__pipe_alu_cycles_active.avg.pct_of_peak_sustained_active — fraction of cycles the ALU pipeline was issuing. Low = stalled.

If occupancy is low and ALU active is low, you're latency-bound and need more warps. If occupancy is low but ALU active is high, you're already getting the work done — leave it alone (this is the Volkov regime).

Putting it together — the optimisation procedure

Compile with -Xptxas -v, note registers and SMEM per kernel.
Compute occupancy: min(2048/reg_per_thread, 228KB/smem_per_block · block_size, 64).
If occupancy < 25% AND you're latency-bound (low ALU active in Nsight): try __launch_bounds__, look for register-heavy expressions to refactor.
If occupancy < 50% AND you're compute-bound: you may already be optimal (Volkov regime).
If occupancy is high but throughput is low: look elsewhere — coalescing (lesson 23), bank conflicts, divergence (25).

The two-knob mental model

Knob	Direction	Effect on occupancy	Effect on per-thread work
Threads per block	↑	Higher (more warps per block)	Same per thread
Registers per thread	↑	Lower	More — bigger working set
SMEM per block	↑	Lower (fewer resident blocks)	More — bigger tile
Block count	↑	Doesn't change per-SM occupancy	—

The two main levers — registers and SMEM per block — both buy per-thread work at the cost of occupancy. The sweet spot for ML kernels is usually around 50% occupancy with substantial per-thread work via register tiling. Triton's autotuner sweeps exactly these axes (lesson 17's num_warps, num_stages, BLOCK_* knobs).

Interactive · occupancy calculator

Pick registers per thread, SMEM per block, threads per block. The widget computes which resource binds and the resulting occupancy.

Takeaway

Occupancy is "how many warps fit on an SM at once." It's bounded by registers, SMEM, and a hard cap. More occupancy = better latency hiding but possibly less work per thread. The right occupancy is enough to hide latency, not maximum possible. Often ~50% with strong per-thread work (register tiling, deep instruction-level parallelism) beats 100% with minimal per-thread work. Triton's autotuner sweeps the knobs for you; hand-CUDA programmers do it with profilers.