gpu_kernel_serving / 08 · occupancy lesson 8 / 17

Occupancy — registers, warps, latency hiding

An SM holds many warps and switches between them every cycle. More warps → better latency hiding → higher throughput, but each warp costs registers and shared memory. The non-obvious truth: maximum occupancy isn't always fastest.

The definition

Occupancy is the ratio of resident warps on an SM to the architectural maximum. On H100, that maximum is 64 warps per SM (2048 threads). If your kernel runs 32 resident warps per SM, that's 50% occupancy.

"Resident" means the warp is on the SM with its register state and SMEM allocated, ready to run. The hardware switches between resident warps every cycle to hide load latency (lesson 20).

The three limits on how many warps an SM can hold

For your kernel to be resident, the SM must have enough of three resources:

ResourcePer SM (H100)Per warpCap on warps
Registers (32-bit)65,53632 · registers_per_thread2048 / registers_per_thread
Shared memory228 KB (configurable)smem_per_block / warps_per_blockvaries
Architectural max64

The actual resident count is the minimum of these three. Whichever resource binds first sets your occupancy.

Register pressure

If your kernel uses 64 registers per thread, then 2048 / 64 = 32 warps fit — 50% occupancy.

If 128 registers per thread → 16 warps = 25% occupancy.

If 32 registers per thread → 64 warps = 100% occupancy.

The compiler decides the register count per thread based on the kernel code. Heavily-fused or arithmetic-rich kernels tend to use more registers. You see the count with nvcc -Xptxas -v:

ptxas info: Function _Z6kernelPf, 0 bytes smem, 4096 bytes lmem
ptxas info: Used 128 registers, 320 bytes cmem[0]

You can pressure the compiler with __launch_bounds__:

__launch_bounds__(256, 4)   // 256 threads/block, target ≥ 4 blocks/SM
__global__ void my_kernel(...) { ... }

The second argument is a register-budgeting target — the compiler tries to use few enough registers per thread that 4 blocks of 256 threads (32 warps) can fit. It's a hint, not a hard guarantee (the runtime still applies the usual occupancy caps). The compiler obeys by spilling registers to local memory if needed. Spilling is bad (local memory is backed by HBM, though small spills can hit L1/L2 first) so the constraint can backfire — only set it if you know your kernel benefits from more warps.

SMEM pressure

Each block uses some SMEM (lesson 24's tile, plus communication buffers). Total SMEM across resident blocks can't exceed 228 KB. If each block uses 32 KB of SMEM, only 7 blocks fit. With 256 threads/block = 8 warps, that's 56 warps = 87.5% occupancy.

Hard cap

64 warps. Even if registers and SMEM allow more, you can't have more than 64 resident warps on H100.

2D · the SM as a resource budget

Picture one SM. It has three "buckets" that all compete to limit how many warps can be resident: a register file (64K × 32-bit), a shared-memory pool (228 KB), and a hard cap on warp slots (64). Move the sliders; watch each bucket fill and see which one binds first.

SM resource budget · 3 caps, 1 binds
Each bar shows how full a resource is at the current per-thread / per-block settings. The bar that fills first sets your occupancy.
resident warps
occupancy
binding bucket
resident blocks

The latency-hiding payoff

An HBM load takes ~600 cycles. If only 1 warp is resident, the SM stalls for those 600 cycles waiting for the data. If 16 warps are resident, the scheduler issues an instruction from a different warp every cycle; 16 cycles cover one warp's whole load latency.

So more warps = better latency hiding. The roofline-with-occupancy interpretation: at low occupancy, even compute-bound kernels can be latency-bound because the SM lacks ready work.

Asymptotically, you need enough warps to cover the longest pipeline. For an SM with 4 warp schedulers (one per partition), and typical ALU pipeline depth ~4–6 cycles, you need roughly that many warps per scheduler to cover ordinary math latency — well below full occupancy. HBM latency is the long pole (hundreds of cycles); covering it asks for many warps × many outstanding loads, which is why memory-bound kernels actually benefit from high occupancy.

Animated · latency hiding, low vs high occupancy

The cleanest demonstration of why occupancy matters. Each row is one warp's timeline. Black gaps are stalls waiting on HBM; coloured stretches are useful issue. With one warp, the gap is fully exposed. With many warps, the scheduler fills it from another warp — the SM is always busy. Toggle the slider to compare directly.

Latency hiding · warp timelines
Each warp issues a few instructions, then stalls 30 cycles on an HBM load. With enough warps, those stalls disappear into the schedule.
SM utilisation
cycles with issue
exposed stall %
verdict

The "high occupancy isn't always fastest" paradox

Volkov's famous 2010 paper "Better Performance at Lower Occupancy" demonstrated that some kernels run faster with fewer warps. Why?

The right phrase isn't "more occupancy is better" but "enough occupancy is necessary." Get to ~50% on H100 and additional occupancy yields diminishing returns; below ~25% you're usually latency-bound and should investigate.

Animated · registers ↔ warps ↔ throughput curve

The slider moves "registers per thread" continuously. As registers go up, fewer warps fit (occupancy drops), but each thread carries more work. The throughput curve is a hump: low end is latency-bound (too few warps), high end is register-spill-bound. The sweet spot is in the middle.

Tradeoff curve · registers vs occupancy vs throughput
Toy model: throughput = min(latency_hide_factor, per_thread_work_factor). Latency hiding saturates around 16–24 warps; per-thread work rises ~linearly with registers until spills kick in.
resident warps
latency hidden?
per-thread work
throughput

Per-thread work — register tiling

From lesson 24, the tiled matmul has one thread per output element. With T=16, that's 256 threads/block. To increase per-thread work, you can have one thread compute multiple output elements:

// each thread computes a 4×4 sub-tile of C in registers
float acc[4][4] = {0};
for (int kBlock = 0; kBlock < K; kBlock += T) {
    // load A and B tiles cooperatively into SMEM
    __syncthreads();
    for (int k = 0; k < T; k++) {
        float a_col[4] = { As[r*4+0][k], As[r*4+1][k], As[r*4+2][k], As[r*4+3][k] };
        float b_row[4] = { Bs[k][c*4+0], Bs[k][c*4+1], Bs[k][c*4+2], Bs[k][c*4+3] };
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                acc[i][j] += a_col[i] * b_row[j];
    }
    __syncthreads();
}
// write out 16 elements per thread

16 output elements per thread → 16× FLOPs per SMEM load → much higher arithmetic intensity. The cost: acc takes 16 fp32 registers per thread, plus the a_col/b_row staging arrays, loop counters, and address temps. A realistic register footprint here is around ~64 registers/thread → on H100 (65,536 regs/SM, 2048 thread cap) that gives 65536/64 = 1024 threads/SM = 32 warps vs the 64-warp cap → ~50% occupancy. Real CUTLASS GEMM kernels push per-thread tiles of 8×8 or 16×16 in registers and routinely use 80–128 registers/thread, dropping occupancy below 50% — but the increase in compute-per-load dominates the occupancy loss, so wall-clock still improves (the Volkov regime).

Diagnosing — Nsight Compute is your friend

Two metrics to look at:

If occupancy is low and ALU active is low, you're latency-bound and need more warps. If occupancy is low but ALU active is high, you're already getting the work done — leave it alone (this is the Volkov regime).

Putting it together — the optimisation procedure

  1. Compile with -Xptxas -v, note registers and SMEM per kernel.
  2. Compute occupancy: min(2048/reg_per_thread, 228KB/smem_per_block · block_size, 64).
  3. If occupancy < 25% AND you're latency-bound (low ALU active in Nsight): try __launch_bounds__, look for register-heavy expressions to refactor.
  4. If occupancy < 50% AND you're compute-bound: you may already be optimal (Volkov regime).
  5. If occupancy is high but throughput is low: look elsewhere — coalescing (lesson 23), bank conflicts, divergence (25).

The two-knob mental model

KnobDirectionEffect on occupancyEffect on per-thread work
Threads per blockHigher (more warps per block)Same per thread
Registers per threadLowerMore — bigger working set
SMEM per blockLower (fewer resident blocks)More — bigger tile
Block countDoesn't change per-SM occupancy

The two main levers — registers and SMEM per block — both buy per-thread work at the cost of occupancy. The sweet spot for ML kernels is usually around 50% occupancy with substantial per-thread work via register tiling. Triton's autotuner sweeps exactly these axes (lesson 17's num_warps, num_stages, BLOCK_* knobs).

Interactive · occupancy calculator

Pick registers per thread, SMEM per block, threads per block. The widget computes which resource binds and the resulting occupancy.

Occupancy calculator
For H100 (2048 thread / 65536 register / 228 KB SMEM caps per SM). Picks the binding resource and computes resident warps.
resident warps
occupancy
binding resource
resident blocks
Takeaway
Occupancy is "how many warps fit on an SM at once." It's bounded by registers, SMEM, and a hard cap. More occupancy = better latency hiding but possibly less work per thread. The right occupancy is enough to hide latency, not maximum possible. Often ~50% with strong per-thread work (register tiling, deep instruction-level parallelism) beats 100% with minimal per-thread work. Triton's autotuner sweeps the knobs for you; hand-CUDA programmers do it with profilers.