gpu_kernel_serving / 01 · GPU execution model lesson 1 / 17

GPU execution model — SMs, warps, blocks, grids

A GPU is not "a bunch of cores." It's a hardware lattice with a specific shape, and every CUDA programming concept maps onto a specific piece of that lattice. Get the mapping right and the rest of CUDA is mostly addressing arithmetic.

The hardware, named

An H100 GPU contains 132 streaming multiprocessors (SMs). Each SM is a small, mostly-independent processor with:

So the H100 has 16,896 fp32 ALUs (132 × 128) and 528 tensor cores (132 × 4). The marketing number "16,896 CUDA cores" comes from this multiplication. Don't think of the GPU as "a bunch of cores" — think of it as 132 SMs that each schedule work in 32-thread groups.

The software-to-hardware mapping

CUDA exposes four levels:

Software conceptMaps to hardwareSize
ThreadOne ALU/tensor-core lane in an SM partition1
Warp (implicit)32 threads that execute the same instruction together (SIMT)32 threads
Thread block (a.k.a. CTA)1–32 warps on one SM≤ 1024 threads
GridMany blocks, scheduled across all SMsup to billions of blocks

When you launch a kernel k<<<grid, block>>>(...), the runtime distributes the grid's blocks to SMs. Each SM holds several blocks resident simultaneously (limited by registers/SMEM); blocks not yet resident wait in a queue. Once a block is assigned to an SM, it stays there until it completes — no migration.

grid → blocks → warps → threads GRID (e.g., 1024 blocks) block 0 (SM 5) 128 threads · 4 warps block 1 (SM 12) block 2 (SM 5) w0 w1 w2 w3 a warp = 32 threads in lockstep; warps share registers + SMEM with siblings in the same block blocks 0 and 2 happen to land on the same SM; block 1 on a different one — the scheduler chooses

SIMT — the source of warp-level performance

"Single Instruction Multiple Threads": the 32 threads of a warp execute the same instruction at the same time, each on its own data lane. This is similar to SIMD (like AVX), but with two key extensions:

The "32" is hardware: every NVIDIA GPU from Fermi to Blackwell has 32-thread warps. AMD's CDNA architecture uses 64-thread "wavefronts." That number creeps into every CUDA tuning constant.

Animated · SIMT, with and without divergence

32 threads of a warp ticking through instructions in lockstep. Toggle the divergent branch to see the warp serialise: half the threads execute the "if" body while the other half stalls (masked), then they swap. The bottom strip records the warp's executed-instructions-per-cycle.

SIMT execution · 32 threads, one program counter
Each column is a thread lane (0..31). Each row is a clock cycle. Green = active lane executing. Grey = masked lane stalled. Toggle "divergent" to fork even-indexed lanes onto a different branch.
cycle
active lanes
effective IPC
divergence cost

Animated · block scheduling onto SMs

A kernel launches B blocks. The runtime distributes them across K SMs, packing up to max blocks/SM in each. As blocks finish, queued ones slide in. Notice how a small B leaves SMs idle (under-utilisation) while a large B needs multiple waves.

Block scheduling · grid → SMs
Top: queue of blocks waiting for an SM slot. Bottom: SM lattice; each SM has slots that fill with coloured blocks (one colour per wave). Click "step" to advance simulated time.
time tick
resident blocks
queued blocks
SM utilisation

3D · SM lattice, isometric

Each SM is a 3D box; inside it, warp slots hold 32 threads represented as dots. Click an SM to "schedule" a kernel onto it: warps fill its slots one at a time, in waves. Hopper's H100 has 132 SMs — we show 16 for legibility.

SM lattice · click to schedule
Each box is one SM. Inside: 4 warp scheduler partitions × N slots. Each slot is a warp = 32 thread dots. Yellow edges show NoC connections; warp dots fade in as warps become resident.
selected SM
— click one —
warps resident
threads resident
total in-flight

Threads per block, blocks per grid — choosing them

You pick the launch shape. A common pattern:

// add two vectors of N elements
int threads_per_block = 256;       // good default — 8 warps
int blocks = (N + threads_per_block - 1) / threads_per_block;
add_kernel<<<blocks, threads_per_block>>>(x, y, out, N);

Some rules of thumb:

Latency hiding — why warps exist

HBM load takes ~400–800 cycles. ALU operations take 1–4. If a warp issues a load, it can't proceed for ~hundreds of cycles. But each SM holds 1–4 dozen warps resident; while warp A waits on memory, the scheduler runs an instruction from warp B, then C, then D. By the time we cycle back to A, the load might be done.

This is latency hiding via warp swapping. It's why having many warps per SM (lesson 27's "occupancy") matters: more in-flight warps = more work to swap to while any one warp waits. Without enough warps, the SM stalls and your kernel becomes latency-bound.

Why this differs from CPU multithreading
A CPU thread switch costs hundreds of cycles (save registers, swap state). A GPU warp switch is free — every resident warp already has its own register file allocated. The scheduler picks a ready warp every clock. That's what makes "32 threads × N warps × 132 SMs" of in-flight parallelism affordable.

Indexing in a kernel — the three coordinate systems

Every CUDA kernel knows its place in the hierarchy via these built-in variables:

VariableTypeMeaning
threadIdxuint3 (x,y,z)This thread's index within its block (0..blockDim-1)
blockIdxuint3This block's index within the grid (0..gridDim-1)
blockDimuint3Shape of each block (e.g. (256, 1, 1))
gridDimuint3Shape of the grid (e.g. (1024, 1, 1))

The canonical "global index" arithmetic:

int tid = blockIdx.x * blockDim.x + threadIdx.x;   // 0..N-1 across the whole grid
if (tid < N) {
    // do work for element tid
}

The bounds check (tid < N) exists because launches usually round up. If you have N=1000 elements and 256 threads/block, you launch 4 blocks = 1024 threads. The last 24 must do nothing.

The five-line minimum CUDA program

__global__ void add(float* x, float* y, float* z, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) z[i] = x[i] + y[i];
}

// host:
add<<<(N+255)/256, 256>>>(dx, dy, dz, N);

Lesson 22 walks through this end-to-end with the memcpy and error-handling around it. For now, just notice: the kernel is written from one thread's perspective; the launch specifies how many to run.

Streams & concurrent execution

Multiple kernels can run concurrently if they're on different streams (lesson 15's vocabulary). Same SMs, different work. Two cases where this matters:

Cooperative groups (recent additions)

Newer GPUs (CC 9.0+ on Hopper) add thread block clusters: groups of up to 16 blocks that share a "distributed shared memory" address space. This lets warps in different blocks read each other's SMEM via cluster-level barriers. Used by FlashAttention-3's warp-specialisation pattern. We won't write code with it in this series — just know it exists, and that the "block on one SM" hardware rule has softened on Hopper.

Interactive · grid layout calculator

Pick N (data size) and threads-per-block. The widget shows you the resulting grid, the resident-block count per SM (estimated), the number of warps in flight, and whether you're under-utilising or over-subscribing the device.

Grid & occupancy calculator
Toy model. Real occupancy also depends on register and SMEM usage per thread (lesson 27).
blocks
warps/block
resident blocks (cap)
grid waves
Takeaway
Threads run in lockstep groups of 32 (warps). Warps group into blocks that live on a single SM. Blocks group into a grid that the runtime spreads across all SMs. Latency hiding works because many warps are resident at once and the scheduler swaps between them for free. Every CUDA optimisation we'll meet (coalescing, SMEM, occupancy, tensor cores) is about using this lattice well.