GPU execution model — SMs, warps, blocks, grids
A GPU is not "a bunch of cores." It's a hardware lattice with a specific shape, and every CUDA programming concept maps onto a specific piece of that lattice. Get the mapping right and the rest of CUDA is mostly addressing arithmetic.
The hardware, named
An H100 GPU contains 132 streaming multiprocessors (SMs). Each SM is a small, mostly-independent processor with:
- 4 partitions (called "warp schedulers"), each with its own register file and execution units.
- 256 KB of register file total (~65k 32-bit registers).
- 228 KB of configurable shared memory / L1 cache.
- Tensor cores: 4 per SM (one per partition), each capable of a small matmul per cycle.
- "CUDA cores": 128 per SM (32 per partition) — fp32 ALUs that handle non-tensor-core math.
So the H100 has 16,896 fp32 ALUs (132 × 128) and 528 tensor cores (132 × 4). The marketing number "16,896 CUDA cores" comes from this multiplication. Don't think of the GPU as "a bunch of cores" — think of it as 132 SMs that each schedule work in 32-thread groups.
The software-to-hardware mapping
CUDA exposes four levels:
| Software concept | Maps to hardware | Size |
|---|---|---|
| Thread | One ALU/tensor-core lane in an SM partition | 1 |
| Warp (implicit) | 32 threads that execute the same instruction together (SIMT) | 32 threads |
| Thread block (a.k.a. CTA) | 1–32 warps on one SM | ≤ 1024 threads |
| Grid | Many blocks, scheduled across all SMs | up to billions of blocks |
When you launch a kernel k<<<grid, block>>>(...), the runtime distributes the grid's blocks to SMs. Each SM holds several blocks resident simultaneously (limited by registers/SMEM); blocks not yet resident wait in a queue. Once a block is assigned to an SM, it stays there until it completes — no migration.
SIMT — the source of warp-level performance
"Single Instruction Multiple Threads": the 32 threads of a warp execute the same instruction at the same time, each on its own data lane. This is similar to SIMD (like AVX), but with two key extensions:
- Each thread has its own state. Independent registers, independent program counter (since CC 7.0).
- Threads can take divergent branches — but the warp serialises them: it runs all "if" lanes with the other lanes masked off, then all "else" lanes. This is the cost lesson 25 dwells on.
The "32" is hardware: every NVIDIA GPU from Fermi to Blackwell has 32-thread warps. AMD's CDNA architecture uses 64-thread "wavefronts." That number creeps into every CUDA tuning constant.
Animated · SIMT, with and without divergence
32 threads of a warp ticking through instructions in lockstep. Toggle the divergent branch to see the warp serialise: half the threads execute the "if" body while the other half stalls (masked), then they swap. The bottom strip records the warp's executed-instructions-per-cycle.
Animated · block scheduling onto SMs
A kernel launches B blocks. The runtime distributes them across K SMs, packing up to max blocks/SM in each. As blocks finish, queued ones slide in. Notice how a small B leaves SMs idle (under-utilisation) while a large B needs multiple waves.
3D · SM lattice, isometric
Each SM is a 3D box; inside it, warp slots hold 32 threads represented as dots. Click an SM to "schedule" a kernel onto it: warps fill its slots one at a time, in waves. Hopper's H100 has 132 SMs — we show 16 for legibility.
Threads per block, blocks per grid — choosing them
You pick the launch shape. A common pattern:
// add two vectors of N elements
int threads_per_block = 256; // good default — 8 warps
int blocks = (N + threads_per_block - 1) / threads_per_block;
add_kernel<<<blocks, threads_per_block>>>(x, y, out, N);
Some rules of thumb:
- Threads per block: a multiple of 32. Anything else wastes a partial warp.
- 128 to 512 threads per block for most kernels. 128 = 4 warps (more granular); 256 = 8 warps (most defaults); 512+ = fewer blocks per SM, sometimes faster for shared-memory-heavy kernels.
- Blocks per grid: enough to fill all SMs several times over. H100 has 132 SMs; 132 × 4 (typical resident blocks) = ~500 blocks minimum to fully fill the GPU. More is fine; the scheduler queues them.
Latency hiding — why warps exist
HBM load takes ~400–800 cycles. ALU operations take 1–4. If a warp issues a load, it can't proceed for ~hundreds of cycles. But each SM holds 1–4 dozen warps resident; while warp A waits on memory, the scheduler runs an instruction from warp B, then C, then D. By the time we cycle back to A, the load might be done.
This is latency hiding via warp swapping. It's why having many warps per SM (lesson 27's "occupancy") matters: more in-flight warps = more work to swap to while any one warp waits. Without enough warps, the SM stalls and your kernel becomes latency-bound.
Indexing in a kernel — the three coordinate systems
Every CUDA kernel knows its place in the hierarchy via these built-in variables:
| Variable | Type | Meaning |
|---|---|---|
threadIdx | uint3 (x,y,z) | This thread's index within its block (0..blockDim-1) |
blockIdx | uint3 | This block's index within the grid (0..gridDim-1) |
blockDim | uint3 | Shape of each block (e.g. (256, 1, 1)) |
gridDim | uint3 | Shape of the grid (e.g. (1024, 1, 1)) |
The canonical "global index" arithmetic:
int tid = blockIdx.x * blockDim.x + threadIdx.x; // 0..N-1 across the whole grid
if (tid < N) {
// do work for element tid
}
The bounds check (tid < N) exists because launches usually round up. If you have N=1000 elements and 256 threads/block, you launch 4 blocks = 1024 threads. The last 24 must do nothing.
The five-line minimum CUDA program
__global__ void add(float* x, float* y, float* z, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) z[i] = x[i] + y[i];
}
// host:
add<<<(N+255)/256, 256>>>(dx, dy, dz, N);
Lesson 22 walks through this end-to-end with the memcpy and error-handling around it. For now, just notice: the kernel is written from one thread's perspective; the launch specifies how many to run.
Streams & concurrent execution
Multiple kernels can run concurrently if they're on different streams (lesson 15's vocabulary). Same SMs, different work. Two cases where this matters:
- Compute / copy overlap. Issue a kernel on stream A while a
cudaMemcpyAsyncruns on stream B. The DMA engine and the SMs work simultaneously. - Compute / collective overlap. A kernel on stream A while NCCL's AllReduce kernel runs on stream B. This is exactly the DDP overlap trick from lesson 04.
Cooperative groups (recent additions)
Newer GPUs (CC 9.0+ on Hopper) add thread block clusters: groups of up to 16 blocks that share a "distributed shared memory" address space. This lets warps in different blocks read each other's SMEM via cluster-level barriers. Used by FlashAttention-3's warp-specialisation pattern. We won't write code with it in this series — just know it exists, and that the "block on one SM" hardware rule has softened on Hopper.
Interactive · grid layout calculator
Pick N (data size) and threads-per-block. The widget shows you the resulting grid, the resident-block count per SM (estimated), the number of warps in flight, and whether you're under-utilising or over-subscribing the device.