Coalesced memory access
A warp's 32 threads issue 32 loads. If the addresses line up on a 128-byte boundary, one memory transaction serves them all. If they don't, you pay up to 32× as many. This is the single largest beginner-CUDA performance trap.
The hardware fact
HBM doesn't read at byte granularity. The smallest unit the memory controller reads is a cache line, typically 128 bytes (= 32 contiguous floats, or 64 bf16 values). Read one byte and you pay for the whole line — the rest goes into L2 in case anyone wants it.
Now layer on the warp. A warp's 32 threads each issue a load. The memory controller groups consecutive same-warp loads into memory transactions. The rule:
- If all 32 threads' addresses fall inside one 128-byte cache line → 1 transaction, peak bandwidth.
- If they spread across K distinct cache lines → K transactions, bandwidth divided by K.
This is "coalescing." A warp's 32-float load takes 1 transaction if the threads read consecutive floats; it takes 32 transactions if they read 32 floats spread across the address space.
Animated · 32 threads light up a 128-byte line
The warp issues 32 loads to addresses stride · threadIdx. Play to watch them fire one by one, painting cells on the memory line. Toggle stride to see the warp scatter across multiple cache lines — transaction count multiplies, bandwidth divides.
2D · the HBM strip and its 128-byte sectors
A wide horizontal strip representing HBM addresses. The warp's 32 reads pin down to specific sectors. Sectors are colour-coded by occupancy: green = fully used, orange = partially used (wasted bandwidth), grey = untouched.
2D · bandwidth utilisation as you slide stride
One knob — stride — and a single number: the percentage of HBM peak bandwidth your kernel achieves. The classic curve: stride 1 ≈ 100%, stride 2 ≈ 50%, stride 4 ≈ 25%, stride 32 ≈ 3%. Move it and watch the bar.
The canonical coalesced kernel
The vector add of lesson 22 is already coalesced:
int i = blockIdx.x * blockDim.x + threadIdx.x;
z[i] = x[i] + y[i];
Why? Because consecutive threads access consecutive elements. Thread 0 reads x[0], thread 1 reads x[1], …, thread 31 reads x[31]. Addresses 0, 4, 8, …, 124 (for fp32). All within a single 128-byte line. One transaction.
The pattern: arr[threadIdx.x] or arr[blockIdx.x * blockDim.x + threadIdx.x] is coalesced when threadIdx is on the fast dimension of the array. The fast dimension of a C-style 2D array arr[M][N] is the second (the inner). So:
arr[row][threadIdx.x]with row fixed across a warp → coalesced.arr[threadIdx.x][col]with col fixed across a warp → strided by N, bad.
The matrix-transpose textbook case
The canonical "I broke coalescing without knowing" kernel:
__global__ void transpose_naive(const float* in, float* out, int N) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
out[x * N + y] = in[y * N + x];
}
Look carefully:
- Read:
in[y * N + x]. Within a warp, x varies (threadIdx.x), y is fixed → consecutive memory addresses → coalesced read. Good. - Write:
out[x * N + y]. Within a warp, x varies (threadIdx.x), y is fixed → addresses are0*N+y, 1*N+y, 2*N+y, ...→ 32 separate cache lines → strided write. Bad.
The fix is the famous "transpose via shared memory" optimisation, which lesson 24 introduces. You load a tile coalesced, transpose it inside SMEM (where strides don't have the same cost), and write it coalesced.
The performance gap, numerically
For an elementwise kernel that should be HBM-bound:
| Pattern | Transactions per warp | Effective HBM bandwidth |
|---|---|---|
| Perfect coalescing | 1 per warp load | ~3.0 TB/s (90% of 3.35) |
| Stride 2 | 2 | ~1.5 TB/s |
| Stride 32 (cache-line-spaced) | 32 | ~100 GB/s (~30×) |
| Truly random | up to 32, with TLB misses | worse |
This is the single largest gap a beginner-CUDA programmer can leave on the floor. It explains why two kernels that "look the same" can have a 30× wall-clock difference.
How to diagnose
Three approaches in order of "first thing to do":
- Read your access pattern. Across a warp (variable:
threadIdx.xfrom 0 to 31), do the addresses you compute form a contiguous run? If yes, coalesced. - Profile with Nsight Compute. The Memory Workload Analysis section reports "L2 transactions per access" — a number close to 1 is good; close to 32 is bad.
l1tex__t_sector_hit_rateand similar metrics expose the controller's view. - Compare achieved bandwidth. Run
nvprof --print-gpu-traceor Nsight. If your kernel transfers D bytes but takes D / (BW/32) seconds, it's strided.
The five common patterns and how they coalesce
| Pattern | Code | Coalesced? |
|---|---|---|
| Linear traversal | arr[blockIdx.x*blockDim.x + threadIdx.x] | ✓ |
| Row of 2D, fast dim | arr[row][threadIdx.x] | ✓ |
| Column of 2D, fast dim | arr[threadIdx.x][col] | ✗ stride N |
| Padded 2D (pad=1) | arr[row][threadIdx.x + 1] | ✓ (with one wasted byte at line boundary) |
| Vectorised load (float4) | ((float4*)arr)[i] | ✓ — 16 bytes per thread, 4× wider |
That last row is the vector-load optimisation. Casting to float4* lets each thread load 16 bytes in one instruction. The warp issues a 32 × 16 = 512-byte load — which the controller still services as 4 cache-line transactions (the same count as 4 separate coalesced fp32 loads would take). The win isn't fewer transactions vs a coalesced scalar load (it's the same); it's fewer load instructions — 1 vector load instead of 4 scalar loads. Less issue pressure, more compute slots between loads. Vector loads do not fix uncoalesced access patterns — those still cost the controller the same scattered transactions.
What "stride 2" looks like in practice
The "interleaved real/imag" trap: if you have a complex array stored as float arr[N*2] with real at even indices and imag at odd, and a thread reads only the real parts, the access pattern arr[2*i] is stride 2 across the warp. Two transactions per warp instead of one. The fix is to separate the arrays (struct-of-arrays vs array-of-structs).
The structural lesson — SoA vs AoS
Consider: which is faster on the GPU?
// Array of structs
struct Particle { float x, y, z, vx, vy, vz; };
Particle p[N];
// Thread i reads p[i].x → strided by 6 floats
// vs Struct of arrays
struct Particles { float x[N], y[N], z[N], vx[N], vy[N], vz[N]; };
Particles ps;
// Thread i reads ps.x[i] → coalesced
The SoA layout wins on GPUs every time. CPU programmers often default to AoS (better cache locality for one element); GPU programmers default to SoA (better coalescing across a warp). Both are right for their hardware.
Why writes deserve special care
Global stores on Hopper skip L1 and go through L2 (which is write-back, so a coalesced store typically lands in L2 just like a read would). The coalescing rules are the same — uncoalesced writes mean separate transactions for each cache line. The cost gap to reads is smaller than people often think; the structural rule is what matters: 32 threads writing to consecutive addresses = 1 transaction.
So when you have a choice between "coalesced read + strided write" and "strided read + coalesced write", prefer the latter. The shared-memory transpose (lesson 24) buys you both.
Interactive · access-pattern visualizer
Pick a stride. The widget plots which cache lines a warp touches and reports the resulting transaction count and effective bandwidth fraction.