Coalesced memory access

A warp's 32 threads issue 32 loads. If the addresses line up on a 128-byte boundary, one memory transaction serves them all. If they don't, you pay up to 32× as many. This is the single largest beginner-CUDA performance trap.

The hardware fact

HBM doesn't read at byte granularity. The smallest unit the memory controller reads is a cache line, typically 128 bytes (= 32 contiguous floats, or 64 bf16 values). Read one byte and you pay for the whole line — the rest goes into L2 in case anyone wants it.

Now layer on the warp. A warp's 32 threads each issue a load. The memory controller groups consecutive same-warp loads into memory transactions. The rule:

If all 32 threads' addresses fall inside one 128-byte cache line → 1 transaction, peak bandwidth.
If they spread across K distinct cache lines → K transactions, bandwidth divided by K.

This is "coalescing." A warp's 32-float load takes 1 transaction if the threads read consecutive floats; it takes 32 transactions if they read 32 floats spread across the address space.

Animated · 32 threads light up a 128-byte line

The warp issues 32 loads to addresses stride · threadIdx. Play to watch them fire one by one, painting cells on the memory line. Toggle stride to see the warp scatter across multiple cache lines — transaction count multiplies, bandwidth divides.

2D · the HBM strip and its 128-byte sectors

A wide horizontal strip representing HBM addresses. The warp's 32 reads pin down to specific sectors. Sectors are colour-coded by occupancy: green = fully used, orange = partially used (wasted bandwidth), grey = untouched.

2D · bandwidth utilisation as you slide stride

One knob — stride — and a single number: the percentage of HBM peak bandwidth your kernel achieves. The classic curve: stride 1 ≈ 100%, stride 2 ≈ 50%, stride 4 ≈ 25%, stride 32 ≈ 3%. Move it and watch the bar.

The canonical coalesced kernel

The vector add of lesson 22 is already coalesced:

int i = blockIdx.x * blockDim.x + threadIdx.x;
z[i] = x[i] + y[i];

Why? Because consecutive threads access consecutive elements. Thread 0 reads x[0], thread 1 reads x[1], …, thread 31 reads x[31]. Addresses 0, 4, 8, …, 124 (for fp32). All within a single 128-byte line. One transaction.

The pattern: arr[threadIdx.x] or arr[blockIdx.x * blockDim.x + threadIdx.x] is coalesced when threadIdx is on the fast dimension of the array. The fast dimension of a C-style 2D array arr[M][N] is the second (the inner). So:

arr[row][threadIdx.x] with row fixed across a warp → coalesced.
arr[threadIdx.x][col] with col fixed across a warp → strided by N, bad.

The matrix-transpose textbook case

The canonical "I broke coalescing without knowing" kernel:

__global__ void transpose_naive(const float* in, float* out, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[x * N + y] = in[y * N + x];
}

Look carefully:

Read: in[y * N + x]. Within a warp, x varies (threadIdx.x), y is fixed → consecutive memory addresses → coalesced read. Good.
Write: out[x * N + y]. Within a warp, x varies (threadIdx.x), y is fixed → addresses are 0*N+y, 1*N+y, 2*N+y, ... → 32 separate cache lines → strided write. Bad.

The fix is the famous "transpose via shared memory" optimisation, which lesson 24 introduces. You load a tile coalesced, transpose it inside SMEM (where strides don't have the same cost), and write it coalesced.

The performance gap, numerically

For an elementwise kernel that should be HBM-bound:

Pattern	Transactions per warp	Effective HBM bandwidth
Perfect coalescing	1 per warp load	~3.0 TB/s (90% of 3.35)
Stride 2	2	~1.5 TB/s
Stride 32 (cache-line-spaced)	32	~100 GB/s (~30×)
Truly random	up to 32, with TLB misses	worse

This is the single largest gap a beginner-CUDA programmer can leave on the floor. It explains why two kernels that "look the same" can have a 30× wall-clock difference.

How to diagnose

Three approaches in order of "first thing to do":

Read your access pattern. Across a warp (variable: threadIdx.x from 0 to 31), do the addresses you compute form a contiguous run? If yes, coalesced.
Profile with Nsight Compute. The Memory Workload Analysis section reports "L2 transactions per access" — a number close to 1 is good; close to 32 is bad. l1tex__t_sector_hit_rate and similar metrics expose the controller's view.
Compare achieved bandwidth. Run nvprof --print-gpu-trace or Nsight. If your kernel transfers D bytes but takes D / (BW/32) seconds, it's strided.

The five common patterns and how they coalesce

Pattern	Code	Coalesced?
Linear traversal	`arr[blockIdx.x*blockDim.x + threadIdx.x]`	✓
Row of 2D, fast dim	`arr[row][threadIdx.x]`	✓
Column of 2D, fast dim	`arr[threadIdx.x][col]`	✗ stride N
Padded 2D (pad=1)	`arr[row][threadIdx.x + 1]`	✓ (with one wasted byte at line boundary)
Vectorised load (float4)	`((float4*)arr)[i]`	✓ — 16 bytes per thread, 4× wider

That last row is the vector-load optimisation. Casting to float4* lets each thread load 16 bytes in one instruction. The warp issues a 32 × 16 = 512-byte load — which the controller still services as 4 cache-line transactions (the same count as 4 separate coalesced fp32 loads would take). The win isn't fewer transactions vs a coalesced scalar load (it's the same); it's fewer load instructions — 1 vector load instead of 4 scalar loads. Less issue pressure, more compute slots between loads. Vector loads do not fix uncoalesced access patterns — those still cost the controller the same scattered transactions.

What "stride 2" looks like in practice

The "interleaved real/imag" trap: if you have a complex array stored as float arr[N*2] with real at even indices and imag at odd, and a thread reads only the real parts, the access pattern arr[2*i] is stride 2 across the warp. Two transactions per warp instead of one. The fix is to separate the arrays (struct-of-arrays vs array-of-structs).

The structural lesson — SoA vs AoS

Consider: which is faster on the GPU?

// Array of structs
struct Particle { float x, y, z, vx, vy, vz; };
Particle p[N];
// Thread i reads p[i].x → strided by 6 floats

// vs Struct of arrays
struct Particles { float x[N], y[N], z[N], vx[N], vy[N], vz[N]; };
Particles ps;
// Thread i reads ps.x[i] → coalesced

The SoA layout wins on GPUs every time. CPU programmers often default to AoS (better cache locality for one element); GPU programmers default to SoA (better coalescing across a warp). Both are right for their hardware.

Why writes deserve special care

Global stores on Hopper skip L1 and go through L2 (which is write-back, so a coalesced store typically lands in L2 just like a read would). The coalescing rules are the same — uncoalesced writes mean separate transactions for each cache line. The cost gap to reads is smaller than people often think; the structural rule is what matters: 32 threads writing to consecutive addresses = 1 transaction.

So when you have a choice between "coalesced read + strided write" and "strided read + coalesced write", prefer the latter. The shared-memory transpose (lesson 24) buys you both.

Interactive · access-pattern visualizer

Pick a stride. The widget plots which cache lines a warp touches and reports the resulting transaction count and effective bandwidth fraction.

Takeaway

Memory loads happen at cache-line granularity (128 bytes). A warp's 32 threads ideally read 32 consecutive elements, hitting 1 line, 1 transaction. Any stride larger than the element size scatters the warp across multiple lines, multiplying transactions. The fix is structural — Struct-of-Arrays, vector loads, transpose-via-SMEM. Get this right and your kernels run at memory-bandwidth-roofline speed.