gpu_kernel_serving / 04 · coalesced access lesson 4 / 17

Coalesced memory access

A warp's 32 threads issue 32 loads. If the addresses line up on a 128-byte boundary, one memory transaction serves them all. If they don't, you pay up to 32× as many. This is the single largest beginner-CUDA performance trap.

The hardware fact

HBM doesn't read at byte granularity. The smallest unit the memory controller reads is a cache line, typically 128 bytes (= 32 contiguous floats, or 64 bf16 values). Read one byte and you pay for the whole line — the rest goes into L2 in case anyone wants it.

Now layer on the warp. A warp's 32 threads each issue a load. The memory controller groups consecutive same-warp loads into memory transactions. The rule:

This is "coalescing." A warp's 32-float load takes 1 transaction if the threads read consecutive floats; it takes 32 transactions if they read 32 floats spread across the address space.

coalesced (1 transaction) vs strided (32 transactions) coalesced — threads 0..31 read addresses 0..127: cache line · 128 bytes · 1 transaction services all 32 threads strided — threads 0..31 read addresses 0, 128, 256, 384, …: 32 cache lines · 32 separate transactions · 32× the bandwidth wasted

Animated · 32 threads light up a 128-byte line

The warp issues 32 loads to addresses stride · threadIdx. Play to watch them fire one by one, painting cells on the memory line. Toggle stride to see the warp scatter across multiple cache lines — transaction count multiplies, bandwidth divides.

Warp memory access · coalesced vs strided
Each thread (0..31) fires a load. Cells light when touched. Sectors with at least one access turn solid (paid for); unused parts of a touched sector are wasted bandwidth.
threads fired
transactions
bytes requested
efficiency

2D · the HBM strip and its 128-byte sectors

A wide horizontal strip representing HBM addresses. The warp's 32 reads pin down to specific sectors. Sectors are colour-coded by occupancy: green = fully used, orange = partially used (wasted bandwidth), grey = untouched.

HBM transaction map · sectors hit by one warp
Strip = first N sectors of HBM (128 B each). Threads' addresses plotted as ticks; sectors classified by fraction of bytes actually requested.
sectors hit
fully used
partial / wasted
bandwidth eff

2D · bandwidth utilisation as you slide stride

One knob — stride — and a single number: the percentage of HBM peak bandwidth your kernel achieves. The classic curve: stride 1 ≈ 100%, stride 2 ≈ 50%, stride 4 ≈ 25%, stride 32 ≈ 3%. Move it and watch the bar.

Effective bandwidth · the cost of stride
Single warp; load access_bytes per thread. Eff = bytes_useful / bytes_transferred (128B per sector touched).
transactions
efficiency
effective BW
verdict

The canonical coalesced kernel

The vector add of lesson 22 is already coalesced:

int i = blockIdx.x * blockDim.x + threadIdx.x;
z[i] = x[i] + y[i];

Why? Because consecutive threads access consecutive elements. Thread 0 reads x[0], thread 1 reads x[1], …, thread 31 reads x[31]. Addresses 0, 4, 8, …, 124 (for fp32). All within a single 128-byte line. One transaction.

The pattern: arr[threadIdx.x] or arr[blockIdx.x * blockDim.x + threadIdx.x] is coalesced when threadIdx is on the fast dimension of the array. The fast dimension of a C-style 2D array arr[M][N] is the second (the inner). So:

The matrix-transpose textbook case

The canonical "I broke coalescing without knowing" kernel:

__global__ void transpose_naive(const float* in, float* out, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[x * N + y] = in[y * N + x];
}

Look carefully:

The fix is the famous "transpose via shared memory" optimisation, which lesson 24 introduces. You load a tile coalesced, transpose it inside SMEM (where strides don't have the same cost), and write it coalesced.

The performance gap, numerically

For an elementwise kernel that should be HBM-bound:

PatternTransactions per warpEffective HBM bandwidth
Perfect coalescing1 per warp load~3.0 TB/s (90% of 3.35)
Stride 22~1.5 TB/s
Stride 32 (cache-line-spaced)32~100 GB/s (~30×)
Truly randomup to 32, with TLB missesworse

This is the single largest gap a beginner-CUDA programmer can leave on the floor. It explains why two kernels that "look the same" can have a 30× wall-clock difference.

How to diagnose

Three approaches in order of "first thing to do":

  1. Read your access pattern. Across a warp (variable: threadIdx.x from 0 to 31), do the addresses you compute form a contiguous run? If yes, coalesced.
  2. Profile with Nsight Compute. The Memory Workload Analysis section reports "L2 transactions per access" — a number close to 1 is good; close to 32 is bad. l1tex__t_sector_hit_rate and similar metrics expose the controller's view.
  3. Compare achieved bandwidth. Run nvprof --print-gpu-trace or Nsight. If your kernel transfers D bytes but takes D / (BW/32) seconds, it's strided.

The five common patterns and how they coalesce

PatternCodeCoalesced?
Linear traversalarr[blockIdx.x*blockDim.x + threadIdx.x]
Row of 2D, fast dimarr[row][threadIdx.x]
Column of 2D, fast dimarr[threadIdx.x][col]✗ stride N
Padded 2D (pad=1)arr[row][threadIdx.x + 1]✓ (with one wasted byte at line boundary)
Vectorised load (float4)((float4*)arr)[i]✓ — 16 bytes per thread, 4× wider

That last row is the vector-load optimisation. Casting to float4* lets each thread load 16 bytes in one instruction. The warp issues a 32 × 16 = 512-byte load — which the controller still services as 4 cache-line transactions (the same count as 4 separate coalesced fp32 loads would take). The win isn't fewer transactions vs a coalesced scalar load (it's the same); it's fewer load instructions — 1 vector load instead of 4 scalar loads. Less issue pressure, more compute slots between loads. Vector loads do not fix uncoalesced access patterns — those still cost the controller the same scattered transactions.

What "stride 2" looks like in practice

The "interleaved real/imag" trap: if you have a complex array stored as float arr[N*2] with real at even indices and imag at odd, and a thread reads only the real parts, the access pattern arr[2*i] is stride 2 across the warp. Two transactions per warp instead of one. The fix is to separate the arrays (struct-of-arrays vs array-of-structs).

The structural lesson — SoA vs AoS

Consider: which is faster on the GPU?

// Array of structs
struct Particle { float x, y, z, vx, vy, vz; };
Particle p[N];
// Thread i reads p[i].x → strided by 6 floats

// vs Struct of arrays
struct Particles { float x[N], y[N], z[N], vx[N], vy[N], vz[N]; };
Particles ps;
// Thread i reads ps.x[i] → coalesced

The SoA layout wins on GPUs every time. CPU programmers often default to AoS (better cache locality for one element); GPU programmers default to SoA (better coalescing across a warp). Both are right for their hardware.

Why writes deserve special care

Global stores on Hopper skip L1 and go through L2 (which is write-back, so a coalesced store typically lands in L2 just like a read would). The coalescing rules are the same — uncoalesced writes mean separate transactions for each cache line. The cost gap to reads is smaller than people often think; the structural rule is what matters: 32 threads writing to consecutive addresses = 1 transaction.

So when you have a choice between "coalesced read + strided write" and "strided read + coalesced write", prefer the latter. The shared-memory transpose (lesson 24) buys you both.

Interactive · access-pattern visualizer

Pick a stride. The widget plots which cache lines a warp touches and reports the resulting transaction count and effective bandwidth fraction.

Warp access pattern · cache-line transactions
32 threads in a warp, each loading access_size bytes at offset stride · threadIdx.x. Cells = 4-byte slots. Highlighted = touched by at least one thread.
cache lines touched
bytes requested
bytes transferred
efficiency
Takeaway
Memory loads happen at cache-line granularity (128 bytes). A warp's 32 threads ideally read 32 consecutive elements, hitting 1 line, 1 transaction. Any stride larger than the element size scatters the warp across multiple lines, multiplying transactions. The fix is structural — Struct-of-Arrays, vector loads, transpose-via-SMEM. Get this right and your kernels run at memory-bandwidth-roofline speed.