Warps, divergence, sync
A warp's 32 threads run the same instruction together — so when they disagree, the warp serialises the branches and runs each one with some lanes masked off. The cost is real; the fix is to keep your branching decisions warp-aligned, or to use predication when you can't.
What "divergence" means, exactly
From lesson 20, a warp issues one instruction per cycle (per partition). All 32 threads in the warp run that instruction at the same time. So what happens when the source code is:
if (threadIdx.x < 16) {
a = foo(); // path A
} else {
a = bar(); // path B
}
Threads 0..15 want path A; threads 16..31 want path B. The warp can't do two things at once. The hardware solution:
- Run path A with threads 16..31 masked off (their execution units run, but their results don't get written back).
- Run path B with threads 0..15 masked off.
- Reconverge: both branches done, all threads resume in lockstep.
So the warp takes time_of_path_A + time_of_path_B instead of max. If both branches are equally costly, the warp runs at ~50% throughput. If three or four mutually-exclusive branches are taken, throughput drops by 3× or 4×.
Animated · the divergent branch, stepped phase-by-phase
The SVG above is a still. Below is the same scenario animated: 32 threads enter an if/else, the warp serialises into two sequential phases, and the inactive lanes sit idle (faded) in each. Use the predicate selector to drive different masks. Watch the wall-clock counter at the bottom.
When divergence is free
Divergence is only a problem when threads in the same warp diverge. If all 32 threads in a warp take the same branch, there's no divergence:
if (blockIdx.x == 0) { // branch decided per block, not per thread
...
}
if (threadIdx.x / 32 == 0) { // branch decided per warp, threads 0..31 agree
...
}
if (some_input[0] > 0) { // value read from memory, but the same for every thread
...
}
All of these are warp-uniform — every thread in the warp takes the same path. The compiler emits a single conditional branch (a uniform BRA with a predicate that's identical across the warp); the warp executes one side, no serialisation.
Predication — the cure for short branches
For very short conditional bodies, the compiler may transform the branch into a predicated form:
// branch version
if (x < 0) y = 0; else y = x;
// predicated equivalent (both paths execute; selector picks)
y = (x < 0) ? 0 : x;
In PTX, the compiler emits the equivalent of:
cmp p, x, 0
@p mov y, 0
@!p mov y, x
Both instructions execute on all threads, but each is gated by its predicate. There's no branch, no serialisation. Predication runs in the same number of cycles as a non-branch sequence. The trade is that both sides of the conditional do work — fine if the work is tiny (one or two instructions), bad if either side is expensive.
You can hint to the compiler that you want predication via the ?: operator or by keeping bodies short. nvcc is reasonably aggressive about this.
__syncthreads() — the block-level barrier
From lesson 24 we used __syncthreads() to coordinate SMEM writes and reads. The rules:
- All threads in the block must reach it. If some threads exit the kernel before others reach the barrier, the kernel deadlocks. So never put
__syncthreads()inside a divergent branch where some threads might not enter. - It's cheap. Tens of cycles per call on H100. Use it freely when you need it; just don't put it inside loops where it's not needed.
- It synchronises only one block. No cross-block barrier exists in classic CUDA (the cooperative-groups grid barrier is rare and slow). Cross-block sync is "end the kernel and start a new one."
The mistake-that-deadlocks:
if (threadIdx.x < 16) {
foo();
__syncthreads(); // ✗ threads 16..31 never reach this, deadlock
}
If you need divergent setup followed by a sync, structure the code so all threads run through the barrier:
if (threadIdx.x < 16) foo();
__syncthreads(); // ✓ all threads reach this
2D · thread-mask visualizer, predicate the warp by anything
Each cell below is one of the warp's 32 lanes. Pick a predicate and the active mask updates: bright lanes will execute the body, faded ones will be masked. The 32-bit hex mask in the corner is what __activemask() or __ballot_sync would actually return.
Warp-level primitives — sync without barriers
Threads within a warp don't need an explicit barrier — they're already in lockstep. But they often need to communicate with each other. The warp shuffle instructions let one thread read another thread's register directly, without going through SMEM:
| Instruction | What it does |
|---|---|
__shfl_sync(mask, val, srcLane) | Each thread receives val from lane srcLane |
__shfl_xor_sync(mask, val, laneMask) | Pair-swap by XORing the lane id |
__shfl_up_sync(mask, val, delta) | Each thread receives val from laneId - delta |
__shfl_down_sync(mask, val, delta) | Receives from laneId + delta |
The mask is a bitmask of which lanes participate (usually 0xFFFFFFFF for all 32). The "sync" in the name reflects the fact that since Volta (CC 7.0), independent thread scheduling means you need to explicitly tell the hardware which lanes synchronise — the _sync variants are mandatory and the older __shfl is deprecated.
Warp shuffles are the workhorse of warp-level reductions (lesson 26 dives in). They're roughly 10× cheaper than going through SMEM for the same data movement.
Animated · __shfl_xor butterfly reduction
The most common warp-shuffle pattern is the butterfly reduction. Start with one value per lane; pair lanes via XOR at exponentially growing offsets (1, then 2, 4, 8, 16) and add. This is the canonical CUDA idiom — every CUB / cuBLAS warp reduction starts by swapping adjacent pairs (offset 1) and ends by swapping halves (offset 16). Unlike a tree reduction, every lane participates in every round — there is no halving of active lanes. After 5 rounds every lane holds the full warp sum (which is why butterfly is also called "all-reduce"). The animation shows the pairing arrows and the running register values lane by lane.
Independent thread scheduling — the subtle 7.0+ change
Pre-Volta, the 32 threads of a warp had a single program counter. Branches forced serialisation; "active mask" tracked which threads were live.
Volta (CC 7.0) added per-thread program counters. Threads can now diverge and reconverge independently. In practice this fixes some classic warp-level data-race bugs (where a programmer assumed lockstep), but it means that without explicit _sync calls, the compiler doesn't guarantee threads are at the same instruction.
If you read older CUDA tutorials and they use __shfl (no sync), __ballot (no sync), etc. — those are pre-7.0 idioms. Modern code uses the _sync variants with an explicit lane mask.
Two patterns that look divergent but aren't, and one that is
| Code | Divergent? | Why |
|---|---|---|
if (input[i] > threshold) work(); | Usually yes | input[i] varies per thread |
if (threadIdx.x % 2 == 0) a(); else b(); | Yes | Every other thread takes different paths within one warp |
if (threadIdx.x / 32 == 0) a(); else b(); | No | Branch boundary aligns with warps; warp 0 takes a(), warp 1 takes b() |
if (blockIdx.x == 0) a(); else b(); | No | Branch is per-block, all threads in the block agree |
When divergence is actually fine
Several common cases are fine despite divergence:
- Predicated short branches. The compiler turns them into predicated execution; no serialisation.
- Branches where one side is trivial. Even if 16 threads run "do nothing" and 16 run "do real work", the wall-clock is the same as the work side.
- Branches that determine kernel exit. Boundary-check guards (
if (i < N) work();) only matter for the last partial warp. For all other warps, all threads pass the check; no divergence.
The boundary check is the most common "divergence" in real code and you should not worry about it. The cost is at most one warp per launch.
Interactive · divergence cost calculator
Pick a per-thread branching pattern. The widget shows how many distinct paths a warp takes, and the resulting throughput as a fraction of peak.
__syncthreads() is the block barrier; warp shuffles are the warp-level fast lane. Independent thread scheduling on Volta+ means always use the _sync variants of warp primitives — the older __shfl/__ballot without sync are correctness traps.