gpu_kernel_serving / 06 · warps, divergence, sync lesson 6 / 17

Warps, divergence, sync

A warp's 32 threads run the same instruction together — so when they disagree, the warp serialises the branches and runs each one with some lanes masked off. The cost is real; the fix is to keep your branching decisions warp-aligned, or to use predication when you can't.

What "divergence" means, exactly

From lesson 20, a warp issues one instruction per cycle (per partition). All 32 threads in the warp run that instruction at the same time. So what happens when the source code is:

if (threadIdx.x < 16) {
    a = foo();      // path A
} else {
    a = bar();      // path B
}

Threads 0..15 want path A; threads 16..31 want path B. The warp can't do two things at once. The hardware solution:

  1. Run path A with threads 16..31 masked off (their execution units run, but their results don't get written back).
  2. Run path B with threads 0..15 masked off.
  3. Reconverge: both branches done, all threads resume in lockstep.

So the warp takes time_of_path_A + time_of_path_B instead of max. If both branches are equally costly, the warp runs at ~50% throughput. If three or four mutually-exclusive branches are taken, throughput drops by 3× or 4×.

divergent branch · warp serialises the paths threads 0..31 (one warp) step 1 · 16 threads execute "path A", others idle (masked) step 2 · 16 threads execute "path B", others idle (masked) · 2× wall-clock for the warp

Animated · the divergent branch, stepped phase-by-phase

The SVG above is a still. Below is the same scenario animated: 32 threads enter an if/else, the warp serialises into two sequential phases, and the inactive lanes sit idle (faded) in each. Use the predicate selector to drive different masks. Watch the wall-clock counter at the bottom.

Divergent branch · serialised phases
One warp = 32 lanes. The compiler emits two passes: lanes whose predicate is true run, the others idle. Total cycles ≈ A + B.
phase
active lanes
elapsed cycles
throughput vs ideal

When divergence is free

Divergence is only a problem when threads in the same warp diverge. If all 32 threads in a warp take the same branch, there's no divergence:

if (blockIdx.x == 0) {        // branch decided per block, not per thread
    ...
}
if (threadIdx.x / 32 == 0) {  // branch decided per warp, threads 0..31 agree
    ...
}
if (some_input[0] > 0) {      // value read from memory, but the same for every thread
    ...
}

All of these are warp-uniform — every thread in the warp takes the same path. The compiler emits a single conditional branch (a uniform BRA with a predicate that's identical across the warp); the warp executes one side, no serialisation.

Predication — the cure for short branches

For very short conditional bodies, the compiler may transform the branch into a predicated form:

// branch version
if (x < 0) y = 0; else y = x;

// predicated equivalent (both paths execute; selector picks)
y = (x < 0) ? 0 : x;

In PTX, the compiler emits the equivalent of:

cmp p, x, 0
@p   mov y, 0
@!p  mov y, x

Both instructions execute on all threads, but each is gated by its predicate. There's no branch, no serialisation. Predication runs in the same number of cycles as a non-branch sequence. The trade is that both sides of the conditional do work — fine if the work is tiny (one or two instructions), bad if either side is expensive.

You can hint to the compiler that you want predication via the ?: operator or by keeping bodies short. nvcc is reasonably aggressive about this.

__syncthreads() — the block-level barrier

From lesson 24 we used __syncthreads() to coordinate SMEM writes and reads. The rules:

The mistake-that-deadlocks:

if (threadIdx.x < 16) {
    foo();
    __syncthreads();   // ✗ threads 16..31 never reach this, deadlock
}

If you need divergent setup followed by a sync, structure the code so all threads run through the barrier:

if (threadIdx.x < 16) foo();
__syncthreads();       // ✓ all threads reach this

2D · thread-mask visualizer, predicate the warp by anything

Each cell below is one of the warp's 32 lanes. Pick a predicate and the active mask updates: bright lanes will execute the body, faded ones will be masked. The 32-bit hex mask in the corner is what __activemask() or __ballot_sync would actually return.

Active-mask explorer · 32 lanes
Bright = predicate true (lane executes); faded = predicate false (lane masked). The hex string is the 32-bit ballot.
active lanes
mask (hex)
useful work
divergent?

Warp-level primitives — sync without barriers

Threads within a warp don't need an explicit barrier — they're already in lockstep. But they often need to communicate with each other. The warp shuffle instructions let one thread read another thread's register directly, without going through SMEM:

InstructionWhat it does
__shfl_sync(mask, val, srcLane)Each thread receives val from lane srcLane
__shfl_xor_sync(mask, val, laneMask)Pair-swap by XORing the lane id
__shfl_up_sync(mask, val, delta)Each thread receives val from laneId - delta
__shfl_down_sync(mask, val, delta)Receives from laneId + delta

The mask is a bitmask of which lanes participate (usually 0xFFFFFFFF for all 32). The "sync" in the name reflects the fact that since Volta (CC 7.0), independent thread scheduling means you need to explicitly tell the hardware which lanes synchronise — the _sync variants are mandatory and the older __shfl is deprecated.

Warp shuffles are the workhorse of warp-level reductions (lesson 26 dives in). They're roughly 10× cheaper than going through SMEM for the same data movement.

Animated · __shfl_xor butterfly reduction

The most common warp-shuffle pattern is the butterfly reduction. Start with one value per lane; pair lanes via XOR at exponentially growing offsets (1, then 2, 4, 8, 16) and add. This is the canonical CUDA idiom — every CUB / cuBLAS warp reduction starts by swapping adjacent pairs (offset 1) and ends by swapping halves (offset 16). Unlike a tree reduction, every lane participates in every round — there is no halving of active lanes. After 5 rounds every lane holds the full warp sum (which is why butterfly is also called "all-reduce"). The animation shows the pairing arrows and the running register values lane by lane.

__shfl_xor_sync butterfly · 5 rounds (offsets 1, 2, 4, 8, 16) · all 32 lanes active throughout
At round k, lane l pairs with lane l ⊕ offset. Each lane receives the partner's value and adds. Every lane participates in every round; after the final offset=16 swap, every lane holds the same warp-wide sum.
round
offset (laneMask)
lane-0 value
all-lanes equal?

Independent thread scheduling — the subtle 7.0+ change

Pre-Volta, the 32 threads of a warp had a single program counter. Branches forced serialisation; "active mask" tracked which threads were live.

Volta (CC 7.0) added per-thread program counters. Threads can now diverge and reconverge independently. In practice this fixes some classic warp-level data-race bugs (where a programmer assumed lockstep), but it means that without explicit _sync calls, the compiler doesn't guarantee threads are at the same instruction.

If you read older CUDA tutorials and they use __shfl (no sync), __ballot (no sync), etc. — those are pre-7.0 idioms. Modern code uses the _sync variants with an explicit lane mask.

Two patterns that look divergent but aren't, and one that is

CodeDivergent?Why
if (input[i] > threshold) work();Usually yesinput[i] varies per thread
if (threadIdx.x % 2 == 0) a(); else b();YesEvery other thread takes different paths within one warp
if (threadIdx.x / 32 == 0) a(); else b();NoBranch boundary aligns with warps; warp 0 takes a(), warp 1 takes b()
if (blockIdx.x == 0) a(); else b();NoBranch is per-block, all threads in the block agree

When divergence is actually fine

Several common cases are fine despite divergence:

The boundary check is the most common "divergence" in real code and you should not worry about it. The cost is at most one warp per launch.

Interactive · divergence cost calculator

Pick a per-thread branching pattern. The widget shows how many distinct paths a warp takes, and the resulting throughput as a fraction of peak.

Warp serialisation · paths and throughput
Pick a pattern; the widget computes how many distinct path-buckets the 32 threads fall into, then estimates throughput = 1/buckets.
distinct paths
throughput (single warp)
verdict
Takeaway
Threads in a warp run the same instruction; divergent branches serialise the paths. Keep branch decisions warp-uniform when you can (decide per block or per warp, not per thread). For short branches, predicate. __syncthreads() is the block barrier; warp shuffles are the warp-level fast lane. Independent thread scheduling on Volta+ means always use the _sync variants of warp primitives — the older __shfl/__ballot without sync are correctness traps.