Sequence and context parallel — sharding the time axis

TP shards the feature dim. PP shards the layer dim. Neither shards the sequence. As contexts grow past 32k tokens, the activations that scale with T (not d) become the bottleneck. Sequence parallel reclaims them cheaply; context parallel reclaims attention itself.

The activations TP doesn't reach

Inside a TP'd transformer block, the matmuls operate on (B, T, d) tensors sharded along d. After the AllReduce that closes each block, every rank has the full (B, T, d) output. Then LayerNorm, dropout, and the residual connection happen on that full tensor — TP doesn't help, because those ops don't have a "feature dim to shard."

Memory accounting per rank, around a TP block:

Inside the matmul: B · T · d / N floats per rank — TP-sharded. Small.
Around LayerNorm / dropout: B · T · d floats per rank — not sharded. Replicated.

For B=4, T=8192, d=8192, the unsharded chunk is ~537 MB per layer at 2 bytes per element (bf16/fp16). Across 80 layers, even with activation checkpointing it sums to gigabytes that don't shrink as you add TP ranks. Sequence parallel (SP, Korthikanti et al. 2022) fixes this for free.
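The arithmetic behind that figure can be checked with a short sketch. It assumes 2 bytes per element (bf16/fp16), which is what reproduces the ~537 MB number; the function names are illustrative and not taken from any framework.

```python
def replicated_activation_bytes(batch, seq_len, d_model, bytes_per_elem=2):
    """Bytes of one full (B, T, d) activation held on every TP rank."""
    return batch * seq_len * d_model * bytes_per_elem


def sp_sharded_activation_bytes(batch, seq_len, d_model, n_ranks, bytes_per_elem=2):
    """The same tensor once it is sharded along T across n_ranks."""
    return replicated_activation_bytes(batch, seq_len, d_model, bytes_per_elem) // n_ranks


full = replicated_activation_bytes(4, 8192, 8192)
print(f"replicated: {full / 1e6:.0f} MB per rank")
for n in (2, 4, 8):
    shard = sp_sharded_activation_bytes(4, 8192, 8192, n)
    print(f"sharded along T over {n} ranks: {shard / 1e6:.0f} MB per rank")
```

The replicated figure is independent of the TP degree, while the T-sharded figure falls as 1/N.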

SP — shard along the sequence axis

The trick is: at the LayerNorm/dropout boundary, shard the activation along T instead of d. Each rank holds B · T/N · d floats. To enter the next TP'd matmul (which needs every token, i.e. the full (B, T, d) input), AllGather along T. To exit a TP'd matmul, ReduceScatter along T.

Crucially, the combination (AllGather along T) + (ReduceScatter along T) moves exactly the same bytes as a plain AllReduce — the bandwidth cost is identical. SP doesn't add comm; it reshapes the existing AllReduce into two halves and puts the LayerNorm/dropout work inside the sharded region.

Equivalent in cost, free in memory. SP is almost always on in production TP setups.
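A toy simulation can make the equivalence concrete. The sketch below models each collective with plain Python lists (no real communication), uses a doubling as a stand-in for a per-token op such as LayerNorm or dropout, and assumes the standard ring-algorithm cost of (N - 1)/N of the buffer per rank for each of ReduceScatter and AllGather.

```python
def all_reduce(partials):
    """Every rank ends with the elementwise sum of all ranks' vectors."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]


def reduce_scatter(partials):
    """Rank r ends with the summed values of shard r only."""
    n = len(partials)
    shard = len(partials[0]) // n
    return [
        [sum(p[i] for p in partials) for i in range(r * shard, (r + 1) * shard)]
        for r in range(n)
    ]


def all_gather(shards):
    """Every rank ends with the concatenation of all shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


n_ranks, seq_len = 4, 8
# Each rank's partial matmul output, one value per token along T.
partials = [[float(r * 10 + t) for t in range(seq_len)] for r in range(n_ranks)]

# SP path: ReduceScatter, per-token op on T/N tokens, AllGather.
sharded = reduce_scatter(partials)
sharded = [[2.0 * x for x in s] for s in sharded]
rebuilt = all_gather(sharded)

# Plain TP path: AllReduce, then the same op on the full tensor.
reference = [[2.0 * x for x in full] for full in all_reduce(partials)]
same_result = rebuilt == reference

# Ring-algorithm volume sent per rank for a buffer of size S.
S = 1.0
ring_all_reduce = 2 * (n_ranks - 1) / n_ranks * S
ring_rs_plus_ag = (n_ranks - 1) / n_ranks * S + (n_ranks - 1) / n_ranks * S
print(same_result, ring_all_reduce == ring_rs_plus_ag)
```

Both paths give the same tensor and the same per-rank volume; the only difference is that the per-token op in the SP path touches T/N tokens instead of T.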

The attention problem at long context

SP shrinks the LayerNorm/dropout activations, but the attention matmul itself stays as O(B · T · h · d_k) activations per rank. For T = 1M tokens that's hundreds of gigabytes. FlashAttention doesn't change this: its tile-by-tile streaming computes the same final values and only avoids materialising the T×T attention matrix, so the full-length Q, K, V and output tensors still sit on each rank.

What we need is to shard the sequence for attention itself: each rank holds T/N tokens' worth of Q, K, V, and somehow computes the same full attention. The catch: each query at position i attends to all positions j ≤ i (causal), which live on different ranks.

The one identity Ring Attention needs — online softmax

Everything below rests on a single algebraic fact: softmax can be computed incrementally, one chunk of scores at a time, with an exact running result. Derive it once here and the rest of the lesson — and lesson 19's FlashAttention — is just bookkeeping.

Plain (stable) softmax over a row of scores s_1 … s_T subtracts the max for numerical safety:

m = max_j s_j, ℓ = Σ_j e^{s_j − m}, O = Σ_j e^{s_j − m} · v_j, attention = O / ℓ

The catch that seems to forbid sharding: m and ℓ need the whole row before you can normalise. Online softmax breaks that dependency. Process the scores in chunks, carrying three running statistics (m, ℓ, O). When a new chunk arrives with local max m_c:

New running max: m' = max(m, m_c).
Rescale the old state by α = e^{m − m'} ≤ 1 — this retroactively corrects every earlier term from "minus old max" to "minus new max".
Fold in the chunk: ℓ' = α·ℓ + Σ_{j∈chunk} e^{s_j − m'} and O' = α·O + Σ_{j∈chunk} e^{s_j − m'} v_j.

After the last chunk, O/ℓ is mathematically the same value a single-pass softmax would produce (equal up to floating-point rounding): the rescale α exactly undoes the fact that early chunks used a stale max. Two consequences fall out:

FlashAttention (lesson 19) is this recurrence applied within one GPU's SRAM, sweeping K/V tiles so the T×T score matrix never touches HBM.
Ring Attention is the same recurrence applied across ranks. Because the fold is associative and commutative (the order in which chunks are folded in doesn't change O/ℓ), each rank can fold in K/V chunks as they arrive over the network. That property is the entire reason the next section works.
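The recurrence can be sketched in a few lines of plain Python. This is a minimal illustration for a single query row with scalar values v_j; the scores, values and chunk size are made up for the example, and the empty state starts at m = −∞ so the first rescale factor is zero.

```python
import math


def softmax_attention(scores, values):
    """Single-pass stable softmax-weighted sum; needs the whole row."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def online_update(state, chunk_scores, chunk_values):
    """Fold one chunk into the running (m, l, O) state."""
    m, l, o = state
    m_new = max(m, max(chunk_scores))
    alpha = math.exp(m - m_new)  # rescale factor for the old state, <= 1
    weights = [math.exp(s - m_new) for s in chunk_scores]
    l_new = alpha * l + sum(weights)
    o_new = alpha * o + sum(w * v for w, v in zip(weights, chunk_values))
    return m_new, l_new, o_new


scores = [0.3, 2.0, -1.0, 4.5, 0.0, 3.2, -2.2, 1.1]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0, 0.0]

state = (-math.inf, 0.0, 0.0)  # empty state: exp(-inf) == 0 wipes it on first use
for start in range(0, len(scores), 3):  # chunks of 3, 3, 2
    state = online_update(state, scores[start:start + 3], values[start:start + 3])

m, l, o = state
online = o / l
full = softmax_attention(scores, values)
print(online, full)
```

The two printed numbers agree to within floating-point rounding, whatever chunk size is used.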

The "online-softmax accumulator" widget further down lets you scrub through the (m, ℓ, O) rescaling chunk by chunk.

Ring Attention (Liu et al. 2023)

The insight: FlashAttention's online softmax accumulator (m, ℓ, O) is associative, so you can extend a partial result by feeding in one more chunk of K, V at a time, stably updating the running max m, normaliser ℓ, and output O. So if rank i holds Q chunk i and K, V chunk i, it can:

Compute its local Q_i · K_i^T contribution to attention.
Send K_i, V_i to the next rank in a ring.
Receive K_{i-1}, V_{i-1} from the previous rank, compute the cross contribution, fold it into the online accumulator.
Repeat for N - 1 rounds: at the end every rank has attended to all K, V.
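The steps above can be simulated on one machine. The sketch below is a toy under stated assumptions: ranks are list indices, the ring rotation is a list shuffle instead of Send/Recv, there is no causal mask and no 1/√d_k scaling, and every helper name is made up for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fold(state, q, k_chunk, v_chunk):
    """Fold one K/V chunk into a single query's (m, l, O) accumulator."""
    m, l, o = state
    scores = [dot(q, k) for k in k_chunk]
    m_new = max(m, max(scores))
    alpha = math.exp(m - m_new)
    w = [math.exp(s - m_new) for s in scores]
    o_new = [alpha * oi + sum(wj * v[i] for wj, v in zip(w, v_chunk))
             for i, oi in enumerate(o)]
    return m_new, alpha * l + sum(w), o_new


def ring_attention(q_chunks, k_chunks, v_chunks):
    n, dim = len(q_chunks), len(v_chunks[0][0])
    # One accumulator per query token on each rank.
    states = [[(-math.inf, 0.0, [0.0] * dim) for _ in qs] for qs in q_chunks]
    held = list(zip(k_chunks, v_chunks))  # K/V chunk currently on each rank
    for step in range(n):
        for rank in range(n):
            k, v = held[rank]
            states[rank] = [fold(s, q, k, v)
                            for s, q in zip(states[rank], q_chunks[rank])]
        if step < n - 1:  # N - 1 rotations: rank r receives from rank r - 1
            held = [held[(rank - 1) % n] for rank in range(n)]
    return [[oi / l for oi in o] for rank_states in states
            for (_, l, o) in rank_states]


def full_attention(qs, ks, vs):
    out = []
    for q in qs:
        scores = [dot(q, k) for k in ks]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        out.append([sum(wj * v[i] for wj, v in zip(w, vs)) / sum(w)
                    for i in range(len(vs[0]))])
    return out


def flatten(chunks):
    return [row for chunk in chunks for row in chunk]


random.seed(0)
n_ranks, chunk, dim = 4, 2, 3
Q, K, V = ([[[random.uniform(-1, 1) for _ in range(dim)] for _ in range(chunk)]
            for _ in range(n_ranks)] for _ in range(3))

ring_out = ring_attention(Q, K, V)
ref_out = full_attention(flatten(Q), flatten(K), flatten(V))
max_err = max(abs(a - b) for r, f in zip(ring_out, ref_out) for a, b in zip(r, f))
print(max_err)
```

Each rank only ever holds one K/V chunk at a time, yet the concatenated outputs match single-device attention to within rounding.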

Communication: rank i sends and receives N - 1 chunks in total, each holding K and V for T/N tokens, i.e. B · T/N · h_kv · d_k · 2 elements (the 2 counts K and V). Crucially this is concurrent with compute: chunk i+1 can be transferring while chunk i is being processed. As long as comm ≤ compute per chunk, the cost is hidden.

Why ring attention works only because of FlashAttention

Plain attention computes softmax(QK^T)V — the softmax denominator requires the full row of QK^T at once. Naïve sharding of K, V would force you to either materialise the full T×T matrix or recompute things. The online-softmax recurrence derived above makes the softmax decomposable across K, V chunks: keep running statistics (m, ℓ, O), fold in each new chunk with the α = e^{m−m'} rescale. Ring Attention is exactly that recurrence distributed across ranks; lesson 19 applies the identical recurrence inside one GPU's SRAM.

Animated · Ring Attention, one rotation at a time

Below is the same algorithm but rendered as a literal ring of N = 4 ranks. Each rank starts owning (Q_i, K_i, V_i) for its sequence chunk and an online-softmax state (m_i, ℓ_i, O_i). On each tick, every rank uses its current K/V chunk to update its accumulator, then passes K/V to its neighbour. After N - 1 rotations, every rank has seen every K/V chunk — without ever materialising the full attention matrix.

2D · the sequence, sharded

SP and CP both shard the sequence axis, but the scope of what they protect differs. SP shards only the activations around LayerNorm/dropout — the matmul rebuilds the full sequence first. CP shards the QKV and the attention math itself. Below is a long token strip; toggle SP vs CP and watch the colored ownership change. Hover the legend bars for what each rank actually computes.

Animated · online-softmax accumulator

The reason Ring Attention works at all is that FlashAttention's online softmax can ingest K/V chunks sequentially and still produce the same result as a single pass (up to floating-point rounding). Below is that mechanic, isolated: drag through the chunks, watch the running statistics (m, ℓ, O) rescale themselves stably as each chunk arrives.

Causal masking gets a free ride

For causal LMs, rank i only needs to attend to K, V chunks j ≤ i. The ring still rotates K, V through all N - 1 positions (every chunk still passes through every rank), but ~half the block matmuls are skipped because the upper-triangular contributions vanish. A naïve contiguous split leaves later ranks doing more work than earlier ranks; zigzag and striped schedules (Liu et al., Megatron-LM) reorder chunks across ranks so each one ends up with roughly equal compute. Default Megatron-LM uses striped CP.
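The imbalance, and how a reordering removes it, can be counted directly. The sketch below splits the sequence into 2N chunks and counts the (query chunk, key chunk) blocks each rank must compute under a causal mask. The zigzag pairing used here (rank r owns chunks r and 2N − 1 − r) is one illustrative balanced assignment, not a claim about the exact layout any framework uses, and diagonal blocks are counted as full blocks for simplicity.

```python
def causal_blocks(owned_chunks):
    """Blocks with key chunk <= query chunk, summed over the chunks a rank owns."""
    return sum(q + 1 for q in owned_chunks)


n_ranks = 4
n_chunks = 2 * n_ranks

# Contiguous split: rank r owns two neighbouring chunks.
contiguous = [[2 * r, 2 * r + 1] for r in range(n_ranks)]
# Zigzag split: rank r owns one early chunk and its mirror-image late chunk.
zigzag = [[r, n_chunks - 1 - r] for r in range(n_ranks)]

contiguous_work = [causal_blocks(c) for c in contiguous]
zigzag_work = [causal_blocks(c) for c in zigzag]
print(contiguous_work)  # [3, 7, 11, 15]
print(zigzag_work)      # [9, 9, 9, 9]
```

Total work is identical in both cases; only its distribution across ranks changes.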

SP vs CP — different sizes, different problems

Strategy	What it shards	When it bites	Comm
SP (sequence parallel)	LayerNorm/dropout activations	Whenever TP is used and T > ~2k	Free (replaces AllReduce with AllGather+ReduceScatter)
CP (context parallel)	Attention computation itself	T > ~32k	1 ring rotation of K,V per attention block; comm volume ~B · T · 2 · h_kv · d_k per layer

SP is on by default whenever TP > 1 in modern frameworks (Megatron-LM, NeMo). CP is opt-in for long-context pretraining and fine-tuning. The two compose: a job can run TP=8 and CP=4 together, giving 32 ranks total for one layer's work, which is useful for 1M-token training.

Cost detail — when CP becomes profitable

FlashAttention's compute on one rank for T tokens scales as O(B · T² · h · d_k). CP shards this across N ranks so each rank does O(B · T² / N · h · d_k) compute. The ring sends O(B · T/N · h_kv · d_k) bytes per rank per round for N - 1 rounds, so O(B · T · h_kv · d_k) in total per rank; dividing by NVLink bandwidth gives the transfer time. Compute-to-comm ratio in CP:

compute / comm ∝ T · (h · d_k) · BW_per_FLOP / (h_kv · d_k · N)

It's linear in T. The longer your context, the more compute per byte of K, V you move — so CP gets more efficient at longer contexts. This is the opposite of the bandwidth trap TP and FSDP have. CP is the only parallelism strategy in this series that likes long sequences.
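A rough calculator shows the linear trend. This is a sketch under stated assumptions: non-causal forward attention only (two matmuls at 2 FLOPs per multiply-add), bf16 elements, and hypothetical head counts chosen purely for illustration. It reports FLOPs per byte of K/V received, which is the hardware-independent part of the ratio.

```python
def cp_flops_per_byte(seq_len, n_heads, n_kv_heads, d_k, n_ranks,
                      batch=1, bytes_per_elem=2):
    """Attention FLOPs per rank divided by K/V bytes that rank receives."""
    # QK^T and PV: 2 * B * T^2 * h * d_k FLOPs each, sharded over N ranks.
    flops_per_rank = 4 * batch * seq_len**2 * n_heads * d_k / n_ranks
    # Each rank receives N - 1 chunks, each holding K and V for T/N tokens.
    chunk_bytes = 2 * batch * (seq_len // n_ranks) * n_kv_heads * d_k * bytes_per_elem
    bytes_per_rank = (n_ranks - 1) * chunk_bytes
    return flops_per_rank / bytes_per_rank


# Hypothetical model shape: 64 query heads, 8 KV heads (GQA), d_k = 128, CP = 8.
for t in (32_768, 131_072, 1_048_576):
    ratio = cp_flops_per_byte(t, n_heads=64, n_kv_heads=8, d_k=128, n_ranks=8)
    print(f"T = {t:>9,}: {ratio:,.0f} FLOPs per byte")
```

Quadrupling T quadruples the FLOPs available to hide each byte of K/V traffic, which is the "likes long sequences" property in numbers.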

Interactive · how much activation memory does SP/CP save?

Slide T and the parallelism axes. The widget plots activation memory per rank for: no SP/CP, SP only, CP only, both. Watch the long-context regime: at T = 128k, CP is the only thing that matters; below T = 4k, SP alone is enough.

The alternative: DeepSpeed-Ulysses

Ring Attention is not the only way to shard attention across the sequence. DeepSpeed-Ulysses takes a different route to the same goal, and the two trade off cleanly.

Ring Attention rotates K/V around a ring of ranks (point-to-point Send/Recv), folding each chunk into the online-softmax accumulator. It scales to any sequence length (add ranks, shorten the chunks) and its comm volume is O(seq): you move all the K/V, which grows with T.

Ulysses instead does an all-to-all that reshards the activations. Going into attention, the tensors are sequence-sharded (each rank holds T/N tokens, all heads). One all-to-all transposes this to head-sharded: each rank now holds all T tokens but only h/N attention heads. Attention then runs locally and completely on those heads (no ring, no distributed online softmax), and a second all-to-all transposes back to sequence-sharded for the next block. The comm is only O(hidden) per token: each rank exchanges just its own T/N tokens' activations instead of having every K/V chunk pass through it, which is far less volume than Ring at long context. The cost: context parallelism is capped at the number of heads (N ≤ h, or h_kv under GQA; you can't shard 8 KV heads across 16 ranks), and all-to-all wants a high-bandwidth all-to-all fabric, which is where it bites on multi-node.

	Ring Attention	DeepSpeed-Ulysses
Comm pattern	P2P ring of K/V (N−1 rotations)	2 all-to-alls (reshard seq ↔ heads)
Comm volume	O(seq): moves all K/V	O(hidden) per token, much less
Scaling limit	Unbounded: any T, any N	N ≤ h (or h_kv under GQA)
Fabric want	Tolerates P2P, hides behind compute	Needs all-to-all bandwidth
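The Ulysses reshard can be pictured with labelled entries. The sketch below is a toy that models the two all-to-alls as list rearrangements on one machine; each entry is tagged (token, head) so ownership is easy to read, and the function names are made up for illustration, not DeepSpeed's API.

```python
def seq_to_head_shard(seq_sharded, n_ranks):
    """All-to-all: (T/N tokens, all heads) per rank -> (all tokens, h/N heads) per rank."""
    n_heads = len(seq_sharded[0][0])
    heads_per_rank = n_heads // n_ranks
    out = []
    for dst in range(n_ranks):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        # dst receives its head slice from every source rank, in sequence order.
        out.append([tok[lo:hi] for src in range(n_ranks) for tok in seq_sharded[src]])
    return out


def head_to_seq_shard(head_sharded, n_ranks):
    """Inverse all-to-all, back to sequence-sharded for the next block."""
    tokens_per_rank = len(head_sharded[0]) // n_ranks
    out = []
    for dst in range(n_ranks):
        lo, hi = dst * tokens_per_rank, (dst + 1) * tokens_per_rank
        out.append([
            [x for src in range(n_ranks) for x in head_sharded[src][t]]
            for t in range(lo, hi)
        ])
    return out


n_ranks, seq_len, n_heads = 2, 4, 4
tokens_per_rank = seq_len // n_ranks
seq_sharded = [
    [[(t, h) for h in range(n_heads)]
     for t in range(r * tokens_per_rank, (r + 1) * tokens_per_rank)]
    for r in range(n_ranks)
]

head_sharded = seq_to_head_shard(seq_sharded, n_ranks)
print(head_sharded[0])  # rank 0 now sees all 4 tokens, but only heads 0 and 1
round_trip = head_to_seq_shard(head_sharded, n_ranks)
print(round_trip == seq_sharded)
```

After the first reshard each rank can run ordinary single-device attention on its heads, which is why no online-softmax exchange between ranks is needed; the head count dividing evenly by N is what imposes the N ≤ h limit.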

In practice they compose: Ulysses for the head-bounded inner factor, Ring for scaling past the head count. Megatron-style CP defaults to Ring/striped; DeepSpeed ships Ulysses.

Takeaway

SP shards the activations TP can't reach, at no comm cost — turn it on whenever TP is on. CP shards the attention itself by passing K/V around a ring and reusing FlashAttention's online-softmax accumulator; comm is hidden behind compute and the ratio improves with sequence length. SP is universal; CP is the long-context lever.