
Long-tail rollouts — the max-of-K problem, packing, and dynamic K

The throughput equation (lesson 22a) said rollout dominates at 60–80% of wall-clock. This lesson explains why most of that isn't raw decode: it's the tail. A small fraction of trajectories accounts for most of the wall-clock, and three patches (sequence packing, length capping, dynamic K) collectively cut τ_R by 2–4× without changing the model or the algorithm.

Where this lesson sits
Second in the three-lesson throughput sub-track. Lesson 22a derived τ_step from the loop and pointed at rollout as the dominant term. This lesson stays inside τ_R and asks the next question: given that rollout dominates, what inside rollout dominates? The answer reshapes how you schedule K-rollout batches.

The first-principles observation: rollout latency is the max, not the sum

Per RL step you sample B · K trajectories: B prompts, each sampled K times for group-baseline algorithms (lessons 11–14). In a synchronous step the trainer cannot start until every trajectory finishes, because the loss needs all K rollouts of each prompt to compute the group mean and the update needs all B groups. τ_R is therefore the time until the slowest trajectory completes, not the average:

$$\tau_R = \max_{i = 1, \dots, B \cdot K} \tau_{\mathrm{traj},\, i}$$

If trajectories had identical length, this would collapse to (mean time per decode step) × (number of decode steps), and the analysis would end here. They don't. Two structural reasons:

[Figure: a typical reasoning-RL run's per-rollout length distribution. x-axis: response length (tokens); y-axis: density. Annotations: median ≈ 500, p99 ≈ 5000; the long tail is ~5% of rollouts but ~30–50% of decode tokens.]

The max-of-K math, in one line

How does τ_R grow with K when lengths are heavy-tailed? Take lengths i.i.d. log-normal with mean μ and coefficient of variation c = σ_L / μ. On the log scale, lengths are Gaussian with standard deviation s = √(ln(1 + c²)). The typical maximum of K samples follows the extreme-value approximation:

$$\max_{K} L \approx \mu \cdot \exp\!\left( s \sqrt{2 \ln K} - \tfrac{s^2}{2} \right)$$

The exponential's argument grows like s · √(ln K) — slow in K, fast in the tail width s. Numbers from the formula:

| K | c = 0.5 (mild) | c = 1.0 (typical reasoning) | c = 2.0 (open-ended) |
|---|---|---|---|
| 4 | 2.0 μ | 2.8 μ | 3.7 μ |
| 16 | 2.7 μ | 5.0 μ | 8.9 μ |
| 64 | 3.5 μ | 7.8 μ | 17.4 μ |
| 256 | 4.3 μ | 11.3 μ | 30.6 μ |
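The table entries can be regenerated directly from the approximation. The sketch below is only a transcription of that formula; the function name `typical_max_factor` is an illustrative choice, not something defined in the lesson.

```python
import math


def typical_max_factor(k: int, cv: float) -> float:
    """Typical max of k i.i.d. log-normal lengths, in units of the mean mu.

    Uses the extreme-value approximation from the text:
    max ~ mu * exp(s * sqrt(2 ln k) - s^2 / 2), with s = sqrt(ln(1 + cv^2)).
    """
    s = math.sqrt(math.log(1.0 + cv * cv))  # log-scale standard deviation
    return math.exp(s * math.sqrt(2.0 * math.log(k)) - 0.5 * s * s)


if __name__ == "__main__":
    for k in (4, 16, 64, 256):
        row = "  ".join(f"{typical_max_factor(k, cv):5.1f}" for cv in (0.5, 1.0, 2.0))
        print(f"K={k:<4d} {row}")
```

Reading down a column shows the slow growth in K; reading across a row shows the much faster growth in c.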

Two consequences worth pinning to a wall:

The reframe
Naive intuition: "lower K to speed up rollout." Wrong, because lowering K hurts gradient SNR (lesson 11) and buys little: the typical max depends on K only through √(ln K), so at c = 1.0 cutting K from 64 to 16 only moves it from 7.8 μ to 5.0 μ. Right intuition: cap the tail, not the K. Anything that lowers c (a length cap, an early-stop heuristic, faster EOS) pays back much faster: at K = 64, halving c from 2.0 to 1.0 moves the typical max from 17.4 μ to 7.8 μ.

The three patches, in order of leverage

[Figure: timeline of one step with K = 8 trajectories, three rows. Static padding: τ_R = 5000 (worst case). Continuous batching: τ_R = 5000 (still bound by the tail). Cap + dynamic K: τ_R ≈ 3000 (tail capped). Legend: blue = padded waste, green = real work, purple = backfill from a fresh prompt.]

Patch 1 — Sequence packing

The original sin of naive batched inference is padding to the longest sequence in the batch. If your batch has lengths [120, 300, 5000, 80, ...], every token-step computes attention over the 5000-position pad for the short sequences. With paged KV (lesson 20) and continuous batching (lesson 21), the rollout engine never pads — it manages each sequence's KV blocks independently and only attends over real tokens. The trainer needs its own version of this:

Sequence packing for the trainer typically buys 1.5–3× throughput on the trainer side. It does not help τ_R directly — the rollout engine already doesn't pad — but it makes the trainer side fast enough that disaggregated topology becomes worth running.
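To get a feel for how much a pad-to-longest batch wastes, one can count token slots. This is a minimal sketch using only the four lengths named in the example above; it counts slots rather than attention cost, so it understates the waste for attention-heavy workloads. `padding_waste` is a hypothetical helper name.

```python
def padding_waste(lengths: list[int]) -> float:
    """Fraction of a pad-to-longest batch that is padding rather than real tokens."""
    padded_slots = len(lengths) * max(lengths)  # every row is padded to the longest
    real_tokens = sum(lengths)
    return 1.0 - real_tokens / padded_slots


if __name__ == "__main__":
    lengths = [120, 300, 5000, 80]  # illustrative batch taken from the text
    print(f"padding waste: {padding_waste(lengths):.1%}")
```

A single long trajectory in the batch is enough to make most slots padding, which is the waste that packing removes.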

Patch 2 — Length cap with shaped penalty

The simplest patch and the highest leverage. Set a hard cap L_max; any trajectory that reaches it without emitting EOS is truncated and its reward shaped down (DAPO calls this "soft overlong punishment", part of its overlong reward shaping; lesson 13).

Two design choices that distinguish a good cap from a leaky one:

Empirically a well-tuned cap cuts τ_R by 30–50% with negligible accuracy cost on verifiable tasks.
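One possible shape for the penalty is piecewise linear: no penalty inside a safe zone, a linear ramp across a buffer just below the cap, and the full penalty at the cap. The sketch below follows that general form; the parameter names `l_max` and `buffer`, the penalty range of 0 to -1, and the example values are assumptions, not settings from this lesson.

```python
def overlong_penalty(length: int, l_max: int, buffer: int) -> float:
    """Length-shaping term to add to the task reward (assumed range: 0 down to -1).

    length <= l_max - buffer : 0.0 (no penalty)
    inside the buffer        : linear ramp from 0.0 down to -1.0
    length >= l_max          : -1.0 (truncated at the hard cap)
    """
    if not 0 < buffer < l_max:
        raise ValueError("buffer must be positive and smaller than l_max")
    soft_start = l_max - buffer
    if length <= soft_start:
        return 0.0
    if length >= l_max:
        return -1.0
    return (soft_start - length) / buffer


if __name__ == "__main__":
    for n in (500, 3072, 3584, 4096):
        print(n, overlong_penalty(n, l_max=4096, buffer=1024))
```

The ramp gives the policy a gradient toward finishing earlier before it ever hits the hard truncation, which is the point of shaping rather than a bare cutoff.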

Patch 3 — Dynamic K and oversampling

The other axis. Standard RL fixes K (say, K=16) and waits for all K trajectories per prompt. The DAPO observation (lesson 13) is that some prompts produce all-equal rewards anyway — those groups contribute zero gradient regardless of how many rollouts you collected. Dynamic sampling oversamples prompts speculatively and drops degenerate groups before they enter the loss.

Two flavors:

The trade is wasted decode FLOPs for wall-clock. On a cluster where rollout dominates τ_step, that trade is almost always worth it — you're throwing away cheap stragglers to claim expensive idle time on the trainer side.
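The filtering step can be sketched in a few lines: oversample prompts, drop every group whose rewards are all equal, and keep the first B groups that survive. All names here (`drop_degenerate_groups`, `fill_batch`, the prompt ids) are hypothetical, and the mean-centered advantage is a simplification that omits any standard-deviation scaling a real algorithm might apply.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Mean-centered advantages for one prompt group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def drop_degenerate_groups(
    groups: dict[str, list[float]], eps: float = 1e-8
) -> dict[str, list[float]]:
    """Keep only groups whose rewards are not all (nearly) equal."""
    return {pid: r for pid, r in groups.items() if max(r) - min(r) > eps}


def fill_batch(groups: dict[str, list[float]], target_prompts: int) -> dict[str, list[float]]:
    """Take the first `target_prompts` non-degenerate groups, in arrival order."""
    kept = drop_degenerate_groups(groups)
    return dict(list(kept.items())[:target_prompts])


if __name__ == "__main__":
    oversampled = {
        "p0": [1.0, 1.0, 1.0, 1.0],  # all correct: zero advantage everywhere
        "p1": [0.0, 1.0, 0.0, 1.0],
        "p2": [0.0, 0.0, 0.0, 0.0],  # all wrong: zero advantage everywhere
        "p3": [1.0, 0.0, 0.0, 0.0],
    }
    print(sorted(fill_batch(oversampled, target_prompts=2)))
    print(group_advantages(oversampled["p0"]))
```

The all-equal groups produce advantages of exactly zero, which is why dropping them costs nothing in gradient signal.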

Interactive: straggler tax simulator

The widget simulates one rollout step of K trajectories drawn from a log-normal length distribution. Pick K, mean length, coefficient of variation, and which patches are enabled. The plot shows individual trajectory durations as bars; the KPIs report τ_R, mean utilization, and the "straggler tax" (max / mean − 1).

What to try. (1) K=16, CV=1.0, no patches: tax ≈ 200–300%; you wait for one bar that's 3–4× the mean. (2) Add length cap: tax falls below 100%; the longest bars get clipped. (3) Add dynamic K on top: stragglers in degenerate groups get killed as well. (4) Crank CV to 2.0 with all patches off: tax explodes past 500%. This is the open-ended-generation regime that makes a long-tail conversation model painful to train without caps.
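For readers without the widget, a small Monte Carlo version of the same experiment is easy to run. This is a rough sketch under the stated log-normal assumption; it models only the length cap (not dynamic K), treats decode time as proportional to length, and all function names are illustrative.

```python
import math
import random


def sample_lengths(k: int, mean_len: float, cv: float, rng: random.Random) -> list[float]:
    """Draw k log-normal lengths with the requested mean and coefficient of variation."""
    s = math.sqrt(math.log(1.0 + cv * cv))
    mu_log = math.log(mean_len) - 0.5 * s * s  # chosen so that E[L] = mean_len
    return [rng.lognormvariate(mu_log, s) for _ in range(k)]


def straggler_tax(lengths: list[float]) -> float:
    """max / mean - 1, as defined in the text."""
    return max(lengths) / (sum(lengths) / len(lengths)) - 1.0


def average_tax(k, mean_len, cv, cap=None, trials=2000, seed=0) -> float:
    """Average straggler tax over many simulated steps, optionally with a hard cap."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        lengths = sample_lengths(k, mean_len, cv, rng)
        if cap is not None:
            lengths = [min(x, cap) for x in lengths]  # truncate at the cap
        total += straggler_tax(lengths)
    return total / trials


if __name__ == "__main__":
    for cv in (0.5, 1.0, 2.0):
        raw = average_tax(16, 800, cv)
        capped = average_tax(16, 800, cv, cap=1600)
        print(f"CV={cv}: tax {raw:.0%} uncapped, {capped:.0%} with cap at 2x mean")
```

Because the same seed is used for both runs, the capped and uncapped numbers compare the same draws, so the difference is attributable to the cap alone.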

Sequence packing — the rollout side and the trainer side

Packing means different things on each side of the disaggregated topology, and conflating them is a common source of pipeline-design confusion.

| Side | What "packing" means | Kernel | Throughput win |
|---|---|---|---|
| Rollout (inference) | Continuous batching: as soon as a sequence finishes, start a new one in its KV slot. No padding ever exists; the batch is always full. | PagedAttention + scheduler (lesson 21) | 2–20× vs static batching |
| Trainer (forward+backward) | Concatenate variable-length trajectories into one sequence with cu_seqlens; FlashAttention varlen computes block-diagonal attention without padding waste. | FlashAttention varlen, chunked CE | 1.5–3× vs padded batch |

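One non-obvious interaction: trainer-side packing changes the loss's natural reduction unit from "per padded slot" to "per token in the packed buffer." A plain mean over the packed tokens is a token-level loss, in which longer trajectories contribute proportionally more gradient; a per-trajectory mean, as in a sample-level group loss, weights every trajectory equally and needs explicit per-segment normalization from cu_seqlens. These are different objectives (DAPO, lesson 13, deliberately chooses the token-level form), so the normalization has to be picked on purpose. Sequence packing without an explicit normalization choice is a silent footgun: throughput goes up, but the algorithm subtly changes.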
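The effect of the normalization choice is easy to see on a toy packed buffer. The sketch below packs per-token loss values, builds `cu_seqlens` as cumulative boundaries, and compares the two reductions. It is plain Python with made-up loss values and hypothetical function names; a real trainer would do this on tensors and pass `cu_seqlens` to a varlen attention kernel.

```python
from itertools import accumulate


def pack(trajectories: list[list[float]]) -> tuple[list[float], list[int]]:
    """Concatenate per-token losses and record segment boundaries (cu_seqlens)."""
    flat = [x for traj in trajectories for x in traj]
    cu_seqlens = [0] + list(accumulate(len(t) for t in trajectories))
    return flat, cu_seqlens


def token_level_loss(flat: list[float]) -> float:
    """Mean over every token in the packed buffer: long trajectories weigh more."""
    return sum(flat) / len(flat)


def trajectory_level_loss(flat: list[float], cu_seqlens: list[int]) -> float:
    """Mean of per-trajectory means: every trajectory weighs the same."""
    seg_means = [
        sum(flat[a:b]) / (b - a) for a, b in zip(cu_seqlens, cu_seqlens[1:])
    ]
    return sum(seg_means) / len(seg_means)


if __name__ == "__main__":
    # Illustrative values: a short trajectory with loss 1.0 per token,
    # and a long one with loss 0.0 per token.
    flat, cu = pack([[1.0] * 2, [0.0] * 8])
    print("cu_seqlens:", cu)
    print("token-level:", token_level_loss(flat))
    print("trajectory-level:", trajectory_level_loss(flat, cu))
```

The two reductions disagree as soon as trajectory lengths differ, so whichever one the algorithm assumes has to be implemented explicitly after packing.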

Dynamic K and the kept-fraction tradeoff

The exact stat to log when you run dynamic K is the kept fraction: k = (rollouts that reached the loss) / (rollouts that ran decode). Three regimes:

One way to think about it: the data pipeline (lesson 18a) keeps the offline kept-fraction high by stratifying prompts; dynamic K keeps the online kept-fraction high by killing groups that didn't pan out. The two patches compose — applying both gets you closer to a regime where every decoded token is useful.
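Logging the kept fraction needs only two counters. A minimal sketch, assuming rollouts are accounted for one prompt group at a time; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RolloutCounters:
    decoded: int = 0  # rollouts that ran decode, including ones later dropped
    kept: int = 0     # rollouts that actually reached the loss

    def record_group(self, num_rollouts: int, reached_loss: bool) -> None:
        """Account for one prompt group after the filter has decided its fate."""
        self.decoded += num_rollouts
        if reached_loss:
            self.kept += num_rollouts

    @property
    def kept_fraction(self) -> float:
        """k = (rollouts that reached the loss) / (rollouts that ran decode)."""
        return self.kept / self.decoded if self.decoded else 0.0


if __name__ == "__main__":
    counters = RolloutCounters()
    counters.record_group(16, reached_loss=True)
    counters.record_group(16, reached_loss=False)  # degenerate group, dropped
    counters.record_group(16, reached_loss=True)
    print(f"kept fraction: {counters.kept_fraction:.2f}")
```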

FP8 / mixed-precision asymmetry

One last lever specific to the rollout side. The trainer and the rollout engine don't need the same precision:

The silent bug surface
When trainer and rollout run at different precisions, the log-prob the trainer computes for a token differs from the log-prob the rollout engine sampled at. Lesson 25 calls this "log-prob mismatch" and names it the highest-impact silent bug in RL infra. The usual fix is to recompute old_logp on the trainer side after the rollout returns (the strict option that adds 10–15% to step time); the kernel-parity alternative — match the trainer's and rollout's numerics carefully enough that the gap is negligible — also works but is harder to maintain. The throughput benefit of FP8 rollout only holds if one of these is engineered in; otherwise you will see clip-fraction climbing and gradient quality dropping, with no obvious cause.
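A cheap way to catch this early is to compare the two sets of log-probs before any policy update has happened, when the importance ratio should be exactly 1 for every token. The sketch below is a diagnostic under assumptions: the clip range of 0.2 is a common PPO-style value rather than one given in this lesson, the log-prob values are invented, and the function names are illustrative.

```python
import math


def mean_abs_gap(trainer_logp: list[float], rollout_logp: list[float]) -> float:
    """Average absolute difference between trainer and rollout log-probs."""
    return sum(abs(t - r) for t, r in zip(trainer_logp, rollout_logp)) / len(trainer_logp)


def clip_fraction(
    trainer_logp: list[float], rollout_logp: list[float], clip_eps: float = 0.2
) -> float:
    """Share of tokens whose ratio exp(trainer - rollout) leaves [1 - eps, 1 + eps]."""
    ratios = [math.exp(t - r) for t, r in zip(trainer_logp, rollout_logp)]
    clipped = sum(1 for x in ratios if x < 1.0 - clip_eps or x > 1.0 + clip_eps)
    return clipped / len(ratios)


if __name__ == "__main__":
    rollout_logp = [-1.0, -2.0, -0.5, -3.0]    # what the rollout engine sampled at
    trainer_logp = [-1.01, -2.3, -0.5, -2.7]   # what the trainer recomputes (made up)
    print("mean |gap|:", round(mean_abs_gap(trainer_logp, rollout_logp), 4))
    print("clip fraction at step 0:", clip_fraction(trainer_logp, rollout_logp))
```

A nonzero clip fraction on the very first minibatch, before the weights have moved, points at a numerics mismatch rather than at the policy update.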

Putting the patches together

For a representative 7B run with K=16, mean L=800, CV≈1.0, the patches compose roughly as follows. Each row applies the row above plus the new patch.

| Configuration | τ_R relative | What changed |
|---|---|---|
| Baseline (static pad, no cap, fixed K) | 1.00× | |
| + paged KV + continuous batch | 0.30× | no padding waste; reuse slots |
| + length cap at p95 with soft penalty | 0.18× | tail truncated, gradient preserved on success |
| + dynamic K (kill on settled groups) | 0.13× | online filter of degenerate groups |
| + FP8 rollout w/ log-prob recompute | 0.08× | halve bytes/token on decode |

An order of magnitude on the dominant term of τ_step, without changing the algorithm or the model. This is what "throughput optimization" looks like on the rollout side.
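Since the rows are cumulative, the contribution of each individual patch is the ratio of adjacent rows. The short sketch below does that arithmetic on the table's own numbers; the stage labels are abbreviated and the function name is illustrative.

```python
def marginal_factors(stages: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Ratio of each cumulative tau_R to the previous row's value."""
    return [(name, cur / prev) for (_, prev), (name, cur) in zip(stages, stages[1:])]


if __name__ == "__main__":
    # Cumulative relative tau_R values copied from the table above.
    stages = [
        ("baseline", 1.00),
        ("paged KV + continuous batch", 0.30),
        ("length cap", 0.18),
        ("dynamic K", 0.13),
        ("FP8 rollout", 0.08),
    ]
    for name, factor in marginal_factors(stages):
        print(f"{name:<30s} x{factor:.2f}")
    print(f"total speedup: {stages[0][1] / stages[-1][1]:.1f}x")
```

Seen this way, removing padding is the single largest step, and the remaining three patches each take roughly another 30–40% off what is left.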

Takeaway
τ_R is the max over B·K trajectories, not the mean, so heavy-tailed length distributions let the straggler dominate wall-clock. Sequence packing eliminates padding waste; a length cap with a shaped penalty truncates the tail; dynamic K kills already-settled groups; FP8 halves the decode bytes/token. The order matters because each patch shrinks whatever dominates after the previous one. After all four, rollout drops from ~75% of τ_step to ~25%, and the next-largest term becomes the new optimization target.