
Async rollout, derived from the longest trajectory

09a found that rollout is usually the dominant term — ~95% of the step for reasoning RL. This lesson opens the rollout box and finds the reason async exists hiding inside one fact: a synchronous batch finishes when its slowest trajectory finishes, not its average one. Everything else — partial rollout, repacking, streaming queues, bounded staleness — is forced by that, and bounded by the off-policy cost of fixing it.

The design target
Keep actors busy, keep the learner fed, and keep every sample traceable to the policy version that produced it. The first two are throughput; the third is correctness. Throughput without version accounting is just a faster way to train on the wrong distribution — so this lesson derives both halves and the knob between them.

1 · Why the global batch is hostage to its longest trajectory

A synchronous step samples K completions per prompt and cannot start training until all of them return. So the rollout time is a max, not a mean (RL 22b):

$$\tau_R = \max_{1 \le i \le K} \tau_{\mathrm{traj},\,i} \qquad \left(\text{not } \mathrm{mean}_i \, \tau_{\mathrm{traj},\,i}\right)$$

This would be harmless if lengths were tight. They are not: reasoning outputs are heavy-tailed, with most completions short and a few thinking for 10,000+ tokens. If lengths are modeled as lognormal with mean μ and coefficient of variation c = σ/μ, the expected maximum of K draws grows approximately as (a lognormal extreme-value estimate):

$$L_{\max}(K) \approx \mu \cdot \exp\!\left(s\sqrt{2\ln K} - \frac{s^2}{2}\right), \qquad s = \sqrt{\ln\left(1 + c^2\right)}$$

Plug in the typical reasoning-RL regime (K = 16, c = 1) and the longest of the 16 is about 5× the mean. Since decode time is roughly linear in length (lesson 04), the batch spends most of its wall-clock time waiting on one straggler while 15 actors sit idle. That idle time is not a tuning problem; it is the shape of the distribution. This is the entire motivation for breaking the synchronous barrier.

The asymmetry that decides where to spend effort
Lmax depends on K only through √(ln K) in the exponent, but on the tail width c through s in both terms. So lowering K helps only slowly (at c = 1, halving K from 16 to 8 still leaves the longest trajectory near 4× the mean), while narrowing the tail pays back much faster. That is why the fixes below all attack the tail (length caps, repacking, partial rollout) rather than just shrinking the batch.
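The approximation above is easy to evaluate directly. The sketch below does nothing more than compute the formula from this section; the function name `lmax_ratio` is an illustrative choice, and the result is only as good as the lognormal assumption behind it.

```python
import math

def lmax_ratio(k: int, c: float) -> float:
    """Approximate (expected longest of k lengths) / (mean length).

    Assumes lognormal lengths with coefficient of variation c.
    """
    s = math.sqrt(math.log(1.0 + c * c))  # log-space standard deviation
    return math.exp(s * math.sqrt(2.0 * math.log(k)) - 0.5 * s * s)

if __name__ == "__main__":
    for k, c in [(16, 1.0), (8, 1.0), (16, 0.5)]:
        print(f"K={k:>2}, c={c}: longest is about {lmax_ratio(k, c):.2f}x the mean")
```

Under this approximation, halving c from 1.0 to 0.5 at K = 16 brings the ratio to roughly 2.7×, a noticeably larger drop than halving K gives.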

2 · The fixes are all "stop waiting for the straggler"

Once you see τR as a max over a heavy tail, every rollout optimization is the same move — decouple the fast trajectories from the slow one so the batch boundary no longer pins the cluster. They compose multiplicatively (RL 22b's ladder, relative to a static-padded baseline of 1.0×):

| Patch | What it changes | Cumulative τR |
| --- | --- | --- |
| Static padding (baseline) | Pad every sequence to max length, one global batch. | 1.00× |
| + paged KV + continuous batching | No padding waste; a finished slot is refilled immediately (RL 20). | ~0.30× |
| + length cap at p95 with soft penalty | Truncate the extreme tail; penalize instead of waiting forever. | ~0.18× |
| + dynamic K (kill settled groups) | Stop sampling a prompt once its group reward has converged. | ~0.13× |
| + fp8 rollout + logprob recompute | Faster decode; recover exact logprobs on the learner side. | ~0.08× |

The endpoint of that ladder is trajectory-level asynchrony: don't wait for any batch at all. Consume each finished trajectory as it lands, repack stragglers into the next micro-batch, and let the learner step on a stream. Partial rollout (pause a long generation, train, resume it later under newer weights) is the same idea pushed into a single trajectory. But every one of these moves the data off the policy that will train on it — which is the cost we have to price next.
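A toy model makes the cost of the barrier concrete. The sketch below assumes decode time is proportional to length and that each slot decodes independently, which ignores batching efficiency and weight updates. `sync_makespan` and `streaming_makespan` are illustrative names, and the lengths are made up rather than measured.

```python
import heapq

def sync_makespan(lengths, slots):
    """Static waves: each wave of `slots` trajectories waits for its longest."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total

def streaming_makespan(lengths, slots):
    """Continuous refill: a slot takes the next trajectory as soon as it frees up."""
    free_at = [0] * slots          # min-heap of times at which each slot is free
    heapq.heapify(free_at)
    finish = 0
    for n in lengths:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + n
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

if __name__ == "__main__":
    lengths = [1, 1, 1, 10, 1, 1, 1, 10]   # two stragglers among short completions
    print("sync:", sync_makespan(lengths, slots=4))
    print("streaming:", streaming_makespan(lengths, slots=4))
```

With these inputs the barrier version pays for the straggler once per wave, while the refill version only pays for it on the slot that holds it. Neither function models the staleness that streaming introduces.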

3 · The price of overlap — staleness, and why it is a budget

Break the barrier and the learner is now training on trajectories that were generated by older weights. RL corrects for this with an importance-sampling ratio (RL 06, RL 10): a sample generated under πθ−Δ but trained under πθ is reweighted by

$$\rho_t = \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{\theta-\Delta}(y_t \mid y_{<t})}, \qquad \text{clipped to } [1-\varepsilon,\ 1+\varepsilon]$$

The correction is only valid while ρ stays near 1. As staleness Δ (in learner steps) grows, the two policies drift apart, more tokens fall outside the clip range, and the gradient becomes biased and high-variance. Empirically the freshness wall sits around Δ ≈ 4–8 steps, where the clipped fraction crosses ~20–25% and learning destabilizes (RL 22c). So:
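The clipped fraction mentioned above can be monitored from per-token logprobs alone. The sketch below is a simplification: it counts every token whose raw ratio falls outside the clip range, whereas PPO implementations often count clipping conditional on the sign of the advantage. The default `eps=0.2` is an assumed value, not one taken from this lesson.

```python
import math

def importance_ratios(new_logprobs, old_logprobs):
    """Per-token rho = exp(logp under current policy - logp under sampling policy)."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]

def clip(rho, eps=0.2):
    """Clamp a ratio to [1 - eps, 1 + eps]."""
    return min(max(rho, 1.0 - eps), 1.0 + eps)

def clip_fraction(new_logprobs, old_logprobs, eps=0.2):
    """Share of tokens whose raw ratio lies outside the clip range."""
    ratios = importance_ratios(new_logprobs, old_logprobs)
    if not ratios:
        return 0.0
    outside = sum(1 for r in ratios if r < 1.0 - eps or r > 1.0 + eps)
    return outside / len(ratios)
```

Tracking this number per policy-version lag is one way to locate the freshness wall on a given workload rather than assuming it.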

Async is a dial from on-policy to off-policy, not an on/off switch
Δ = 0 is strictly synchronous: cleanest gradient, worst utilization. Larger Δ buys overlap (the idle_band of 09b) but spends statistical validity. The dial is bounded by the freshness wall, and unlike a pure systems knob, pushing past it hurts the model, not just the clock. The only honest metric is reward at fixed GPU-hours — never steps/hour, which always improves and tells you nothing.

4 · So a trajectory must carry its version — the systems object

Now the data schema is not bureaucracy — it is derived. To compute ρ you need the logprobs and the policy version each token was sampled under; to repack and prioritize you need lengths and rewards; to debug a reward bug you need the env events. The trajectory therefore has to be a record, not a triple:

Trajectory {
  prompt_id, task_id, curriculum_bin,
  policy_version, reference_version,   // ← needed to compute ρ and the KL baseline
  tokens, logprobs, masks, tool_calls, // ← logprobs are the denominator of ρ
  env_events, reward_components, verifier_logs,
  started_at, finished_at, actor_id, env_id,
  length_tokens, wall_time_ms, stale_by_versions  // ← the freshness check
}

That schema is the control surface for every optimization in §2 and §3: partial rollout, dynamic sampling, retry, reward caching, staleness admission, replay prioritization, and regression analysis all read or write fields of it. A framework that logs only "prompt + response + reward" cannot do any of them correctly.
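As a minimal sketch, a subset of that record can be written as a Python dataclass. The field types are assumptions (the schema above does not specify them), and staleness is computed here against the learner's current version instead of being stored, which is one possible design and not necessarily the intended one. The `max_staleness=4` default is likewise an assumed placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Trajectory:
    prompt_id: str
    policy_version: int      # learner step whose weights sampled these tokens
    reference_version: int   # version of the reference policy for the KL term
    tokens: List[int] = field(default_factory=list)
    logprobs: List[float] = field(default_factory=list)  # sampling-policy logprobs
    reward_components: Dict[str, float] = field(default_factory=dict)

    @property
    def length_tokens(self) -> int:
        return len(self.tokens)

    def stale_by_versions(self, learner_version: int) -> int:
        """How many learner steps behind the current weights this sample is."""
        return learner_version - self.policy_version

def admit(traj: Trajectory, learner_version: int, max_staleness: int = 4) -> bool:
    """Freshness check: reject samples that lag too far (or claim a future version)."""
    return 0 <= traj.stale_by_versions(learner_version) <= max_staleness
```

For example, a trajectory sampled at version 10 would pass `admit` for a learner at version 12 and fail it at version 15 under the default limit.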

5 · The async graph and where it stalls

Wire those records through services and you get the SOTA rollout plane: a prompt queue feeds an actor fleet, completions go to env/reward, finished trajectories stream into a versioned store, the learner pulls from the store, and a weight service pushes fresh policy back to the actors while a controller enforces admission SLOs.

[Figure: the async rollout loop. Nodes: prompt queue (curriculum); actor fleet (vLLM / SGLang); env / reward (tests, tools, RM); trajectory store (stream + version); learner (PPO / GRPO / off-policy); weight service (bucket / relay); controller (admission + SLO). Caption: SOTA frameworks optimize the queues, placement, and freshness around this loop.]

Each edge is a place the plane can stall, and each stall has a fix and a quality risk — the risk being the §3 staleness budget cashed out in a specific way:

| Stall | Symptom | Fix | Quality risk |
| --- | --- | --- | --- |
| Decode wall | Actors at 100%, learner waits for tokens. | More actors, vLLM/SGLang, continuous batching, speculative decode. | Speculation must preserve the target distribution, or it is no longer on-policy. |
| Long-tail completions | A few huge responses hold the batch boundary (§1). | Partial rollout, repacking, abort/retract, length-aware sampling. | Dropping long samples biases toward short reasoning, unless tracked. |
| Environment tail | Tests/browsers/tools dominate p95. | Remote sandbox pools, cached verifiers, async reward queues, timeouts. | Timeout policy becomes a reward-design choice. |
| Store backpressure | Actors finish but can't enqueue; learner sees bursty batches. | Streaming transfer, sharded queues, priority admission. | Queue policy can silently shift the task distribution. |
| Freshness drift | Utilization rises but the reward curve gets worse. | Bounded staleness, staleness-aware PPO, recency priority, version gates. | Too much stale data turns near-on-policy RL into accidental off-policy RL. |

6 · Backpressure and admission — the controller's SLOs

Async does not mean unbounded. The controller enforces a small set of SLOs that encode the §3 budget and the §1 tail directly.

The trap async designs fall into
Many optimize GPU utilization first, then discover the learner is training on a moving mixture of old policies, easy prompts, and short trajectories. Utilization and sample validity must be optimized together — which is only possible because every trajectory carries its version (§4).
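One way to picture the controller is as a bounded queue with a version gate. The sketch below is hypothetical: the two limits it enforces (queue depth and maximum version lag) are assumptions chosen for illustration, not a list taken from this lesson, and `AdmissionController` is not the API of any framework named here.

```python
from collections import deque

class AdmissionController:
    """Bounded trajectory queue with backpressure and a staleness gate."""

    def __init__(self, max_queue=1024, max_staleness=4):
        self.max_queue = max_queue
        self.max_staleness = max_staleness
        self.queue = deque()       # items are (policy_version, payload)
        self.dropped_stale = 0     # counted so the drop rate stays visible

    def offer(self, policy_version, payload):
        """Actor side. Returns False when full, telling the actor to back off."""
        if len(self.queue) >= self.max_queue:
            return False
        self.queue.append((policy_version, payload))
        return True

    def pull(self, learner_version, n):
        """Learner side. Pops up to n samples, skipping ones past the staleness limit."""
        batch = []
        while self.queue and len(batch) < n:
            version, payload = self.queue.popleft()
            if learner_version - version > self.max_staleness:
                self.dropped_stale += 1
                continue
            batch.append(payload)
        return batch
```

Counting the dropped samples matters as much as dropping them: a silent drop is exactly how a queue policy shifts the training distribution without anyone noticing.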

SOTA patterns — each is one move on this graph

| Pattern | System move | Seen in |
| --- | --- | --- |
| Continuous rollout serving | Treat actors as inference servers: batching, KV management, prefix reuse, pause/resume for weight updates. | OpenRLHF + vLLM; slime + SGLang; verl rollout backends. |
| Speculative RL rollout | A drafter proposes tokens and the target policy verifies them; the scheme is lossless, so the on-policy distribution is preserved while decode speeds up. | NeMo-RL speculative decoding. |
| Producer-consumer streaming | Rollout, reward, and learner progress through a streamed queue instead of a global phase barrier. | AsyncFlow / Relax TransferQueue. |
| Trajectory-level asynchrony | Consume finished trajectories independently; repack stragglers instead of waiting for the slowest (§1 endpoint). | Laminar relay + repacking; AReaL fully-async. |
| Agent-training disaggregation | Separate arbitrary agent execution from training via a unified transition interface. | Agent Lightning; OpenRLHF token-in-token-out agents. |
| Search/learning decoupling | Scale exploration actors separately from learners with a replay-compatible objective; for sparse rewards, spend compute on search, not more PPO epochs. | Trajectory Balance with Asynchrony. |
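The "lossless" claim for speculative rollout can be checked on a single token. The sketch below uses the standard accept/reject rule from the speculative sampling literature: accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p − q. Whether any framework listed above implements exactly this rule is not stated here. The function computes the resulting output distribution in closed form without sampling, so it can be compared with the target policy directly.

```python
def speculative_output_distribution(p, q):
    """Exact distribution of the token emitted by one draft-then-verify step.

    p: target-policy probabilities over the vocabulary
    q: drafter probabilities over the same vocabulary
    """
    # Probability that x is drafted AND accepted: q(x) * min(1, p(x)/q(x)) = min(p, q).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)

    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0.0:   # p == q everywhere, so drafts are always accepted
        return accepted
    return [a + reject_mass * r / z for a, r in zip(accepted, residual)]

if __name__ == "__main__":
    target = [0.7, 0.2, 0.1]
    drafter = [0.3, 0.5, 0.2]
    print(speculative_output_distribution(target, drafter))
```

The output matches the target distribution up to floating-point error for any drafter, which is the property the "Decode wall" row depends on. A verifier that accepts more loosely than this rule would speed up decode but would no longer sample from the target policy.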

Interactive · rollout data-plane planner

Compare the synchronous wall (wait for the tail) to an async plane (overlap + repack the tail, pay a freshness penalty if lag grows). Push tail skew up and watch the synchronous wall explode while async stays flat — that gap is §1 made visible.

Async rollout planner

Sync waits for rollout tail + reward tail + train + sync. Async overlaps rollout with training and repacks the tail, but multiplies by a freshness penalty once max lag passes the wall (~4). The recommendation reads the dominant term back to you.
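The planner's comparison can be approximated in a few lines. The sketch below is a hypothetical reconstruction from the description above: the exact formulas are not given, so the overlap rule (`max` of the rollout path and the train step) and the linear penalty of 15% per step past the wall are assumptions made for illustration.

```python
def sync_step(rollout_tail, reward_tail, train, weight_sync):
    """Synchronous loop: every phase waits for the one before it."""
    return rollout_tail + reward_tail + train + weight_sync

def async_step(rollout_mean, reward_mean, train, max_lag,
               wall=4, penalty_per_step=0.15):
    """Async loop: rollout overlaps training; repacking makes it pay the mean, not the tail.

    The freshness penalty is an assumed linear form that only applies past the wall.
    """
    overlapped = max(rollout_mean + reward_mean, train)
    penalty = 1.0 + penalty_per_step * max(0, max_lag - wall)
    return overlapped * penalty

def speedup(sync_time, async_time):
    return sync_time / async_time

if __name__ == "__main__":
    s = sync_step(rollout_tail=50, reward_tail=10, train=8, weight_sync=2)
    for lag in (2, 6):
        a = async_step(rollout_mean=12, reward_mean=4, train=8, max_lag=lag)
        print(f"max_lag={lag}: sync={s}, async={a:.1f}, speedup={speedup(s, a):.2f}x")
```

The penalty term stands in for lost sample efficiency, so the "speedup" here is in effective time per useful step, not raw wall-clock.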

What carries forward

Sources used

| Source | System idea used |
| --- | --- |
| AsyncFlow | Streaming data storage, fine-grained scheduling, producer-consumer async workflow. |
| AReaL | Fully asynchronous generation/training with workload balancing and staleness-aware PPO. |
| Laminar | Trajectory-level asynchrony, relay workers, and dynamic repacking for long-tail rollouts. |
| Relax | TransferQueue service decoupling and continuous staleness control. |
| NeMo-RL speculative decoding | System-integrated, distribution-preserving speculative decode for sync and async rollouts. |
| Agent Lightning | Training-agent disaggregation and a unified interface for arbitrary agent trajectories. |
| Trajectory Balance with Asynchrony | Decoupling exploration/search from learning with replay-compatible objectives. |
| OpenRLHF async training | Partial rollout, vLLM pause/resume, rollout/training overlap. |
| SGLang for RL | Sleep/wake, weight-update modes, long-tail rollout controls. |