Async rollout, derived from the longest trajectory
09a found that rollout is usually the dominant term — ~95% of the step for reasoning RL. This lesson opens the rollout box and finds the reason async exists hiding inside one fact: a synchronous batch finishes when its slowest trajectory finishes, not its average one. Everything else — partial rollout, repacking, streaming queues, bounded staleness — is forced by that, and bounded by the off-policy cost of fixing it.
1 · Why the global batch is hostage to its longest trajectory
A synchronous step samples K completions per prompt and cannot start training until all of them return. So the rollout time is a max, not a mean (RL 22b): τ_R = max_i τ_traj,i, where i runs over every trajectory in the batch.
This would be harmless if lengths were tight. They are not — reasoning outputs are heavy-tailed: most completions are short, a few think for 10,000+ tokens. For a length distribution with coefficient of variation c = σ/μ, the expected maximum of K draws grows like (lognormal extreme-value):
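A form consistent with the figures quoted in this lesson, assuming lengths L_i are lognormal with mean μ and using the leading-order Gaussian extreme-value term √(2 ln K), is sketched below. The lognormal model and the symbols L_i and s are assumptions introduced here, and the expression is a rough scaling rather than an exact expectation:

```latex
s^{2} = \ln\!\left(1 + c^{2}\right), \qquad
\frac{\mathbb{E}\!\left[\max_{i \le K} L_i\right]}{\mu}
\;\approx\;
\exp\!\left( s\sqrt{2\ln K} \;-\; \tfrac{1}{2}s^{2} \right)
```

The width c enters through s inside the exponent, while K enters only through √(ln K).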
Plug in the typical reasoning-RL regime — K = 16, c = 1 — and the longest of the 16 is about 5× the mean. Since decode time is roughly linear in length (lesson 04), the batch spends most of its wall-clock waiting on one straggler while 15 actors sit idle. That idle is not a tuning problem; it is the shape of the distribution. This is the entire motivation for breaking the synchronous barrier.
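The 5× figure can be checked numerically. The sketch below assumes the lognormal approximation longest/mean ≈ exp(s·√(2 ln K) − s²/2) with s² = ln(1 + c²); the function name `tail_ratio` is hypothetical, and the formula is a leading-order estimate, not a measured expectation.

```python
import math

def tail_ratio(k: int, c: float) -> float:
    """Approximate (longest of k) / mean for lognormal lengths with CV c."""
    s2 = math.log(1.0 + c * c)  # log-variance of a lognormal with coefficient of variation c
    return math.exp(math.sqrt(2.0 * s2 * math.log(k)) - 0.5 * s2)

if __name__ == "__main__":
    for k, c in [(16, 1.0), (8, 1.0), (16, 0.5)]:
        print(f"K={k:2d} c={c:.1f} -> longest/mean ~ {tail_ratio(k, c):.2f}")
```

Under this approximation, halving K from 16 to 8 lowers the ratio from about 5.0 to about 3.9, while halving c from 1 to 0.5 lowers it to about 2.7, so narrowing the tail moves the straggler more than shrinking the group does.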
2 · The fixes are all "stop waiting for the straggler"
Once you see τ_R as a max over a heavy tail, every rollout optimization is the same move: decouple the fast trajectories from the slow one so the batch boundary no longer pins the cluster. The patches compose multiplicatively (RL 22b's ladder, relative to a static-padded baseline of 1.00×):
| Patch | What it changes | Cumulative τ_R (relative to baseline) |
|---|---|---|
| Static padding (baseline) | Pad every sequence to max length, one global batch. | 1.00× |
| + paged KV + continuous batching | No padding waste; a finished slot is refilled immediately (RL 20). | ~0.30× |
| + length cap at p95 with soft penalty | Truncate the extreme tail; penalize instead of waiting forever. | ~0.18× |
| + dynamic K (kill settled groups) | Stop sampling a prompt once its group reward has converged. | ~0.13× |
| + fp8 rollout + logprob recompute | Faster decode; recover exact logprobs on the learner side. | ~0.08× |
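The second rung of the ladder can be illustrated with a toy scheduler. This is a minimal sketch, assuming one decode step per token per slot and ignoring prefill, memory limits, and batching efficiency; the function names and the length list are hypothetical, and the toy ratio is not meant to reproduce the ~0.30× figure in the table.

```python
import heapq

def static_batches(lengths, slots):
    """Decode steps when each batch of `slots` requests waits for its longest member."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching(lengths, slots):
    """Decode steps when a finished slot is refilled immediately from the queue."""
    free_at = [0] * slots               # min-heap of the times at which each slot frees up
    heapq.heapify(free_at)
    for n in lengths:
        start = heapq.heappop(free_at)  # the earliest available slot takes the next request
        heapq.heappush(free_at, start + n)
    return max(free_at)

lengths = [4000, 100, 100, 100, 3000, 100, 100, 100, 2000, 100, 100, 100]
print(static_batches(lengths, 4))        # 9000
print(continuous_batching(lengths, 4))   # 4000
```

Even with slot refill, the makespan in this toy run equals the single longest request, which is why the later rungs go after the tail itself.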
The endpoint of that ladder is trajectory-level asynchrony: don't wait for any batch at all. Consume each finished trajectory as it lands, repack stragglers into the next micro-batch, and let the learner step on a stream. Partial rollout (pause a long generation, train, resume it later under newer weights) is the same idea pushed into a single trajectory. But every one of these moves the data off the policy that will train on it — which is the cost we have to price next.
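Consuming trajectories as they land can be sketched with the standard library alone. The example below is an illustration under stated assumptions: `rollout` is a hypothetical stand-in for an actor whose decode time is proportional to length, and the "learner" only records arrival order instead of taking gradient steps.

```python
import asyncio

async def rollout(traj_id: int, length_tokens: int) -> tuple[int, int]:
    """Stand-in actor: decode time is modeled as proportional to length."""
    await asyncio.sleep(length_tokens * 1e-4)
    return traj_id, length_tokens

async def stream_to_learner(lengths: list[int]) -> list[int]:
    """Hand each trajectory to the learner the moment it lands; return arrival order."""
    tasks = [asyncio.create_task(rollout(i, n)) for i, n in enumerate(lengths)]
    arrival_order = []
    for finished in asyncio.as_completed(tasks):
        traj_id, _ = await finished
        arrival_order.append(traj_id)   # a real learner would add it to a micro-batch here
    return arrival_order

if __name__ == "__main__":
    print(asyncio.run(stream_to_learner([2000, 100, 600, 300])))
```

Short trajectories reach the learner first and the 2000-token straggler arrives last, without holding up the other three.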
3 · The price of overlap — staleness, and why it is a budget
Break the barrier and the learner is now training on trajectories that were generated by older weights. RL corrects for this with an importance-sampling ratio (RL 06, RL 10): a sample generated under π_{θ−Δ} but trained under π_θ is reweighted by ρ = π_θ / π_{θ−Δ}, evaluated per token on the token that was actually sampled.
The correction is only valid while ρ stays near 1. As staleness Δ (in learner steps) grows, the two policies drift apart, more tokens fall outside the clip range, and the gradient becomes biased and high-variance. Empirically the freshness wall sits around Δ ≈ 4–8 steps, where the clipped fraction crosses ~20–25% and learning destabilizes (RL 22c). So staleness is a budget, not a free parameter: overlap is worth buying only while Δ stays below the freshness wall.
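The ratio and the clipped fraction are cheap to monitor per trajectory. A minimal sketch, assuming per-token log-probabilities are available from both the sampling policy and the current learner, and assuming a symmetric clip range of ε = 0.2 (a common PPO default, not a value given in this lesson); the numbers are made up for illustration.

```python
import math

def is_ratios(logp_new, logp_old):
    """Per-token rho = pi_theta / pi_{theta-delta}, computed from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def clipped_fraction(ratios, eps=0.2):
    """Share of tokens whose ratio falls outside [1 - eps, 1 + eps]."""
    outside = sum(1 for r in ratios if r < 1.0 - eps or r > 1.0 + eps)
    return outside / len(ratios)

# Hypothetical per-token logprobs for one stale trajectory (illustrative numbers).
logp_old = [-1.20, -0.50, -2.00, -0.10]   # stored by the actor at sampling time
logp_new = [-1.15, -0.90, -1.60, -0.12]   # recomputed under the current learner weights
rho = is_ratios(logp_new, logp_old)
print([round(r, 3) for r in rho], clipped_fraction(rho))
```

Working in log space avoids dividing two small probabilities, and tracking the clipped fraction against Δ is one way to locate the freshness wall for a given run.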
4 · So a trajectory must carry its version — the systems object
Now the data schema is not bureaucracy — it is derived. To compute ρ you need the logprobs and the policy version each token was sampled under; to repack and prioritize you need lengths and rewards; to debug a reward bug you need the env events. The trajectory therefore has to be a record, not a triple:
Trajectory {
  prompt_id, task_id, curriculum_bin,
  policy_version, reference_version,        // ← needed to compute ρ and the KL baseline
  tokens, logprobs, masks, tool_calls,      // ← logprobs are the denominator of ρ
  env_events, reward_components, verifier_logs,
  started_at, finished_at, actor_id, env_id,
  length_tokens, wall_time_ms, stale_by_versions   // ← the freshness check
}
That schema is the control surface for every optimization in §2 and §3: partial rollout, dynamic sampling, retry, reward caching, staleness admission, replay prioritization, and regression analysis all read or write fields of it. A framework that logs only "prompt + response + reward" cannot do any of them correctly.
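As a concrete illustration, a subset of the schema can be written as a typed record. This is a sketch, not a framework's actual class: the field names follow the schema above, while the types, the length check, and the choice to derive `stale_by_versions` from the learner's version rather than store it are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    # Field names follow the schema above; the types are assumptions.
    prompt_id: str
    policy_version: int                  # version of the weights that sampled the tokens
    reference_version: int
    tokens: list[int]
    logprobs: list[float]                # one per token, recorded at sampling time
    reward_components: dict[str, float] = field(default_factory=dict)
    started_at: float = 0.0              # seconds
    finished_at: float = 0.0

    def __post_init__(self):
        if len(self.tokens) != len(self.logprobs):
            raise ValueError("need exactly one sampling logprob per token")

    @property
    def length_tokens(self) -> int:
        return len(self.tokens)

    def stale_by_versions(self, learner_version: int) -> int:
        """Freshness check: learner versions elapsed since this trajectory was sampled."""
        return learner_version - self.policy_version

t = Trajectory("p-17", policy_version=41, reference_version=0,
               tokens=[5, 9, 2], logprobs=[-0.3, -1.1, -0.7])
print(t.length_tokens, t.stale_by_versions(learner_version=44))   # 3 3
```

Rejecting a record whose logprobs do not line up with its tokens at construction time keeps a malformed trajectory from silently corrupting ρ later.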
5 · The async graph and where it stalls
Wire those records through services and you get the SOTA rollout plane: a prompt queue feeds an actor fleet, completions go to env/reward, finished trajectories stream into a versioned store, the learner pulls from the store, and a weight service pushes fresh policy back to the actors while a controller enforces admission SLOs.
Each edge is a place the plane can stall, and each stall has a fix and a quality risk — the risk being the §3 staleness budget cashed out in a specific way:
| Stall | Symptom | Fix | Quality risk |
|---|---|---|---|
| Decode wall | Actors at 100%, learner waits for tokens. | More actors, vLLM/SGLang, continuous batching, speculative decode. | Speculation must preserve the target distribution, or it is no longer on-policy. |
| Long-tail completions | A few huge responses hold the batch boundary (§1). | Partial rollout, repacking, abort/retract, length-aware sampling. | Dropping long samples biases toward short reasoning — unless tracked. |
| Environment tail | Tests/browsers/tools dominate p95. | Remote sandbox pools, cached verifiers, async reward queues, timeouts. | Timeout policy becomes a reward-design choice. |
| Store backpressure | Actors finish but can't enqueue; learner sees bursty batches. | Streaming transfer, sharded queues, priority admission. | Queue policy can silently shift the task distribution. |
| Freshness drift | Utilization rises but the reward curve gets worse. | Bounded staleness, staleness-aware PPO, recency priority, version gates. | Too much stale data turns near-on-policy RL into accidental off-policy RL. |
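The "must preserve the target distribution" condition in the decode-wall row can be checked exactly for one token. The sketch below uses the standard speculative-sampling acceptance rule (accept a drafted token x with probability min(1, p[x]/q[x]), otherwise resample from the normalized positive part of p − q); whether any framework named in this lesson implements exactly this rule is not confirmed by the source, and the distributions are hypothetical.

```python
def speculative_output_dist(p, q):
    """Exact one-token output distribution of draft-then-verify sampling.

    p: target-policy probabilities, q: drafter probabilities over the same vocabulary.
    A draft x ~ q is accepted with probability min(1, p[x] / q[x]); on rejection a
    token is resampled from normalize(max(p - q, 0)).
    """
    accept = [min(qx, px) for px, qx in zip(p, q)]   # equals q[x] * min(1, p[x] / q[x])
    reject_mass = 1.0 - sum(accept)
    residual = [max(px - qx, 0.0) for px, qx in zip(p, q)]
    z = sum(residual)
    if z == 0.0:                                     # p == q: nothing is ever rejected
        return accept
    return [a + reject_mass * r / z for a, r in zip(accept, residual)]

p = [0.6, 0.3, 0.1]   # hypothetical target distribution
q = [0.2, 0.5, 0.3]   # hypothetical drafter distribution
print(speculative_output_dist(p, q))
```

Up to floating-point rounding the result equals p regardless of q, which is the sense in which such a scheme is lossless; a drafter that skips the rejection step would sample from q instead and push the data off-policy.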
6 · Backpressure and admission — the controller's SLOs
Async does not mean unbounded. The controller enforces a small set of SLOs that encode the §3 budget and the §1 tail directly:
- Max policy lag: reject or downweight trajectories older than k versions (the freshness wall).
- Max rollout wall time: abort, retract, or isolate requests past the task budget (the tail cap).
- Min batch quality: avoid batches dominated by one task, one reward mode, or one length bucket.
- Queue pressure: throttle prompt admission when the store or reward services saturate.
- Replay policy: decide whether old high-reward samples are allowed, and under which objective.
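Three of these SLOs (policy lag, wall time, queue pressure) reduce to simple predicates. A minimal sketch, assuming hypothetical names (`Candidate`, `admit`, `should_throttle`) and illustrative thresholds that are not recommendations; batch-quality and replay policies need task-level statistics and are left out.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    policy_version: int
    wall_time_ms: int

def admit(c: Candidate, learner_version: int,
          max_lag: int = 4, max_wall_ms: int = 600_000) -> tuple[bool, str]:
    """Return (admitted, reason). Thresholds are illustrative defaults only."""
    lag = learner_version - c.policy_version
    if lag > max_lag:
        return False, f"stale: {lag} versions behind"
    if c.wall_time_ms > max_wall_ms:
        return False, "over wall-time budget"
    return True, "ok"

def should_throttle(queue_depth: int, capacity: int, high_water: float = 0.8) -> bool:
    """Queue pressure: pause prompt admission once the store is mostly full."""
    return queue_depth >= high_water * capacity

print(admit(Candidate(policy_version=40, wall_time_ms=12_000), learner_version=42))
print(admit(Candidate(policy_version=30, wall_time_ms=12_000), learner_version=42))
print(should_throttle(queue_depth=850, capacity=1000))
```

Returning a reason alongside the decision makes rejections countable per cause, which helps when checking whether the admission policy is shifting the task or length distribution.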
SOTA patterns — each is one move on this graph
| Pattern | System move | Seen in |
|---|---|---|
| Continuous rollout serving | Treat actors as inference servers: batching, KV management, prefix reuse, pause/resume for weight updates. | OpenRLHF + vLLM; slime + SGLang; verl rollout backends. |
| Speculative RL rollout | A drafter proposes tokens, the target policy verifies — lossless, so the on-policy distribution is preserved while decode speeds up. | NeMo-RL speculative decoding. |
| Producer-consumer streaming | Rollout, reward, and learner progress through a streamed queue instead of a global phase barrier. | AsyncFlow / Relax TransferQueue. |
| Trajectory-level asynchrony | Consume finished trajectories independently; repack stragglers instead of waiting for the slowest (§1 endpoint). | Laminar relay + repacking; AReaL fully-async. |
| Agent-training disaggregation | Separate arbitrary agent execution from training via a unified transition interface. | Agent Lightning; OpenRLHF token-in-token-out agents. |
| Search/learning decoupling | Scale exploration actors separately from learners with a replay-compatible objective — for sparse rewards, spend compute on search, not more PPO epochs. | Trajectory Balance with Asynchrony. |
Interactive · rollout data-plane planner
Compare the synchronous wall (wait for the tail) to an async plane (overlap + repack the tail, pay a freshness penalty if lag grows). Push tail skew up and watch the synchronous wall explode while async stays flat — that gap is §1 made visible.
What carries forward
- Rollout time is a max, not a mean: τ_R = max_i τ_traj,i, and with heavy-tailed reasoning lengths the longest of K = 16 is ~5× the mean. The synchronous batch is hostage to one straggler.
- Capping the tail beats shrinking K: the max grows like √(ln K) but with the full width c — so length caps, repacking, and partial rollout pay nearly linearly while lowering K barely helps.
- The endpoint is trajectory-level async — consume each finished trajectory immediately — which is why frameworks chase it.
- Overlap costs staleness, priced by the IS ratio ρ = π_θ / π_{θ−Δ}; the freshness wall (Δ ≈ 4–8, clipped fraction ~20–25%) bounds the dial. Measure reward at fixed compute.
- Therefore the trajectory is a versioned record, not a triple — the logprobs and policy version are literally the inputs to the correction, and the control surface for every other optimization.
Sources used
| Source | System idea used |
|---|---|
| AsyncFlow | Streaming data storage, fine-grained scheduling, producer-consumer async workflow. |
| AReaL | Fully asynchronous generation/training with workload balancing and staleness-aware PPO. |
| Laminar | Trajectory-level asynchrony, relay workers, and dynamic repacking for long-tail rollouts. |
| Relax | TransferQueue service decoupling and continuous staleness control. |
| NeMo-RL speculative decoding | System-integrated, distribution-preserving speculative decode for sync and async rollouts. |
| Agent Lightning | Training-agent disaggregation and a unified interface for arbitrary agent trajectories. |
| Trajectory Balance with Asynchrony | Decoupling exploration/search from learning with replay-compatible objectives. |
| OpenRLHF async training | Partial rollout, vLLM pause/resume, rollout/training overlap. |
| SGLang for RL | Sleep/wake, weight-update modes, long-tail rollout controls. |