Async rollout, derived from the longest trajectory

09a found that rollout is usually the dominant term — ~95% of the step for reasoning RL. This lesson opens the rollout box and finds the reason async exists hiding inside one fact: a synchronous batch finishes when its slowest trajectory finishes, not its average one. Everything else — partial rollout, repacking, streaming queues, bounded staleness — is forced by that, and bounded by the off-policy cost of fixing it.

The design target

Keep actors busy, keep the learner fed, and keep every sample traceable to the policy version that produced it. The first two are throughput; the third is correctness. Throughput without version accounting is just a faster way to train on the wrong distribution — so this lesson derives both halves and the knob between them.

1 · Why the global batch is hostage to its longest trajectory

A synchronous step samples K completions per prompt and cannot start training until all of them return. So the rollout time is a max, not a mean (RL 22b):

τ_R = max_{1 ≤ i ≤ K} τ_{traj, i} (not mean_i τ_{traj, i})

This would be harmless if lengths were tight. They are not: reasoning outputs are heavy-tailed, with most completions short and a few thinking for 10,000+ tokens. Model the lengths as lognormal with mean μ and coefficient of variation c = σ/μ; the expected maximum of K draws then grows like the lognormal extreme-value approximation:

L_max(K) ≈ μ · exp( s·√(2 ln K) − s²/2 ), s = √(ln(1 + c²))

Plug in the typical reasoning-RL regime (K = 16, c = 1): s = √(ln 2) ≈ 0.83 and √(2 ln 16) ≈ 2.35, so the exponent is about 1.61 and the longest of the 16 is about 5× the mean. Since decode time is roughly linear in length (lesson 04), the batch spends most of its wall-clock time waiting on one straggler while 15 actors sit idle. That idle time is not a tuning problem; it is the shape of the distribution. This is the entire motivation for breaking the synchronous barrier.

The asymmetry that decides where to spend effort

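K enters L_max only through √(2 ln K), while the tail width c sets the factor s that multiplies it. So lowering K helps only weakly, while narrowing the tail pays back nearly linearly: under the formula above, halving K from 16 to 8 moves the max only from about 5.0× to about 3.9× the mean, whereas halving c from 1 to 0.5 cuts it to about 2.7×. That is why the fixes below all attack the tail (length caps, repacking, partial rollout) rather than just shrinking the batch.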
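The approximation in §1 is easy to evaluate directly. The sketch below only plugs numbers into that formula (it is not a simulation of real length data), and the function name `expected_max_ratio` is a hypothetical helper introduced here for illustration.

```python
import math


def expected_max_ratio(k: int, c: float) -> float:
    """L_max(K) / mu under the lognormal extreme-value approximation.

    k: completions sampled per prompt (K in the text)
    c: coefficient of variation of the length distribution (sigma / mu)
    """
    s = math.sqrt(math.log(1.0 + c * c))  # log-space standard deviation
    return math.exp(s * math.sqrt(2.0 * math.log(k)) - 0.5 * s * s)


# Baseline regime, then halve K, then halve the tail width c.
for k, c in [(16, 1.0), (8, 1.0), (16, 0.5)]:
    print(f"K={k:>2}, c={c}: L_max / mu ~ {expected_max_ratio(k, c):.2f}")
# K=16, c=1.0: L_max / mu ~ 5.02
# K= 8, c=1.0: L_max / mu ~ 3.86
# K=16, c=0.5: L_max / mu ~ 2.72
```

Halving K removes roughly a quarter of the predicted wait; halving c removes almost half of it, which is the asymmetry described above.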

2 · The fixes are all "stop waiting for the straggler"

Once you see τ_R as a max over a heavy tail, every rollout optimization is the same move — decouple the fast trajectories from the slow one so the batch boundary no longer pins the cluster. They compose multiplicatively (RL 22b's ladder, relative to a static-padded baseline of 1.0×):

Patch	What it changes	Cumulative τ_R
Static padding (baseline)	Pad every sequence to max length, one global batch.	1.00×
+ paged KV + continuous batching	No padding waste; a finished slot is refilled immediately (RL 20).	~0.30×
+ length cap at p95 with soft penalty	Truncate the extreme tail; penalize instead of waiting forever.	~0.18×
+ dynamic K (kill settled groups)	Stop sampling a prompt once its group reward has converged.	~0.13×
+ fp8 rollout + logprob recompute	Faster decode; recover exact logprobs on the learner side.	~0.08×

The endpoint of that ladder is trajectory-level asynchrony: don't wait for any batch at all. Consume each finished trajectory as it lands, repack stragglers into the next micro-batch, and let the learner step on a stream. Partial rollout (pause a long generation, train, resume it later under newer weights) is the same idea pushed into a single trajectory. But every one of these moves the data off the policy that will train on it — which is the cost we have to price next.
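Consuming trajectories as they land can be illustrated with a toy producer-consumer loop. This is a minimal sketch, not any framework's API: `rollout` stands in for an actor (decode time is simulated with a sleep), `stream_microbatches` is a hypothetical name, and the delays are made-up values chosen so that trajectory 2 is the straggler.

```python
import asyncio
from typing import List


async def rollout(traj_id: int, decode_seconds: float) -> int:
    """Stand-in for one actor generating one trajectory."""
    await asyncio.sleep(decode_seconds)  # simulated decode time
    return traj_id


async def stream_microbatches(decode_seconds: List[float], micro_batch_size: int) -> List[List[int]]:
    """Group trajectories into micro-batches in completion order, with no global barrier."""
    tasks = [asyncio.create_task(rollout(i, t)) for i, t in enumerate(decode_seconds)]
    batches: List[List[int]] = []
    current: List[int] = []
    for finished in asyncio.as_completed(tasks):
        current.append(await finished)
        if len(current) == micro_batch_size:
            batches.append(current)  # a learner step could run here
            current = []
    if current:
        batches.append(current)  # leftover stragglers form the last micro-batch
    return batches


# Trajectory 2 is the straggler; it is repacked into the second micro-batch
# instead of holding up trajectories 1 and 3.
batches = asyncio.run(stream_microbatches([0.25, 0.05, 0.5, 0.15], micro_batch_size=2))
print(batches)  # [[1, 3], [0, 2]]
```

The first micro-batch is ready after 0.15 s of simulated decode rather than 0.5 s, but in a real system the second one would be trained on by newer weights than the ones that generated it.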

3 · The price of overlap — staleness, and why it is a budget

Break the barrier and the learner is now training on trajectories that were generated by older weights. RL corrects for this with an importance-sampling ratio (RL 06, RL 10): a sample generated under π_{θ−Δ} but trained under π_θ is reweighted by

ρ_t = π_θ(y_t | y_{<t}) / π_{θ−Δ}(y_t | y_{<t}), clipped to [1−ε, 1+ε]

The correction is only valid while ρ stays near 1. As staleness Δ (in learner steps) grows, the two policies drift apart, more tokens fall outside the clip range, and the gradient becomes biased and high-variance. Empirically the freshness wall sits around Δ ≈ 4–8 steps, where the clipped fraction crosses ~20–25% and learning destabilizes (RL 22c). So:

Async is a dial from on-policy to off-policy, not an on/off switch

Δ = 0 is strictly synchronous: cleanest gradient, worst utilization. Larger Δ buys overlap (the idle_band of 09b) but spends statistical validity. The dial is bounded by the freshness wall, and unlike a pure systems knob, pushing past it hurts the model, not just the clock. The only honest metric is reward at fixed GPU-hours — never steps/hour, which always improves and tells you nothing.
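The ratio and the clipped fraction from §3 can be computed from stored logprobs alone. The sketch below assumes per-token logprobs are available from both the behavior policy (the stale weights) and the current learner; ε = 0.2 is an assumed value, the logprob numbers are made up, and the helper names are hypothetical. It only measures ρ and the fraction outside the clip range as a monitoring signal; it does not implement a full PPO objective.

```python
import math
from typing import List, Tuple


def is_ratios(new_logprobs: List[float], old_logprobs: List[float]) -> List[float]:
    """rho_t = exp(log pi_theta - log pi_{theta-Delta}) for each token."""
    if len(new_logprobs) != len(old_logprobs):
        raise ValueError("logprob sequences must have the same length")
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]


def clip_stats(ratios: List[float], eps: float = 0.2) -> Tuple[List[float], float]:
    """Return the clipped ratios and the fraction of tokens outside [1-eps, 1+eps]."""
    lo, hi = 1.0 - eps, 1.0 + eps
    clipped = [min(max(r, lo), hi) for r in ratios]
    outside = sum(1 for r in ratios if r < lo or r > hi)
    return clipped, outside / len(ratios)


old = [-1.0, -2.5, -0.45, -2.0]  # logprobs recorded by the actor (stale policy)
new = [-1.0, -2.0, -0.50, -3.0]  # logprobs recomputed under the current learner
ratios = is_ratios(new, old)
clipped, clip_frac = clip_stats(ratios)
print([round(r, 3) for r in ratios], clip_frac)  # [1.0, 1.649, 0.951, 0.368] 0.5
```

Tracking `clip_frac` per batch, bucketed by staleness, is one way to see where the freshness wall sits for a given run.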

4 · So a trajectory must carry its version — the systems object

Now the data schema is not bureaucracy — it is derived. To compute ρ you need the logprobs and the policy version each token was sampled under; to repack and prioritize you need lengths and rewards; to debug a reward bug you need the env events. The trajectory therefore has to be a record, not a triple:

Trajectory {
  prompt_id, task_id, curriculum_bin,
  policy_version, reference_version,   // ← needed to compute ρ and the KL baseline
  tokens, logprobs, masks, tool_calls, // ← logprobs are the denominator of ρ
  env_events, reward_components, verifier_logs,
  started_at, finished_at, actor_id, env_id,
  length_tokens, wall_time_ms, stale_by_versions  // ← the freshness check
}

That schema is the control surface for every optimization in §2 and §3: partial rollout, dynamic sampling, retry, reward caching, staleness admission, replay prioritization, and regression analysis all read or write fields of it. A framework that logs only "prompt + response + reward" cannot do any of them correctly.
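A few of these fields can be sketched as a typed record. The field types, the validation, and the choice to derive staleness from the learner's current version (rather than store it) are all assumptions made for illustration; only the field names come from the schema above, and most fields are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Trajectory:
    """Subset of the schema; types are assumed, not prescribed by the text."""
    prompt_id: str
    policy_version: int      # weights that sampled the tokens
    reference_version: int   # weights used for the KL baseline
    tokens: List[int]
    logprobs: List[float]    # behavior-policy logprobs, one per token
    reward_components: Dict[str, float] = field(default_factory=dict)
    wall_time_ms: int = 0

    def __post_init__(self) -> None:
        # The ratio needs one behavior logprob per sampled token.
        if len(self.tokens) != len(self.logprobs):
            raise ValueError("tokens and logprobs must align one-to-one")

    @property
    def length_tokens(self) -> int:
        return len(self.tokens)

    def stale_by_versions(self, learner_version: int) -> int:
        """How many policy versions behind the learner this sample is."""
        return learner_version - self.policy_version


traj = Trajectory("p0", policy_version=7, reference_version=0,
                  tokens=[11, 42, 7], logprobs=[-0.1, -0.2, -0.3])
print(traj.length_tokens, traj.stale_by_versions(learner_version=10))  # 3 3
```

A record that fails the alignment check cannot be corrected for staleness at all, so rejecting it at construction time is cheaper than discovering the mismatch inside the loss.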

5 · The async graph and where it stalls

Wire those records through services and you get the SOTA rollout plane: a prompt queue feeds an actor fleet, completions go to env/reward, finished trajectories stream into a versioned store, the learner pulls from the store, and a weight service pushes fresh policy back to the actors while a controller enforces admission SLOs.

Each edge is a place the plane can stall, and each stall has a fix and a quality risk — the risk being the §3 staleness budget cashed out in a specific way:

Stall	Symptom	Fix	Quality risk
Decode wall	Actors at 100%, learner waits for tokens.	More actors, vLLM/SGLang, continuous batching, speculative decode.	Speculation must preserve the target distribution, or it is no longer on-policy.
Long-tail completions	A few huge responses hold the batch boundary (§1).	Partial rollout, repacking, abort/retract, length-aware sampling.	Dropping long samples biases toward short reasoning — unless tracked.
Environment tail	Tests/browsers/tools dominate p95.	Remote sandbox pools, cached verifiers, async reward queues, timeouts.	Timeout policy becomes a reward-design choice.
Store backpressure	Actors finish but can't enqueue; learner sees bursty batches.	Streaming transfer, sharded queues, priority admission.	Queue policy can silently shift the task distribution.
Freshness drift	Utilization rises but the reward curve gets worse.	Bounded staleness, staleness-aware PPO, recency priority, version gates.	Too much stale data turns near-on-policy RL into accidental off-policy RL.

6 · Backpressure and admission — the controller's SLOs

Async does not mean unbounded. The controller enforces a small set of SLOs that encode the §3 budget and the §1 tail directly:

Max policy lag: reject or downweight trajectories older than k versions (the freshness wall).
Max rollout wall time: abort, retract, or isolate requests past the task budget (the tail cap).
Min batch quality: avoid batches dominated by one task, one reward mode, or one length bucket.
Queue pressure: throttle prompt admission when the store or reward services saturate.
Replay policy: decide whether old high-reward samples are allowed, and under which objective.
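Two of these SLOs, max policy lag and queue pressure, reduce to small pure functions. The sketch below is one possible shape under stated assumptions: the function names, the `max_lag / lag` downweighting schedule, and the 0.9 high-watermark are hypothetical choices, not taken from any of the systems cited in this lesson.

```python
def lag_weight(policy_version: int, learner_version: int,
               max_lag: int, downweight: bool = True) -> float:
    """Sample weight under a max-policy-lag SLO.

    Returns 1.0 inside the budget. Past it, either rejects (0.0) or
    downweights by max_lag / lag (an assumed schedule).
    """
    lag = learner_version - policy_version
    if lag < 0:
        raise ValueError("trajectory claims a version newer than the learner")
    if lag <= max_lag:
        return 1.0
    return max_lag / lag if downweight else 0.0


def admit_prompt(queue_depth: int, queue_capacity: int,
                 high_watermark: float = 0.9) -> bool:
    """Throttle prompt admission when the trajectory store is close to full."""
    return queue_depth < high_watermark * queue_capacity


print(lag_weight(8, 10, max_lag=4))                    # 1.0 (lag 2, inside budget)
print(lag_weight(2, 10, max_lag=4))                    # 0.5 (lag 8, downweighted)
print(lag_weight(2, 10, max_lag=4, downweight=False))  # 0.0 (lag 8, rejected)
print(admit_prompt(95, 100))                           # False (store nearly full)
```

Both functions need only the version and queue counters, which is why the controller can enforce them without inspecting trajectory contents.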

The trap async designs fall into

Many optimize GPU utilization first, then discover the learner is training on a moving mixture of old policies, easy prompts, and short trajectories. Utilization and sample validity must be optimized together — which is only possible because every trajectory carries its version (§4).

SOTA patterns — each is one move on this graph

Pattern	System move	Seen in
Continuous rollout serving	Treat actors as inference servers: batching, KV management, prefix reuse, pause/resume for weight updates.	OpenRLHF + vLLM; slime + SGLang; verl rollout backends.
Speculative RL rollout	A drafter proposes tokens, the target policy verifies — lossless, so the on-policy distribution is preserved while decode speeds up.	NeMo-RL speculative decoding.
Producer-consumer streaming	Rollout, reward, and learner progress through a streamed queue instead of a global phase barrier.	AsyncFlow / Relax TransferQueue.
Trajectory-level asynchrony	Consume finished trajectories independently; repack stragglers instead of waiting for the slowest (§1 endpoint).	Laminar relay + repacking; AReaL fully-async.
Agent-training disaggregation	Separate arbitrary agent execution from training via a unified transition interface.	Agent Lightning; OpenRLHF token-in-token-out agents.
Search/learning decoupling	Scale exploration actors separately from learners with a replay-compatible objective — for sparse rewards, spend compute on search, not more PPO epochs.	Trajectory Balance with Asynchrony.

Interactive · rollout data-plane planner

Compare the synchronous wall (wait for the tail) to an async plane (overlap + repack the tail, pay a freshness penalty if lag grows). Push tail skew up and watch the synchronous wall explode while async stays flat — that gap is §1 made visible.
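The same comparison can be approximated offline with a toy model. This is a simplified sketch under several assumptions: lengths are lognormal with mean 1, decode time equals length, the "async" side is modeled only as immediate slot refill (no freshness penalty, no learner), and the function names are hypothetical. It is not a reproduction of the planner.

```python
import heapq
import math
import random
from typing import List


def sample_lengths(n: int, c: float, rng: random.Random) -> List[float]:
    """n lognormal lengths with mean 1 and coefficient of variation c."""
    s = math.sqrt(math.log(1.0 + c * c))
    return [rng.lognormvariate(-0.5 * s * s, s) for _ in range(n)]


def sync_wall(lengths: List[float], k: int) -> float:
    """Barrier after every group of k: each group costs its longest member."""
    return sum(max(lengths[i:i + k]) for i in range(0, len(lengths), k))


def refill_wall(lengths: List[float], k: int) -> float:
    """k slots, each refilled as soon as it frees up; returns the makespan."""
    slots = [0.0] * k  # time at which each slot becomes free
    heapq.heapify(slots)
    finish = 0.0
    for length in lengths:
        start = heapq.heappop(slots)  # earliest free slot
        end = start + length
        heapq.heappush(slots, end)
        finish = max(finish, end)
    return finish


rng = random.Random(0)
for c in (0.25, 1.0, 2.0):  # increasing tail skew
    lengths = sample_lengths(1600, c, rng)
    print(f"c={c}: sync={sync_wall(lengths, 16):.1f}  refill={refill_wall(lengths, 16):.1f}")
```

With the same job order, the refill makespan can never exceed the synchronous total, because every job starts no later than its barrier would have allowed. The gap between the two columns widens as c grows.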

What carries forward

Rollout time is a max, not a mean: τ_R = max_i τ_{traj, i}, and with heavy-tailed reasoning lengths the longest of K = 16 is ~5× the mean. The synchronous batch is hostage to one straggler.
Capping the tail beats shrinking K: K enters the max only through √(2 ln K), while the width c scales the whole exponent, so length caps, repacking, and partial rollout pay nearly linearly while lowering K helps only weakly.
The endpoint is trajectory-level async — consume each finished trajectory immediately — which is why frameworks chase it.
Overlap costs staleness, priced by the IS ratio ρ = π_θ / π_{θ−Δ}; the freshness wall (Δ ≈ 4–8 steps, clipped fraction ~20–25%) bounds the dial. Measure reward at fixed compute.
Therefore the trajectory is a versioned record, not a triple — the logprobs and policy version are literally the inputs to the correction, and the control surface for every other optimization.

Sources used

Source	System idea used
AsyncFlow	Streaming data storage, fine-grained scheduling, producer-consumer async workflow.
AReaL	Fully asynchronous generation/training with workload balancing and staleness-aware PPO.
Laminar	Trajectory-level asynchrony, relay workers, and dynamic repacking for long-tail rollouts.
Relax	TransferQueue service decoupling and continuous staleness control.
NeMo-RL speculative decoding	System-integrated, distribution-preserving speculative decode for sync and async rollouts.
Agent Lightning	Training-agent disaggregation and a unified interface for arbitrary agent trajectories.
Trajectory Balance with Asynchrony	Decoupling exploration/search from learning with replay-compatible objectives.
OpenRLHF async training	Partial rollout, vLLM pause/resume, rollout/training overlap.
SGLang for RL	Sleep/wake, weight-update modes, long-tail rollout controls.