
Rollout — sampling and πold

Autoregressive sampling, why we sample K per prompt, and why old_logp has to be captured at sampling time — not recomputed later.

Quick gloss
πold = the policy weights that actually produced the tokens we sampled. After the trainer takes one step, πold is no longer equal to the current trainer's policy πθ — the gap between them is what lessons 4 (PPO ratio) and 6 (weight sync) are about.

What "rollout" means here

A rollout is the act of producing one response from the policy. Concretely: feed the prompt tokens to the LLM, look at the logits for the next token, sample one, append it, repeat. The full sequence of sampled tokens — usually until an end-of-turn token or a length cap — is one trajectory.
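The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the framework's code: `toy_logits` is a hypothetical stand-in for the LLM forward pass, the four-token vocabulary is made up, and the prompt is left implicit so the function only sees the response so far.

```python
import math
import random

VOCAB = ["1", "2", "3", "#"]  # '#' plays the role of EOS in this toy


def toy_logits(prefix):
    """Hypothetical stand-in for an LLM forward pass: next-token logits for a prefix."""
    # Prefer '1' first, then '2', then EOS. A real model replaces this function.
    preferred = ["1", "2", "#"][min(len(prefix), 2)]
    return [3.0 if tok == preferred else 0.0 for tok in VOCAB]


def sample_response(logits_fn, max_len=8, rng=random):
    """Sample one trajectory token by token until EOS or the length cap."""
    response = []
    for _ in range(max_len):
        logits = logits_fn(response)
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]  # unnormalized softmax
        tok = rng.choices(VOCAB, weights=weights, k=1)[0]
        response.append(tok)
        if tok == "#":  # end-of-turn token: the trajectory is complete
            break
    return "".join(response)


rng = random.Random(0)
print([sample_response(toy_logits, rng=rng) for _ in range(4)])
```

Calling `sample_response` repeatedly with the same prompt yields different responses, which is what sampling K per prompt relies on.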

prompt:      <3+4+5>
sample t=0:  '1'              π_θ(·|prompt)       → multinomial → '1'
sample t=1:  '2'              π_θ(·|prompt,'1')   → multinomial → '2'
sample t=2:  '#' (EOS)        π_θ(·|prompt,'12')  → multinomial → '#'
response:    12#

Three observations about that snippet that turn into design pressure on the framework:

  1. The forward pass at each step is purely inference. No gradients flow. No optimizer state. We just need next-token logits as fast as possible.
  2. The dominant cost is the KV-cache. A naive implementation re-encodes the whole prefix every step, which is O(T²) attention work per step. Production systems cache the keys/values for past tokens and pay O(T) per step. This is what vLLM, SGLang, and TensorRT-LLM exist to do well.
  3. The reward is terminal. The verifier reads the whole final response. There's no per-token reward to credit-assign with; that responsibility lands on the algorithm in lesson 4.
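To put a rough number on the second observation, the sketch below counts how many token positions are pushed through the model with and without a cache. Both function names are hypothetical, and the count is deliberately crude: it ignores that attention over cached keys still grows with the prefix length.

```python
def naive_positions(prompt_len, new_tokens):
    """Positions processed when the full prefix is re-encoded at every decode step."""
    # Step t (0-indexed) re-encodes the prompt plus the t tokens sampled so far.
    return sum(prompt_len + t for t in range(new_tokens))


def cached_positions(prompt_len, new_tokens):
    """Positions processed with a KV cache: prompt once, then one new token per step."""
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)


prompt_len, new_tokens = 16, 512
print(naive_positions(prompt_len, new_tokens))   # grows quadratically in new_tokens
print(cached_positions(prompt_len, new_tokens))  # grows linearly in new_tokens
```

Under these assumptions the gap widens with response length, which is why long reasoning traces make caching non-negotiable.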

Why we sample K — variance reduction

Recall the policy gradient: ∇J ≈ r(y) · ∇ log πθ(y) with y ∼ πθ. One sample is a high-variance estimate. With K samples per prompt we get K independent estimates of the gradient, which we average, and, crucially, we can compute a baseline: the mean reward in the group. Subtracting that mean typically cuts the variance substantially without biasing the gradient, because a constant b times ∇ log πθ(y) is zero in expectation: E[b · ∇ log πθ(y)] = b · ∇ Σ_y πθ(y) = b · ∇1 = 0.
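The effect of a baseline can be checked exactly on a tiny case. The sketch below assumes a two-action policy with π(a=1) = p = sigmoid(θ) and a baseline equal to the true expected reward; the group mean used in practice is a sample estimate of that constant. The numbers are illustrative only, and the size of the reduction depends on the policy and the rewards.

```python
def estimator_stats(p, r1, r0, b):
    """Exact mean and variance of g = (r(a) - b) * d/dθ log π(a).

    Two-action policy with π(a=1) = p = sigmoid(θ), so d/dθ log π(a) = a - p.
    """
    outcomes = [
        (p, (r1 - b) * (1 - p)),      # a = 1, sampled with probability p
        (1 - p, (r0 - b) * (0 - p)),  # a = 0, sampled with probability 1 - p
    ]
    mean = sum(prob * g for prob, g in outcomes)
    var = sum(prob * (g - mean) ** 2 for prob, g in outcomes)
    return mean, var


p, r1, r0 = 0.6, 1.0, 0.0
expected_reward = p * r1 + (1 - p) * r0

print(estimator_stats(p, r1, r0, b=0.0))              # no baseline
print(estimator_stats(p, r1, r0, b=expected_reward))  # same mean, smaller variance here
```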

This is the core of GRPO
GRPO is "REINFORCE with the group mean as a baseline and a per-group standard-deviation normalizer." The group in GRPO is the K rollouts you took from one prompt. That's why every modern reasoning-RL recipe samples in groups.
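As a rough sketch of that one-line description, group-relative advantages for a single prompt might be computed as below. The function name, the `eps` value, and the choice of population standard deviation are assumptions for illustration; real implementations differ on these details.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to its own group of K rollouts."""
    mean = statistics.fmean(rewards)   # group mean acts as the baseline
    std = statistics.pstdev(rewards)   # population std; some codebases use the sample std
    return [(r - mean) / (std + eps) for r in rewards]


# K = 4 rollouts from one prompt, binary verifier reward
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct rollouts get a positive advantage and incorrect ones a negative advantage, regardless of the absolute reward scale.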

Interactive · K rollouts on the toy task

Below: a stand-in policy that has settled on a (deliberately imperfect) habit. Click Sample K trajectories. Each row is one rollout — same prompt, different sampled responses. The reward column is filled by the same verifier you'll meet in lesson 3.

K trajectories from one prompt
Prompt is fixed at <3+4+5> (gold=12). The policy mostly gets it right but sometimes emits a wrong digit — exactly the noise that RL is supposed to clean up.
[Widget readouts: mean reward, std reward, and whether the group is degenerate.]
Degenerate group
If every rollout in the group got the same reward, std=0 and every advantage will be zero — meaning the gradient is zero. There is nothing to learn from this group; GRPO.compute_advantages marks it and the controller skips the optimizer step. You will see this happen at low temperatures with K=2 here.
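How often a group comes out degenerate can be estimated with a one-line formula. The sketch below assumes a binary reward and K independent rollouts that are each correct with probability `p_correct`; low temperature pushes that probability toward 0 or 1, which is what makes small groups degenerate so often.

```python
def p_degenerate(p_correct, k):
    """Probability that all k rollouts share a reward (all correct or all wrong)."""
    return p_correct ** k + (1 - p_correct) ** k


for k in (2, 4, 8, 16):
    print(k, round(p_degenerate(0.9, k), 3))
```

Under these assumptions, raising K is the cheapest way to make sure most groups carry a learning signal.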

The other thing the rollout records: old_logp

At each sampling step we already compute log-probabilities for the chosen token (we needed them to sample). Storing them costs essentially nothing. Why bother? Because the loss in lesson 4 needs the ratio

ρ_t  =  πθ(y_t | y_{<t}, x) / πold(y_t | y_{<t}, x)

where πθ is the current trainer's policy and πold is whatever policy actually produced the tokens. If the trainer takes a step between sampling and computing the loss, these two are different policies. Recomputing old_logp after the step gives you πθ(y) / πθ(y) = 1 exactly — silently destroying the entire purpose of the ratio.
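In practice the ratio is computed in log space from the stored values. The sketch below uses made-up log-probabilities for three tokens and a hypothetical helper name; it only illustrates the arithmetic.

```python
import math


def token_ratios(new_logp, old_logp):
    """ρ_t = exp(log πθ(y_t) - log πold(y_t)), one value per sampled token."""
    return [math.exp(n - o) for n, o in zip(new_logp, old_logp)]


old_logp = [-0.20, -1.10, -0.05]  # stored at sampling time (πold)
new_logp = [-0.15, -1.30, -0.05]  # same tokens scored after one trainer step (πθ)

print(token_ratios(new_logp, old_logp))  # differs from 1 wherever the policy moved
print(token_ratios(new_logp, new_logp))  # "old" recomputed with current weights: all ones
```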

Production bug, observed in the wild
Naive RL implementations re-compute old_logp with the trainer's freshest weights when collating the batch. The ratio is then identically 1, the PPO clip never fires, and training looks "fine" on a loss curve — but the algorithm has silently degenerated to plain REINFORCE without clipping. The fix: capture old_logp during generation, store it on the trajectory, never recompute. See rollout.RolloutEngine.generate.
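One way to make "never recompute" hard to violate is to store the captured values on an immutable record. The sketch below is hypothetical: the `Trajectory` class, its field names, and the `policy_version` tag are illustrative assumptions, not the layout used in rollout.RolloutEngine.generate.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one rollout; all field names are illustrative."""
    prompt_ids: Tuple[int, ...]
    response_ids: Tuple[int, ...]
    old_logp: Tuple[float, ...]     # log πold of each sampled token, captured during generation
    policy_version: int             # which weights produced the sample
    reward: Optional[float] = None  # attached later by the verifier

    def __post_init__(self):
        if len(self.old_logp) != len(self.response_ids):
            raise ValueError("need exactly one old_logp per sampled token")


traj = Trajectory(prompt_ids=(7, 8, 9), response_ids=(1, 2, 3),
                  old_logp=(-0.2, -1.1, -0.05), policy_version=0)
scored = replace(traj, reward=1.0)  # new record with the reward; old_logp carried over untouched
print(scored)
```

Because the dataclass is frozen, any later attempt to overwrite `old_logp` raises an error instead of silently changing the ratio.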

Where this maps in the code

Inside rl_framework/rollout.py, the generation kernel _sample_from_prompt does both jobs at once:

logits, _ = self.model(ids)              # one forward pass; (K, T, V)
last = logits[:, -1, :] / temp            # (K, V)
logp = F.log_softmax(last, dim=-1)        # (K, V)
probs = logp.exp()                        # (K, V)
nxt   = torch.multinomial(probs, 1).squeeze(-1)   # sample;  (K,)

# This line is the whole reason rollout.py exists rather than reusing
# rl_sota/core.rollout_group: we store the log-prob of the *sampled*
# token, in the policy that produced it. That's π_old.
step_logp = logp.gather(-1, nxt.unsqueeze(-1)).squeeze(-1)   # (K,)

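resp_ids[:, t]  = nxt                          # sampled token at step t
resp_mask[:, t] = (~done).float()              # 1.0 while a row is still generating, 0.0 for rows already marked done
tok_logp[:, t]  = step_logp * resp_mask[:, t]  # old_logp per token, zeroed for finished rows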

One subtlety: we sample from logp.exp() rather than from a separately-computed softmax. That ensures the sampling distribution and the scored distribution are identical to bit-equality. Sampling from one distribution and scoring under another is a famous bug — temperature scaling applied differently in the two paths is the most common cause.
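The temperature mismatch is easy to reproduce without a model. In the sketch below the logits and both temperatures are made-up values, and `log_softmax` is a small pure-Python helper written for this example; the point is only that the same weights give a ratio different from 1 when the two paths scale the logits differently.

```python
import math


def log_softmax(logits, temp=1.0):
    """Numerically stable log-softmax of logits / temp."""
    scaled = [l / temp for l in logits]
    top = max(scaled)
    lse = top + math.log(sum(math.exp(s - top) for s in scaled))
    return [s - lse for s in scaled]


logits = [2.0, 1.0, 0.0]
sampled = 0  # index of the token that was drawn

sampling_logp = log_softmax(logits, temp=0.7)[sampled]  # distribution the token was drawn from
scoring_logp = log_softmax(logits, temp=1.0)[sampled]   # scoring path that forgot the temperature

# Identical weights, yet the ratio is not 1: the mismatch looks like policy drift.
print(math.exp(scoring_logp - sampling_logp))
```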

Why this is its own role

A rollout engine has a completely different shape from a trainer:

Aspect             | Rollout engine             | Trainer
Forward / backward | Forward only               | Forward + backward
Memory dominator   | KV cache                   | Activations + optimizer state
Optimal sharding   | Tensor parallel for decode | FSDP / ZeRO-3 for grads
Quantization       | bf16 / int8 OK             | fp32 master + bf16 compute
Batch shape        | Many short, streaming      | One padded (B, T)
Optimized by       | vLLM, SGLang, TRT-LLM      | FSDP, DeepSpeed, Megatron

In a single process they block each other (you never overlap them — one finishes, the other starts). Splitting rollout into its own role with its own GPUs is the entire economic reason production frameworks like verl, OpenRLHF, and SLIME exist.

Takeaway
Rollout's three jobs: (1) sample K trajectories per prompt, (2) record old_logp for each sampled token, (3) attach the terminal reward. Everything else — KV caching, weight sharding, async — is performance engineering on top of this contract.