Rollout — sampling and πold
Autoregressive sampling, why we sample K per prompt, and why old_logp has to be captured at sampling time — not recomputed later.
What "rollout" means here
A rollout is the act of producing one response from the policy. Concretely: feed the prompt tokens to the LLM, look at the logits for the next token, sample one, append it, repeat. The full sequence of sampled tokens — usually until an end-of-turn token or a length cap — is one trajectory.
prompt:      <3+4+5>
sample t=0:  '1'        π_θ(·|prompt)       → multinomial → '1'
sample t=1:  '2'        π_θ(·|prompt,'1')   → multinomial → '2'
sample t=2:  '#' (EOS)  π_θ(·|prompt,'12')  → multinomial → '#'
response:    12#
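The same loop can be sketched in a few lines of plain Python. This is only an illustration: `toy_logits` is a hypothetical stand-in for the LLM forward pass, and the five-symbol vocabulary is invented for the example; neither comes from rl_framework.

```python
import math
import random

VOCAB = ["0", "1", "2", "3", "#"]  # '#' plays the role of EOS, as in the trace
EOS = "#"


def toy_logits(prompt, response):
    """Hypothetical stand-in for the LLM: one logit per vocabulary entry.

    It ignores the prompt and simply makes EOS more likely as the
    response grows, so sampling tends to terminate on its own.
    """
    return [0.0, 1.0, 1.0, 0.0, -1.0 + 1.5 * len(response)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def rollout(prompt, max_len=8, rng=None):
    """Sample one response token by token until EOS or the length cap."""
    rng = rng or random.Random()
    response = []
    for _ in range(max_len):
        probs = softmax(toy_logits(prompt, response))  # next-token distribution
        tok = rng.choices(VOCAB, weights=probs, k=1)[0]  # multinomial draw
        response.append(tok)
        if tok == EOS:
            break
    return "".join(response)


if __name__ == "__main__":
    print(rollout("<3+4+5>", rng=random.Random(0)))
```

Each pass through the `for` loop corresponds to one `sample t=…` row of the trace: compute the distribution, draw a token, append it, and stop on EOS.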
Three observations about that snippet that turn into design pressure on the framework:
- The forward pass at each step is purely inference. No gradients flow. No optimizer state. We just need next-token logits as fast as possible.
- The dominant cost is the KV-cache. A naive implementation re-encodes the whole prefix at every step, so generating T tokens costs O(T²) token encodings in total. Production systems cache the keys and values of past tokens and encode only the new token at each step (it still attends over the O(T) cached entries). This is what vLLM, SGLang, and TensorRT-LLM exist to do well.
- The reward is terminal. The verifier reads the whole final response. There's no per-token reward to credit-assign with; that responsibility lands on the algorithm in lesson 4.
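To make the KV-cache point concrete, here is a rough count of how many token positions are pushed through the model's layers with and without a cache. It is a back-of-the-envelope sketch: the function names are made up for this example, and it deliberately ignores attention cost, which still grows with the prefix length even when keys and values are cached.

```python
def naive_positions(prompt_len, response_len):
    """Positions encoded when every step re-encodes the full prefix.

    Step t (0-indexed) sees the prompt plus the t tokens sampled so far.
    """
    return sum(prompt_len + t for t in range(response_len))


def cached_positions(prompt_len, response_len):
    """Positions encoded with a KV cache.

    The prompt is encoded once (prefill); every later step encodes only
    the single token sampled at the previous step.
    """
    if response_len == 0:
        return 0
    return prompt_len + (response_len - 1)


if __name__ == "__main__":
    for p, t in [(7, 3), (100, 1000)]:
        print(p, t, naive_positions(p, t), cached_positions(p, t))
```

For a 100-token prompt and a 1000-token response, the naive count is 599500 positions against 1099 with the cache, which is the quadratic-versus-linear gap described in the second bullet.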
Why we sample K — variance reduction
Recall the policy gradient: ∇J ≈ r(y) · ∇ log πθ(y) with y ∼ πθ. One sample is a high-variance estimate. With K samples per prompt we get K independent estimates of the gradient, which we average, and, crucially, we can compute a baseline: the mean reward in the group. Subtracting that mean cuts the variance dramatically without changing the direction of the expected gradient, because a constant b times ∇ log πθ has zero expectation: E[b · ∇ log πθ(y)] = b · ∇ Σ πθ(y) = b · ∇1 = 0. (Strictly, the group mean includes each sample's own reward, which scales the expected gradient by (K−1)/K; a leave-one-out mean avoids even that.)
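The effect of a baseline can be checked exactly on a tiny example. The sketch below assumes a policy with only two possible responses, A and B, parameterized by a single logit θ with π(A) = sigmoid(θ); this setup and the function names are illustrative and not part of the lesson's toy task. It uses a fixed constant baseline, which the group mean approximates.

```python
def estimator_moments(p, reward_a, reward_b, baseline):
    """Exact mean and variance of g = (r - b) * d/dθ log π(y).

    With π(A) = p = sigmoid(θ) and π(B) = 1 - p:
      d/dθ log π(A) = 1 - p
      d/dθ log π(B) = -p
    """
    outcomes = [
        (p, (reward_a - baseline) * (1.0 - p)),   # response A is sampled
        (1.0 - p, (reward_b - baseline) * (-p)),  # response B is sampled
    ]
    mean = sum(w * g for w, g in outcomes)
    var = sum(w * (g - mean) ** 2 for w, g in outcomes)
    return mean, var


def group_advantages(rewards):
    """Subtract the group's mean reward from each of the K rewards."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]


if __name__ == "__main__":
    p = 0.3
    expected_reward = p * 1.0 + (1.0 - p) * 0.0
    print(estimator_moments(p, 1.0, 0.0, baseline=0.0))
    print(estimator_moments(p, 1.0, 0.0, baseline=expected_reward))
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```

Both calls to `estimator_moments` return the same mean, while the variance is smaller with the baseline. `group_advantages` is the per-group version used in practice, where the mean of the K sampled rewards stands in for the expected reward.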
[Interactive demo: K rollouts on the toy task. A stand-in policy that has settled on a deliberately imperfect habit samples K trajectories for the same prompt. Each row is one rollout with a different sampled response; the reward column is filled by the same verifier you'll meet in lesson 3.]
The other thing the rollout records: old_logp
At each sampling step we already compute the log-probability of the chosen token (we needed the distribution in order to sample from it). Storing it costs essentially nothing. Why bother? Because the loss in lesson 4 needs the ratio
ρ = πθ(y) / πold(y)
where πθ is the current trainer's policy and πold is whatever policy actually produced the tokens. If the trainer takes a step between sampling and computing the loss, these two are different policies. Recomputing old_logp after the step gives you πθ(y) / πθ(y) = 1 exactly, silently destroying the entire purpose of the ratio.
A common bug: recomputing old_logp with the trainer's freshest weights when collating the batch. The ratio is then identically 1, the PPO clip never fires, and training looks "fine" on a loss curve, but the algorithm has silently degenerated to plain REINFORCE without clipping. The fix: capture old_logp during generation, store it on the trajectory, and never recompute it. See rollout.RolloutEngine.generate.
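The failure mode is easy to reproduce with two sets of logits standing in for the sampler's and the trainer's weights. Everything here is illustrative: the logit values are invented, and the clip range ε = 0.2 is an assumed value, not something taken from this framework.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def ratio(new_logp, old_logp):
    """Importance ratio π_θ / π_old, computed in log space."""
    return math.exp(new_logp - old_logp)


def clip_fires(r, eps=0.2):
    """True when the ratio falls outside [1 - eps, 1 + eps] (eps is assumed)."""
    return r < 1.0 - eps or r > 1.0 + eps


sampler_logits = [2.0, 0.5, -1.0]  # weights that generated the token (π_old)
trainer_logits = [1.0, 1.5, -1.0]  # weights after some optimizer steps (π_θ)
token = 0                          # index of the token that was sampled

old_logp_captured = log_softmax(sampler_logits)[token]    # stored at sampling time
old_logp_recomputed = log_softmax(trainer_logits)[token]  # the bug: fresh weights
new_logp = log_softmax(trainer_logits)[token]

correct = ratio(new_logp, old_logp_captured)
buggy = ratio(new_logp, old_logp_recomputed)

if __name__ == "__main__":
    print("correct ratio:", correct, "clip fires:", clip_fires(correct))
    print("buggy ratio:  ", buggy, "clip fires:", clip_fires(buggy))
```

With the captured value the ratio reflects how far the policy has moved and the clip can engage; with the recomputed value the ratio is exactly 1.0 no matter how far the weights have drifted.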
Where this maps in the code
Inside rl_framework/rollout.py, the generation kernel _sample_from_prompt does both jobs at once:
logits, _ = self.model(ids) # one forward pass; (K, T, V)
last = logits[:, -1, :] / temp # (K, V)
logp = F.log_softmax(last, dim=-1) # (K, V)
probs = logp.exp() # (K, V)
nxt = torch.multinomial(probs, 1).squeeze(-1) # sample; (K,)
# This line is the whole reason rollout.py exists rather than reusing
# rl_sota/core.rollout_group: we store the log-prob of the *sampled*
# token, in the policy that produced it. That's π_old.
step_logp = logp.gather(-1, nxt.unsqueeze(-1)).squeeze(-1) # (K,)
resp_ids [:, t] = nxt
resp_mask[:, t] = (~done).float()
tok_logp [:, t] = step_logp * resp_mask[:, t]
One subtlety: we sample from logp.exp() rather than from a separately computed softmax. That guarantees the sampling distribution and the scored distribution are bit-for-bit identical. Sampling from one distribution and scoring under another is a well-known bug; applying temperature scaling differently in the two paths is the most common cause.
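A small stdlib-only sketch shows how large the mismatch can get when the scoring path forgets the temperature. The logits and the temperature of 0.7 are invented for illustration, and the helper below is written for this example rather than copied from rollout.py.

```python
import math


def log_softmax(logits, temp=1.0):
    """Log-probabilities of softmax(logits / temp)."""
    scaled = [x / temp for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


logits = [2.0, 1.0, 0.0]
temp = 0.7

# Consistent path: the probabilities used for sampling are the
# exponentiated log-probs that also get stored as old_logp.
sampling_logp = log_softmax(logits, temp)
sampling_probs = [math.exp(lp) for lp in sampling_logp]

# Mismatched path: tokens were drawn at temp=0.7, but the scorer
# computes log-probs at temp=1.0.
mismatched_logp = log_softmax(logits, 1.0)

gap = [abs(a - b) for a, b in zip(sampling_logp, mismatched_logp)]

if __name__ == "__main__":
    for i, g in enumerate(gap):
        print(f"token {i}: |logp gap| = {g:.4f}")
```

Any nonzero gap ends up inside the ratio as a spurious factor of exp(gap), even before the trainer has taken a single step, which is why deriving both quantities from one log-softmax is the safer design.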
Why this is its own role
A rollout engine has a completely different shape from a trainer:
| Aspect | Rollout engine | Trainer |
|---|---|---|
| Forward / backward | Forward only | Forward + backward |
| Memory dominator | KV cache | Activations + optimizer state |
| Optimal sharding | Tensor parallel for decode | FSDP / ZeRO-3 for grads |
| Precision / quantization | bf16 / int8 OK | fp32 master + bf16 compute |
| Batch shape | Many short, streaming | One padded (B, T) |
| Optimized by | vLLM, SGLang, TRT-LLM | FSDP, DeepSpeed, Megatron |
In a single process they block each other (you never overlap them — one finishes, the other starts). Splitting rollout into its own role with its own GPUs is the entire economic reason production frameworks like verl, OpenRLHF, and SLIME exist.
The contract of the rollout role is short: (1) sample K responses per prompt, (2) record old_logp for each sampled token, (3) attach the terminal reward. Everything else (KV caching, weight sharding, async) is performance engineering on top of this contract.
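One way to pin this contract down is as a small record type that the rollout engine fills in and the trainer only reads. The `Trajectory` class and its field names below are hypothetical, chosen for illustration; the actual structure in rl_framework may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Trajectory:
    prompt_ids: Tuple[int, ...]
    response_ids: Tuple[int, ...]   # sampled tokens, up to EOS or the length cap
    old_logp: Tuple[float, ...]     # log-prob of each sampled token, captured at sampling time
    reward: float                   # terminal reward from the verifier

    def __post_init__(self):
        # Every sampled token must carry its own old_logp.
        if len(self.response_ids) != len(self.old_logp):
            raise ValueError("old_logp must have one entry per response token")


if __name__ == "__main__":
    traj = Trajectory(
        prompt_ids=(3, 10, 4, 10, 5),
        response_ids=(1, 2, 99),
        old_logp=(-0.4, -0.9, -0.1),
        reward=1.0,
    )
    print(traj)
```

Making the record immutable is a design choice that mirrors the "never recompute" rule: once generation has written old_logp, nothing downstream can overwrite it by accident.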