Rollout — sampling and πold
Autoregressive sampling, why we sample K per prompt, and why old_logp has to be captured at sampling time — not recomputed later.
What "rollout" means here
A rollout is the act of producing one response from the policy. Concretely: feed the prompt tokens to the LLM, look at the logits for the next token, sample one, append it, repeat. The full sequence of sampled tokens — usually until an end-of-turn token or a length cap — is one trajectory.
```text
prompt:       <3+4+5>
sample t=0:   '1'         # π_θ(·|prompt)       → multinomial → '1'
sample t=1:   '2'         # π_θ(·|prompt,'1')   → multinomial → '2'
sample t=2:   '#' (EOS)   # π_θ(·|prompt,'12')  → multinomial → '#'
response:     12#
```
Three observations about that snippet that turn into design pressure on the framework:
- The forward pass at each step is purely inference. No gradients flow. No optimizer state. We just need next-token logits as fast as possible.
- The dominant cost is the KV cache. A naive implementation re-encodes the whole prefix at every step, which is O(T²) attention work per step. Production systems cache the keys/values for past tokens and pay O(T) per step. This is what vLLM, SGLang, and TensorRT-LLM exist to do well.
- The reward is terminal. The verifier reads the whole final response. There's no per-token reward to credit-assign with; that responsibility lands on the algorithm in lesson 25.
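The sampling loop described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the framework's code: `toy_logits` is a hypothetical stand-in for the LLM forward pass, the three-token vocabulary is invented for illustration, the prompt is left implicit, and no KV cache is modeled.

```python
import math
import random

EOS = "#"
VOCAB = ["1", "2", "#"]  # invented toy vocabulary


def toy_logits(prefix):
    """Hypothetical stand-in for the LLM forward pass: next-token logits."""
    # Prefers '1', then '2', then EOS, depending on how much was generated.
    preferred = VOCAB[min(len(prefix), 2)]
    return [2.0 if tok == preferred else 0.0 for tok in VOCAB]


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def rollout(rng, max_len=8):
    """Sample one response; return (tokens, log-prob of each sampled token)."""
    tokens, logps = [], []
    for _ in range(max_len):
        logp = log_softmax(toy_logits(tokens))   # inference only, no gradients
        probs = [math.exp(lp) for lp in logp]
        idx = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        tokens.append(VOCAB[idx])
        logps.append(logp[idx])                  # recorded at sampling time
        if VOCAB[idx] == EOS:                    # stop at end-of-turn
            break
    return tokens, logps


rng = random.Random(0)
tokens, logps = rollout(rng)
print("".join(tokens), [round(lp, 3) for lp in logps])
```

Calling `rollout` K times with the same prompt gives the K trajectories of one group; the terminal reward is attached afterwards, once the whole response exists.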
Why we sample K — variance reduction
Recall the policy gradient: ∇J ≈ r(y) · ∇ log πθ(y) with y ∼ πθ. One sample is a high-variance estimate. With K samples per prompt we get K independent estimates of the gradient, which we average, and, crucially, we can compute a baseline: the mean reward in the group. Subtracting that mean cuts the variance dramatically without changing the direction of the expected gradient, because for any baseline b that does not depend on the sample, E[b · ∇ log πθ(y)] = b · ∇ Σ πθ(y) = 0. (The group mean includes each sample's own reward, which only rescales the expected gradient by a factor of (K−1)/K.)
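To make the variance claim concrete, here is a small Monte Carlo sketch. Everything in it is an assumption chosen for illustration: a two-action Bernoulli policy with a single logit θ, hypothetical rewards of 11 and 10 (a large shared offset is exactly what a baseline removes), and K = 4.

```python
import random
from statistics import mean, pvariance

P = 0.5                       # probability the toy policy picks action 1
REWARD = {1: 11.0, 0: 10.0}   # hypothetical rewards with a large shared offset
K = 4                         # rollouts per prompt


def score(action):
    """d/dθ log π(action) for a Bernoulli policy with logit θ."""
    return (1.0 - P) if action == 1 else -P


def grad_estimate(rng, use_baseline):
    """Average of K per-sample gradient estimates for one prompt."""
    actions = [1 if rng.random() < P else 0 for _ in range(K)]
    rewards = [REWARD[a] for a in actions]
    baseline = mean(rewards) if use_baseline else 0.0
    return mean((r - baseline) * score(a) for a, r in zip(actions, rewards))


rng = random.Random(0)
plain = [grad_estimate(rng, use_baseline=False) for _ in range(20000)]
based = [grad_estimate(rng, use_baseline=True) for _ in range(20000)]

print(f"no baseline:         mean={mean(plain):.3f} var={pvariance(plain):.4f}")
print(f"group-mean baseline: mean={mean(based):.3f} var={pvariance(based):.4f}")
```

In this toy the true gradient is 0.25. The baselined estimate concentrates near 0.25 · (K−1)/K with a far smaller variance, while the unbaselined estimate is centered on 0.25 but swings widely from trial to trial.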
Interactive · K rollouts on the toy task
Below: a stand-in policy that has settled on a (deliberately imperfect) habit. Click Sample K trajectories. Each row is one rollout — same prompt, different sampled responses. The reward column is filled by the same verifier you'll meet in lesson 24.
The other thing the rollout records: old_logp
At each sampling step we already compute log-probabilities for the chosen token (we needed them to sample). Storing them costs essentially nothing. Why bother? Because the loss in lesson 25 needs the ratio
πθ(y) / πold(y)
where πθ is the current trainer's policy and πold is whatever policy actually produced the tokens. If the trainer takes a step between sampling and computing the loss, these two are different policies. Recomputing old_logp after the step gives you πθ(y) / πθ(y) = 1 exactly — silently destroying the entire purpose of the ratio.
A common bug is recomputing old_logp with the trainer's freshest weights when collating the batch. The ratio is then identically 1, the PPO clip never fires, and training looks "fine" on a loss curve, but the algorithm has silently degenerated to plain REINFORCE without clipping. The fix: capture old_logp during generation, store it on the trajectory, and never recompute it. See rollout.RolloutEngine.generate.
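The failure mode is easy to reproduce with plain numbers. The sketch below is illustrative only: the log-probabilities are made up, and `ppo_ratio` and `clipped_surrogate` are hypothetical helper names, not functions from the framework.

```python
import math


def ppo_ratio(new_logp, old_logp):
    """Per-token importance ratio πθ / πold, computed in log space."""
    return [math.exp(n - o) for n, o in zip(new_logp, old_logp)]


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one token (to be maximized)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Made-up numbers for three sampled tokens.
old_logp = [-0.50, -1.20, -0.30]   # captured at sampling time (π_old)
new_logp = [-0.20, -1.25, -0.30]   # trainer's policy after one update (π_θ)

good = ppo_ratio(new_logp, old_logp)    # stored old_logp: ratios move off 1
buggy = ppo_ratio(new_logp, new_logp)   # "recomputed" old_logp: always 1.0

print("stored    :", [round(r, 3) for r in good])
print("recomputed:", buggy)
print("clip on first token:", clipped_surrogate(good[0], advantage=1.0))
```

With the stored values the first token's ratio exceeds 1 + eps, so the clip engages. With the recomputed values every ratio is exactly 1 and the clip can never engage, whatever the advantages are.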
Where this maps in the code
Inside rl_framework/rollout.py, the generation kernel _sample_from_prompt does both jobs at once:
```python
# Body of the per-step sampling loop (step index t).
logits, _ = self.model(ids)                       # one forward pass; (K, T, V)
last  = logits[:, -1, :] / temp                   # (K, V)
logp  = F.log_softmax(last, dim=-1)               # (K, V)
probs = logp.exp()                                # (K, V)
nxt   = torch.multinomial(probs, 1).squeeze(-1)   # sample; (K,)

# This line is the whole reason rollout.py exists rather than reusing
# rl_sota/core.rollout_group: we store the log-prob of the *sampled*
# token, in the policy that produced it. That's π_old.
step_logp = logp.gather(-1, nxt.unsqueeze(-1)).squeeze(-1)  # (K,)

resp_ids[:, t]  = nxt
resp_mask[:, t] = (~done).float()
tok_logp[:, t]  = step_logp * resp_mask[:, t]
```
One subtlety: we sample from logp.exp() rather than from a separately computed softmax. That guarantees the sampling distribution and the scored distribution come from the same tensor and are bit-for-bit identical. Sampling from one distribution and scoring under another is a well-known bug; applying temperature scaling differently in the two paths is the most common cause.
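The temperature mismatch described above is simple to demonstrate without a model. This is a minimal sketch with made-up logits and a hand-written `log_softmax`; it is not taken from rollout.py.

```python
import math


def log_softmax(logits, temp=1.0):
    """Log-probabilities of softmax(logits / temp), computed stably."""
    scaled = [x / temp for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


logits = [2.0, 1.0, 0.0]   # made-up next-token logits
temp = 0.7

# Correct path: score and sample from the same tempered distribution.
logp = log_softmax(logits, temp)
probs = [math.exp(lp) for lp in logp]    # what the sampler draws from

# Buggy path: sample at temp=0.7 but score the token at temp=1.0.
mismatched_logp = log_softmax(logits, 1.0)
gap = [a - b for a, b in zip(logp, mismatched_logp)]

print("sampling probs :", [round(p, 3) for p in probs])
print("log-prob error :", [round(g, 3) for g in gap])
```

Any nonzero entry in `gap` means the stored old_logp would describe a different policy from the one that actually drew the token, so the ratio starts away from 1 before the trainer has taken a single step.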
Why this is its own role
A rollout engine has a completely different shape from a trainer:
| Aspect | Rollout engine | Trainer |
|---|---|---|
| Forward / backward | Forward only | Forward + backward |
| Memory dominator | KV cache | Activations + optimizer state |
| Optimal sharding | Tensor parallel for decode | FSDP / ZeRO-3 for grads |
| Quantization | bf16 / int8 OK | fp32 master + bf16 compute |
| Batch shape | Many short, streaming | One padded (B, T) |
| Optimized by | vLLM, SGLang, TRT-LLM | FSDP, DeepSpeed, Megatron |
In a single process they block each other: you never overlap them, one finishes and then the other starts. Splitting rollout into its own role with its own GPUs is the core economic reason production frameworks like verl, OpenRLHF, and SLIME exist.
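The rollout role's contract is small: (1) sample K responses per prompt from the current policy, (2) record old_logp for each sampled token, (3) attach the terminal reward. Everything else (KV caching, weight sharding, async) is performance engineering on top of this contract.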
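One way to picture that contract is as a small record type handed from the rollout role to the trainer. The sketch below is an assumption about shape only: `Trajectory`, `attach_reward`, and `toy_verifier` are hypothetical names, not the framework's actual classes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """What one rollout hands to the trainer (hypothetical layout)."""
    prompt_ids: List[int]
    resp_ids: List[int]       # sampled tokens
    old_logp: List[float]     # log-prob of each sampled token, at sampling time
    reward: float = 0.0       # terminal reward, filled in by the verifier

    def __post_init__(self):
        if len(self.resp_ids) != len(self.old_logp):
            raise ValueError("need exactly one old_logp per sampled token")


def attach_reward(traj: Trajectory,
                  verifier: Callable[[List[int]], float]) -> Trajectory:
    """Score the finished response once and store the result."""
    traj.reward = float(verifier(traj.resp_ids))
    return traj


def toy_verifier(resp_ids: List[int]) -> float:
    """Hypothetical verifier: reward 1.0 for one exact target response."""
    return 1.0 if resp_ids == [1, 2, 0] else 0.0


traj = attach_reward(
    Trajectory(prompt_ids=[7, 8, 9], resp_ids=[1, 2, 0],
               old_logp=[-0.3, -0.5, -0.1]),
    toy_verifier,
)
print(traj.reward)
```

Keeping old_logp on the record itself, next to the tokens it describes, is what makes "never recompute" easy to enforce: the trainer has no reason to touch the sampling policy again.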