Rollout — sampling and π_old

Autoregressive sampling, why we sample K per prompt, and why old_logp has to be captured at sampling time — not recomputed later.

Quick gloss

π_old = the policy weights that actually produced the tokens we sampled. After the trainer takes one step, π_old is no longer equal to the current trainer's policy π_θ — the gap between them is what lessons 25 (PPO ratio) and 6 (weight sync) are about.

What "rollout" means here

A rollout is the act of producing one response from the policy. Concretely: feed the prompt tokens to the LLM, look at the logits for the next token, sample one, append it, repeat. The full sequence of sampled tokens — usually until an end-of-turn token or a length cap — is one trajectory.

prompt:      <3+4+5>
sample t=0:  '1'              π_θ(·|prompt)       → multinomial → '1'
sample t=1:  '2'              π_θ(·|prompt,'1')   → multinomial → '2'
sample t=2:  '#' (EOS)        π_θ(·|prompt,'12')  → multinomial → '#'
response:    12#

Three observations about that snippet that turn into design pressure on the framework:

The forward pass at each step is purely inference. No gradients flow. No optimizer state. We just need next-token logits as fast as possible.
The dominant cost is the KV cache. A naive implementation re-encodes the whole prefix at every step, which costs O(T²) attention work per step for a prefix of length T. Production systems cache the keys and values of past tokens and pay O(T) per step. This is what vLLM, SGLang, and TensorRT-LLM exist to do well.
The reward is terminal. The verifier reads the whole final response. There's no per-token reward to credit-assign with; that responsibility lands on the algorithm in lesson 25.
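
The sample-append-repeat loop described above can be sketched in plain Python. This is a toy under stated assumptions: toy_logits is a hypothetical stand-in for the LLM forward pass, the four-symbol vocabulary is made up, and the prompt is left implicit. It is not the rl_framework code.

```python
import math
import random

VOCAB = ["1", "2", "3", "#"]   # hypothetical toy vocabulary; '#' plays the role of EOS
EOS = "#"

def toy_logits(response_so_far):
    """Hypothetical stand-in for the LLM forward pass: next-token logits."""
    # Favour '1', then '2', then EOS, so most samples read "12#".
    target = {0: "1", 1: "2"}.get(len(response_so_far), EOS)
    return [3.0 if tok == target else 0.0 for tok in VOCAB]

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def rollout(rng, max_len=8, temperature=1.0):
    """Sample one trajectory; return the tokens and the log-prob of each sampled token."""
    response, logps = [], []
    for _ in range(max_len):                       # length cap
        logp = log_softmax([x / temperature for x in toy_logits(response)])
        probs = [math.exp(lp) for lp in logp]
        idx = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        response.append(VOCAB[idx])
        logps.append(logp[idx])                    # log-prob under the sampling policy
        if VOCAB[idx] == EOS:                      # end-of-turn token ends the rollout
            break
    return response, logps

tokens, logps = rollout(random.Random(0))
print("".join(tokens), [round(lp, 3) for lp in logps])
```

Nothing in the loop needs gradients or optimizer state, and the only thing returned besides the tokens is the per-token log-probability.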

Why we sample K — variance reduction

Recall the policy gradient: ∇J ≈ r(y) · ∇ log π_θ(y) with y ∼ π_θ. One sample is a high-variance estimate. With K samples per prompt we get K independent estimates of the gradient, which we average, and, crucially, we can compute a baseline: the mean reward in the group. Subtracting that mean usually cuts the variance substantially without biasing the direction of the gradient, because a baseline b that does not depend on the sampled response satisfies E[b · ∇ log π_θ(y)] = b · ∇ Σ_y π_θ(y) = 0. (Strictly, the group mean includes the sample's own reward; this only rescales the expected gradient by (K-1)/K.)
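
The zero-bias claim can be checked exactly on a tiny example. The sketch below assumes a softmax policy over three possible responses (a made-up setup, chosen so every response can be enumerated) and computes the expected gradient with and without a constant baseline.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score(probs, y):
    """Gradient of log pi(y) with respect to the logits of a softmax policy."""
    return [(1.0 if j == y else 0.0) - p for j, p in enumerate(probs)]

def expected_gradient(probs, rewards, baseline):
    """Exact E_{y~pi}[(r(y) - baseline) * grad log pi(y)], enumerating every y."""
    grad = [0.0] * len(probs)
    for y, p in enumerate(probs):
        s = score(probs, y)
        grad = [g + p * (rewards[y] - baseline) * sj for g, sj in zip(grad, s)]
    return grad

probs = softmax([2.0, 0.5, 0.0])   # hypothetical policy over three responses
rewards = [1.0, 0.0, 0.0]          # only the first response is correct

no_baseline = expected_gradient(probs, rewards, baseline=0.0)
with_baseline = expected_gradient(probs, rewards, baseline=0.6)   # any constant works
print(no_baseline)
print(with_baseline)
```

The two printed vectors agree up to floating-point rounding. The baseline changes each individual sample's contribution, but not the expectation.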

This is the core of GRPO

GRPO is "REINFORCE with the group mean as a baseline and a per-group standard-deviation normalizer." The group in GRPO is the K rollouts you took from one prompt. That's why every modern reasoning-RL recipe samples in groups.
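
As a minimal sketch of that definition, here are group-normalized advantages for one prompt. The eps term and the use of the population standard deviation are assumptions made for illustration; real implementations differ on both.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Advantage of each rollout: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)      # population std over the K rollouts
    return [(r - mean) / (std + eps) for r in rewards]

rewards = [1.0, 0.0, 1.0, 1.0]            # K = 4 rollouts of the same prompt
print(group_advantages(rewards))
```

Correct rollouts get a positive advantage and the wrong one a negative advantage. The advantages sum to zero within the group because the mean has been subtracted.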

Interactive · K rollouts on the toy task

Below: a stand-in policy that has settled on a (deliberately imperfect) habit. Click Sample K trajectories. Each row is one rollout — same prompt, different sampled responses. The reward column is filled by the same verifier you'll meet in lesson 24.

K trajectories from one prompt

Prompt is fixed at <3+4+5> (gold=12). The policy mostly gets it right but sometimes emits a wrong digit — exactly the noise that RL is supposed to clean up.

Degenerate group

If every rollout in the group got the same reward, std=0 and every advantage will be zero — meaning the gradient is zero. There is nothing to learn from this group; GRPO.compute_advantages marks it and the controller skips the optimizer step. You will see this happen at low temperatures with K=2 here.
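
A hedged sketch of that check follows. The names is_degenerate and select_trainable are hypothetical helpers, not the actual GRPO.compute_advantages code, and degenerate_probability assumes independent rollouts with a 0/1 reward.

```python
import statistics

def is_degenerate(rewards, tol=1e-8):
    """True when all rewards in the group are (numerically) identical."""
    return statistics.pstdev(rewards) <= tol

def select_trainable(groups):
    """Keep only the groups that carry a learning signal; count the rest."""
    kept = [g for g in groups if not is_degenerate(g)]
    return kept, len(groups) - len(kept)

def degenerate_probability(p_correct, k):
    """Chance that K independent 0/1-reward rollouts all agree."""
    return p_correct ** k + (1.0 - p_correct) ** k

groups = [
    [1.0, 1.0, 1.0, 1.0],   # all correct: every advantage is zero
    [0.0, 0.0],             # all wrong: same problem
    [1.0, 0.0, 1.0, 1.0],   # mixed: usable
]
kept, skipped = select_trainable(groups)
print(kept, skipped)
print(degenerate_probability(0.9, 2), degenerate_probability(0.9, 8))
```

The last line shows why small K and a confident policy make degenerate groups common: the more often the rollouts agree, the more often the group carries no signal.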

The other thing the rollout records: `old_logp`

At each sampling step we already compute log-probabilities for the chosen token (we needed them to sample). Storing them costs essentially nothing. Why bother? Because the loss in lesson 25 needs the ratio

ρ_t = π_θ(y_t | y_<t, x) / π_old(y_t | y_<t, x)

where π_θ is the current trainer's policy and π_old is whatever policy actually produced the tokens. If the trainer takes a step between sampling and computing the loss, these two are different policies. Recomputing old_logp after the step gives you π_θ(y) / π_θ(y) = 1 exactly — silently destroying the entire purpose of the ratio.
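
The effect is easy to see numerically. The log-probabilities below are made-up values for a three-token response, chosen only to illustrate the arithmetic.

```python
import math

def ratios(new_logp, old_logp):
    """Per-token importance ratio rho_t = exp(new_logp_t - old_logp_t)."""
    return [math.exp(n - o) for n, o in zip(new_logp, old_logp)]

# Hypothetical per-token log-probs for one three-token response.
old_logp_at_sampling = [-0.20, -0.35, -0.10]   # captured while generating
logp_after_step      = [-0.15, -0.50, -0.10]   # same tokens, scored after a trainer step

print(ratios(logp_after_step, old_logp_at_sampling))   # genuine ratios
print(ratios(logp_after_step, logp_after_step))        # "recomputed" old_logp
```

The first line has ratios above and below 1, which is the signal the clipped loss works with. The second line is all ones, regardless of how far the policy has moved.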

Production bug, observed in the wild

Naive RL implementations re-compute old_logp with the trainer's freshest weights when collating the batch. The ratio is then identically 1, the PPO clip never fires, and training looks "fine" on a loss curve — but the algorithm has silently degenerated to plain REINFORCE without clipping. The fix: capture old_logp during generation, store it on the trajectory, never recompute. See rollout.RolloutEngine.generate.
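
One way to make the fix hard to undo is to store old_logp on an immutable record at generation time, and to monitor how often the clip fires. The sketch below is illustrative only: Trajectory, its field names, and clip_fraction are hypothetical and do not describe rollout.RolloutEngine.generate.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one rollout (field names are illustrative)."""
    prompt_ids: Tuple[int, ...]
    response_ids: Tuple[int, ...]
    old_logp: Tuple[float, ...]       # captured at sampling time, one per response token
    reward: Optional[float] = None    # terminal reward from the verifier

    def __post_init__(self):
        if len(self.old_logp) != len(self.response_ids):
            raise ValueError("need exactly one old_logp per sampled token")

def clip_fraction(traj, new_logp, clip_eps=0.2):
    """Share of tokens whose ratio leaves [1 - clip_eps, 1 + clip_eps]."""
    clipped = 0
    for new, old in zip(new_logp, traj.old_logp):
        if abs(math.exp(new - old) - 1.0) > clip_eps:
            clipped += 1
    return clipped / len(traj.old_logp)

traj = Trajectory(prompt_ids=(5, 6, 7), response_ids=(1, 2, 9),
                  old_logp=(-0.20, -0.35, -0.10))
print(clip_fraction(traj, new_logp=(-0.15, -0.90, -0.10)))   # using the stored old_logp
print(clip_fraction(traj, new_logp=traj.old_logp))            # what the bug looks like
```

Because the dataclass is frozen, the reward would be attached with dataclasses.replace, which builds a new record instead of mutating the old one. A clip fraction that stays at exactly zero across many updates is a cheap hint that old_logp is being recomputed somewhere.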

Where this maps in the code

Inside rl_framework/rollout.py, the generation kernel _sample_from_prompt does both jobs at once:

logits, _ = self.model(ids)              # one forward pass; (K, T, V)
last = logits[:, -1, :] / temp            # (K, V)
logp = F.log_softmax(last, dim=-1)        # (K, V)
probs = logp.exp()                        # (K, V)
nxt   = torch.multinomial(probs, 1).squeeze(-1)   # sample;  (K,)

# This line is the whole reason rollout.py exists rather than reusing
# rl_sota/core.rollout_group: we store the log-prob of the *sampled*
# token, in the policy that produced it. That's π_old.
step_logp = logp.gather(-1, nxt.unsqueeze(-1)).squeeze(-1)   # (K,)

resp_ids [:, t] = nxt
resp_mask[:, t] = (~done).float()
tok_logp [:, t] = step_logp * resp_mask[:, t]

One subtlety: we sample from logp.exp() rather than from a separately computed softmax. That guarantees the sampling distribution and the scored distribution come from the same tensor, so they are bit-for-bit identical. Sampling from one distribution and scoring under another is a well-known bug; temperature scaling applied differently in the two paths is the most common cause.
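
To see how much a temperature mismatch alone can distort things, here is a small sketch with made-up logits. The sampler uses temperature 0.7 and the scorer uses 1.0, with identical weights in both paths.

```python
import math

def log_softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]

logits = [2.0, 1.0, 0.0]    # hypothetical next-token logits
sampled = 1                 # index of the token that was drawn

logp_sampler = log_softmax(logits, temperature=0.7)[sampled]   # distribution sampled from
logp_scorer = log_softmax(logits, temperature=1.0)[sampled]    # distribution used for scoring

# With identical weights the ratio should be 1; the mismatch alone moves it.
mismatch_ratio = math.exp(logp_scorer - logp_sampler)
print(mismatch_ratio)
```

The printed ratio differs from 1 even though no training step happened, so the loss would treat a bookkeeping error as a policy change.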

Why this is its own role

A rollout engine has a completely different shape from a trainer:

Aspect	Rollout engine	Trainer
Forward / backward	Forward only	Forward + backward
Memory dominator	KV cache	Activations + optimizer state
Optimal sharding	Tensor parallel for decode	FSDP / ZeRO-3 for grads
Quantization	bf16 / int8 OK	fp32 master + bf16 compute
Batch shape	Many short, streaming	One padded (B, T)
Optimized by	vLLM, SGLang, TRT-LLM	FSDP, DeepSpeed, Megatron

In a single process they block each other: the two phases cannot overlap, so one must finish before the other starts. Splitting rollout into its own role with its own GPUs is the main economic reason production frameworks like verl, OpenRLHF, and SLIME exist.

Takeaway

Rollout's three jobs: (1) sample K trajectories per prompt, (2) record old_logp for each sampled token, (3) attach the terminal reward. Everything else — KV caching, weight sharding, async — is performance engineering on top of this contract.