REINFORCE: the starting point (Williams, 1992)
Everything from here on is a variance-reduction patch on one line of code. Before you can read the patches, you have to read the line.
The policy-gradient theorem, one more time
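From lesson 1: we want to maximize $J(\theta) = \mathbb{E}_{x,\, y \sim \pi_\theta}[R(x, y)]$. The log-derivative trick gives

$$\nabla_\theta J(\theta) = \mathbb{E}_{x,\, y \sim \pi_\theta}\big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big].$$

For a language model the response is autoregressive: $\log \pi_\theta(y \mid x) = \sum_t \log \pi_\theta(y_t \mid x, y_{<t})$. Drop that into the gradient and the REINFORCE estimator on a single rollout $y \sim \pi_\theta(\cdot \mid x)$ is

$$\hat{g} = R(x, y) \sum_t \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}),$$

and the loss whose gradient is $-\hat{g}$ (with the reward treated as a constant) is just

$$\mathcal{L}(\theta) = -R(x, y) \sum_t \log \pi_\theta(y_t \mid x, y_{<t}).$$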
That is REINFORCE. One scalar reward, multiplied onto the sum of per-token log-probs, with a minus sign so we can use a minimization optimizer.
```python
# From RL/algorithms/00_reinforce.py: reinforce_step
seq_logp = (pol_logp * target_mask).sum(dim=-1)    # (K,)  Σ_t log π(y_{i,t} | …)
pg_loss = -(rewards.detach() * seq_logp).mean()    # −(1/K) Σ_i R_i · seq_logp_i
# plus β · KL anchor; same as every other algorithm in the folder.
```
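To make the estimator concrete, here is a minimal sketch on a toy problem, not the code from `00_reinforce.py`. It assumes a one-token "response" drawn from a softmax over three logits, so the sum over t has a single term. The function names and the reward vector are hypothetical. The sketch compares the exact policy gradient with the averaged REINFORCE estimate R · ∇log π.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, y):
    # d/d logit_j of log softmax(logits)[y] is 1[j == y] - probs[j].
    return [(1.0 if j == y else 0.0) - p for j, p in enumerate(probs)]


def expected_reward(logits, rewards):
    # J(theta) for the toy policy: sum_y pi(y) * R(y).
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def exact_gradient(logits, rewards):
    # Policy-gradient theorem: grad J = sum_y pi(y) * R(y) * grad log pi(y).
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, (p, r) in enumerate(zip(probs, rewards)):
        g = grad_log_prob(probs, y)
        for j in range(len(logits)):
            grad[j] += p * r * g[j]
    return grad


def reinforce_estimate(logits, rewards, num_rollouts, rng):
    # Average of R(y) * grad log pi(y) over sampled rollouts y ~ pi.
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_rollouts):
        y = rng.choices(range(len(logits)), weights=probs)[0]
        g = grad_log_prob(probs, y)
        for j in range(len(logits)):
            grad[j] += rewards[y] * g[j] / num_rollouts
    return grad


# Hypothetical toy setup: only token 0 is "correct".
logits = [0.2, -0.5, 1.0]
rewards = [1.0, 0.0, 0.0]

exact = exact_gradient(logits, rewards)
estimate = reinforce_estimate(logits, rewards, 20000, random.Random(0))
print("exact   :", exact)
print("estimate:", estimate)
```

With many rollouts the two vectors should agree closely; with a handful they generally will not, which is the noise the next section is about.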
Why it's noisy — the structural variance problem
Two compounding sources of variance. Both are why nobody trains a frontier model with raw REINFORCE.
- R is multiplied onto every per-token log-prob. For a binary verifier at, say, ~50% success rate, half your rollouts contribute +1 · ∇log π and half contribute 0 · ∇log π — the worst-case Bernoulli variance. The gradient is "do more of what you just did, weighted by whether you got lucky" — which mixes reinforcement of good behaviors with reinforcement of any behavior that happened to score.
- There's no baseline. Any function of the prompt b(x) can be subtracted from R without biasing the estimator (we'll prove this in lesson 11). REINFORCE picks b(x) = 0, which is far from the variance-minimizing choice whenever the expected reward is well above zero. The right baseline (close to 𝔼y[R]) cuts variance dramatically.
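The effect of the baseline can be checked exactly on a small example. The sketch below assumes a hypothetical two-action policy in which action 1 is chosen with probability `p_success` and earns reward 1, while action 0 earns reward 0. With a single logit θ and p = sigmoid(θ), the score function is ∇log π(a) = a − p, so the single-rollout estimate is (R − b)(a − p). The function name and setup are illustrative only.

```python
def estimator_moments(p_success, baseline):
    """Exact mean and variance of (R - b) * grad log pi(a) for the toy policy."""
    # Enumerate both outcomes as (probability, action); the reward equals the action.
    outcomes = [(p_success, 1.0), (1.0 - p_success, 0.0)]
    samples = [(prob, (a - baseline) * (a - p_success)) for prob, a in outcomes]
    mean = sum(prob * g for prob, g in samples)
    var = sum(prob * (g - mean) ** 2 for prob, g in samples)
    return mean, var


# Binary verifier at a 50% success rate, as in the first bullet.
for b in (0.0, 0.5, -1.0):
    mean, var = estimator_moments(0.5, b)
    print(f"baseline={b:+.1f}  mean={mean:.4f}  variance={var:.4f}")
```

In this toy case the mean is the same for every baseline, which is the unbiasedness claim, while the variance depends strongly on b and vanishes when b equals the expected reward. That is a property of this two-outcome example; real rollouts keep residual variance even with a good baseline.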
Interactive · signal-to-noise vs. K
The simplest variance reduction available to REINFORCE: average over K rollouts before stepping. The expectation is unchanged; for independent rollouts the variance shrinks as 1/K. As K grows, the direction of the gradient estimate stabilizes around the "true" gradient, the direction the policy should move toward; each K-rollout estimate is what one batch actually sees.
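The 1/K scaling can be reproduced with a short simulation. This is a sketch under stated assumptions: a hypothetical one-logit policy that succeeds with probability 0.5, reward equal to the sampled action, no baseline, and independent rollouts. The helper names are made up for illustration.

```python
import random
import statistics


def single_rollout_grad(p_success, rng):
    # One REINFORCE sample with b = 0: R * grad log pi(a), where grad log pi(a) = a - p.
    a = 1.0 if rng.random() < p_success else 0.0
    reward = a
    return reward * (a - p_success)


def batch_grad(p_success, k, rng):
    # Average K independent single-rollout estimates before "stepping".
    return sum(single_rollout_grad(p_success, rng) for _ in range(k)) / k


def variance_at_k(p_success, k, trials, rng):
    # Empirical variance of the K-rollout estimate across many batches.
    estimates = [batch_grad(p_success, k, rng) for _ in range(trials)]
    return statistics.pvariance(estimates)


rng = random.Random(0)
variances = {k: variance_at_k(0.5, k, 4000, rng) for k in (1, 4, 16, 64)}
for k, v in variances.items():
    print(f"K={k:3d}  empirical variance={v:.5f}  K * variance={k * v:.4f}")
```

If the variance really falls as 1/K, the last column stays roughly constant across rows (near the single-rollout variance of this toy estimator).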
The KL anchor is non-negotiable
Even at K=∞, REINFORCE on a verifiable task has a degenerate failure mode: the policy can lock onto an output that only sometimes parses to the right answer by luck, reward-hacking the verifier. The fix is the KL term we've already met:

$$\mathcal{L}(\theta) = -R(x, y) \sum_t \log \pi_\theta(y_t \mid x, y_{<t}) + \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big),$$

where $\pi_{\text{ref}}$ is the SFT checkpoint. Without it, an unregularized REINFORCE will happily collapse onto "always emit token X" if X is a digit that occasionally parses correctly. We touched on this in lesson 3; the takeaway is that the anchor is part of every algorithm in this folder, not an algorithm-specific feature.
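A small numeric sketch shows how the anchor changes which policy the loss prefers. It rests on several assumptions that are not taken from `00_reinforce.py`: a categorical policy over four tokens, a uniform reference policy, the penalty written as KL(π ‖ π_ref), and the exact expected reward in place of a sampled one. Token 0 plays the role of the "token X" that parses correctly 30% of the time.

```python
import math


def kl_divergence(p, q):
    # KL(p || q) for categorical distributions; terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def anchored_loss(policy, reference, rewards, beta):
    # Negative expected reward plus the beta-weighted KL penalty to the reference.
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return -expected_reward + beta * kl_divergence(policy, reference)


# Hypothetical numbers chosen for illustration only.
reference = [0.25, 0.25, 0.25, 0.25]
collapsed = [0.97, 0.01, 0.01, 0.01]   # "always emit token X"
rewards = [0.3, 0.0, 0.0, 0.0]         # token X scores 30% of the time

for beta in (0.0, 1.0):
    stay = anchored_loss(reference, reference, rewards, beta)
    collapse = anchored_loss(collapsed, reference, rewards, beta)
    print(f"beta={beta}: loss(reference)={stay:.3f}  loss(collapsed)={collapse:.3f}")
```

With β = 0 the collapsed policy has the lower loss, so nothing stops the collapse. With a large enough β the KL penalty outweighs the small reward gain and staying near the reference wins. How large β must be depends on the task; the values here are arbitrary.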
What REINFORCE doesn't have, in one table
Every line in this table is something one of the upcoming algorithms adds. Read it as a roadmap.
| Feature | REINFORCE | Adds it in lesson… |
|---|---|---|
| State-only baseline (cuts variance) | — | 10 (PPO learns one), 11 (GRPO uses group mean), 12 (RLOO uses leave-one-out) |
| Off-policy correction (re-use rollouts) | — | 10 (importance ratio) |
| Trust region (cap step size) | — | 10 (PPO clip) |
| Asymmetric clip (prevent entropy collapse) | — | 13 (DAPO clip-higher) |
| Filter zero-variance groups | — | 13 (DAPO dynamic sampling) |
| Token-level loss aggregation | — | 13 (DAPO) and 14 (Dr.GRPO) |
| Length-bias correction | — | 14 (Dr.GRPO drops the 1/\|y\| normalization) |
00_reinforce.py's output. It's also noisy. Every algorithm in the next five lessons is a different answer to "where does the variance come from, and how do we kill it?".