
REINFORCE: the starting point (Williams, 1992)

Everything from here on is a variance-reduction patch on one line of code. Before you can read the patches, you have to read the line.

Where we are in the series
Lessons 01–08 covered the framework (the six roles around the loss). Lessons 09–14 zoom into the loss itself — the half-dozen algorithms that have actually trained today's frontier models. Same task, same prompt, only the gradient estimator changes.

The policy-gradient theorem, one more time

From lesson 1: we want to maximize $J(\theta) = \mathbb{E}_{x,\, y \sim \pi_\theta}[R(x, y)]$. The log-derivative trick gives

$$\nabla_\theta J(\theta) = \mathbb{E}_{x,\, y \sim \pi_\theta}\big[\, R(x, y) \cdot \nabla_\theta \log \pi_\theta(y \mid x) \,\big]$$
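The identity above can be checked numerically on a toy problem. The sketch below is an illustration, not code from the lesson repo: it assumes a single-step softmax policy over three actions with hypothetical logits and rewards, computes the score-function expectation exactly by enumerating actions, and compares it with a finite-difference gradient of J.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    """J(theta) = sum_y pi(y) * R(y) for a one-step softmax policy."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def score_function_gradient(logits, rewards):
    """E_y[ R(y) * grad_theta log pi(y) ], computed exactly by enumerating y."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for y in range(n):
        for j in range(n):
            # d/d theta_j of log softmax(theta)[y] is 1[y == j] - pi_j
            dlogp = (1.0 if y == j else 0.0) - probs[j]
            grad[j] += probs[y] * rewards[y] * dlogp
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central-difference estimate of grad_theta J(theta)."""
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[j] += eps
        down[j] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical toy values, chosen only for illustration.
logits = [0.2, -0.5, 1.0]
rewards = [1.0, 0.0, 0.5]

analytic = score_function_gradient(logits, rewards)
numeric = finite_difference_gradient(logits, rewards)
for a, d in zip(analytic, numeric):
    print(f"score-function: {a:+.6f}   finite-difference: {d:+.6f}")
```

The two columns should agree to within finite-difference error, which is what the log-derivative trick promises: the gradient of an expectation can be rewritten as an expectation you can sample from.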

For a language model the response is autoregressive: $\log \pi_\theta(y \mid x) = \sum_t \log \pi_\theta(y_t \mid x, y_{<t})$. Drop that into the gradient and the REINFORCE estimator on a single rollout $y \sim \pi_\theta(\cdot \mid x)$ is

$$\hat{g} = R(x, y) \cdot \sum_t \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})$$

and the loss whose gradient is −ĝ is just

$$L_{\text{REINFORCE}} = -\,R \cdot \sum_t \log \pi_\theta(y_t \mid x, y_{<t})$$

That is REINFORCE. One scalar reward, multiplied onto the sum of per-token log-probs, with a minus sign so we can use a minimization optimizer.

# From RL/algorithms/00_reinforce.py — reinforce_step
seq_logp = (pol_logp * target_mask).sum(dim=-1)     # (K,) Σ_t log π(y_{i,t} | …)
pg_loss  = -(rewards.detach() * seq_logp).mean()    # − (1/K) Σ_i R_i · seq_logp_i
# plus β · KL anchor; same as every other algorithm in the folder.
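To make the shapes in that excerpt concrete, here is a dependency-free sketch of the same arithmetic on plain Python lists. It reuses the names `pol_logp`, `target_mask`, and `rewards` from the excerpt, but the (K, T) layout, the toy numbers, and the helper `reinforce_loss` are assumptions made for illustration. With plain floats there is no autograd, so the `.detach()` in the original has no counterpart here.

```python
def reinforce_loss(pol_logp, target_mask, rewards):
    """Return (pg_loss, seq_logp) for K rollouts of T token positions each.

    pol_logp[i][t]    : log-prob the policy assigned to token t of rollout i
    target_mask[i][t] : 1 for response tokens, 0 for prompt or padding positions
    rewards[i]        : scalar reward for rollout i
    """
    # Masked sum over the token axis: one sequence log-prob per rollout.
    seq_logp = [
        sum(lp * m for lp, m in zip(row, mask))
        for row, mask in zip(pol_logp, target_mask)
    ]
    # Negative reward-weighted mean over the K rollouts.
    pg_loss = -sum(r * s for r, s in zip(rewards, seq_logp)) / len(rewards)
    return pg_loss, seq_logp

# Hypothetical toy batch: K = 2 rollouts, T = 4 positions.
pol_logp = [
    [-0.1, -0.2, -0.3, -0.4],
    [-1.0, -0.5, -0.25, -0.25],
]
target_mask = [
    [0, 1, 1, 1],  # position 0 is prompt
    [0, 1, 1, 0],  # position 0 is prompt, position 3 is padding
]
rewards = [1.0, 0.0]  # binary verifier: first rollout passed, second failed

pg_loss, seq_logp = reinforce_loss(pol_logp, target_mask, rewards)
print("seq_logp:", seq_logp)
print("pg_loss:", pg_loss)
```

Note that the failed rollout contributes nothing to the loss: its sequence log-prob is multiplied by a reward of zero, so only the successful rollout produces a gradient signal.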

Why it's noisy — the structural variance problem

Two compounding sources of variance. Both are why nobody trains a frontier model with raw REINFORCE.

  1. R is multiplied onto every per-token log-prob. For a binary verifier at, say, ~50% success rate, half your rollouts contribute +1 · ∇log π and half contribute 0 · ∇log π — the worst-case Bernoulli variance. The gradient is "do more of what you just did, weighted by whether you got lucky" — which mixes reinforcement of good behaviors with reinforcement of any behavior that happened to score.
  2. There's no baseline. Any function of the prompt $b(x)$ can be subtracted from R without biasing the estimator (we'll prove this in lesson 11). REINFORCE picks $b(x) = 0$, which is rarely close to the variance-minimizing choice: with a 0/1 verifier, every successful rollout is pushed up at full weight no matter how easy the prompt was. A baseline close to $\mathbb{E}_y[R]$ cuts variance substantially.
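Both points can be seen in a toy calculation. The sketch below is an assumption-laden illustration, not the lesson's code: it uses a one-step softmax policy over four equally likely actions, two of which earn reward 1 (a 50% success rate), and computes the estimator's mean and total variance exactly by enumeration, once with no baseline and once with the mean reward as baseline.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_stats(logits, rewards, baseline):
    """Exact mean and total variance of g(y) = (R(y) - b) * grad log pi(y).

    Total variance is the sum of per-coordinate variances:
    E[||g||^2] - ||E[g]||^2.
    """
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = 0.0
    for y in range(n):
        # Score of a softmax policy: 1[y == j] - pi_j for each logit j.
        g = [
            (rewards[y] - baseline) * ((1.0 if y == j else 0.0) - probs[j])
            for j in range(n)
        ]
        for j in range(n):
            mean[j] += probs[y] * g[j]
        second_moment += probs[y] * sum(c * c for c in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

# Hypothetical toy setup: uniform policy, binary reward, 50% success rate.
logits = [0.0, 0.0, 0.0, 0.0]
rewards = [1.0, 1.0, 0.0, 0.0]
mean_reward = sum(p * r for p, r in zip(softmax(logits), rewards))

mean_0, var_0 = estimator_stats(logits, rewards, baseline=0.0)
mean_b, var_b = estimator_stats(logits, rewards, baseline=mean_reward)

print("mean without baseline:", mean_0)
print("mean with baseline:   ", mean_b)
print("variance without baseline:", var_0)
print("variance with baseline:   ", var_b)
```

The means match, which is the unbiasedness claim, while the variance drops once the mean reward is subtracted. How large the reduction is in a real language-model run depends on the task and policy; this toy only shows the direction of the effect.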

Interactive · signal-to-noise vs. K

The simplest variance reduction available to REINFORCE: average over K rollouts before stepping. The expectation is unchanged; for independent rollouts the variance shrinks as 1/K. Below, watch how the gradient estimator's direction stabilizes as K grows. The "true" gradient is the direction the policy should move toward; each K-rollout estimate is what one batch actually sees.

[Interactive figure: gradient estimator noise vs. K]
Each click samples K rollouts and shows the resulting REINFORCE gradient vector (orange) against the underlying reward direction it should align with (blue). At K=1 the estimate swings wildly; at K=16 it concentrates near the truth. (Toy: the blue arrow is the parameter direction the reward function favors, not literally the expected gradient, but the estimator is consistent with it.)
Readouts: latest |ĝ − ∇J|, average over 50 trials, theoretical 1/√K.

The estimator's standard error scales as σ / √K — quadrupling K halves the noise. That's the entire "average more rollouts" argument; you can't do better than this without changing the estimator itself, which is what every algorithm in the rest of this folder does.
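The 1/√K scaling is easy to reproduce offline. The following Monte Carlo sketch is an illustration under stated assumptions (a uniform four-action softmax policy, binary rewards, a fixed seed, 4000 trials per K), not the code behind the interactive figure. It averages K single-rollout REINFORCE gradients and measures the root-mean-square distance to the exact gradient.

```python
import math
import random

def sample_gradient(rng, probs, rewards):
    """One-rollout REINFORCE gradient R(y) * grad log pi(y) for a softmax policy."""
    u = rng.random()
    cumulative = 0.0
    y = len(probs) - 1  # fallback guards against floating-point round-off
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            y = i
            break
    return [rewards[y] * ((1.0 if y == j else 0.0) - probs[j])
            for j in range(len(probs))]

def rms_error(k, trials, rng, probs, rewards, true_grad):
    """RMS of ||g_hat - grad J|| when g_hat averages k independent rollouts."""
    n = len(probs)
    total_sq = 0.0
    for _ in range(trials):
        avg = [0.0] * n
        for _ in range(k):
            g = sample_gradient(rng, probs, rewards)
            for j in range(n):
                avg[j] += g[j] / k
        total_sq += sum((a - t) ** 2 for a, t in zip(avg, true_grad))
    return math.sqrt(total_sq / trials)

# Hypothetical toy setup: uniform policy over 4 actions, 50% success rate.
probs = [0.25, 0.25, 0.25, 0.25]
rewards = [1.0, 1.0, 0.0, 0.0]
mean_reward = sum(p * r for p, r in zip(probs, rewards))
# Exact softmax policy gradient: dJ/d theta_j = pi_j * (R_j - J).
true_grad = [p * (r - mean_reward) for p, r in zip(probs, rewards)]

rng = random.Random(0)
errors = {k: rms_error(k, 4000, rng, probs, rewards, true_grad)
          for k in (1, 4, 16)}
for k, err in errors.items():
    print(f"K={k:2d}  rms error={err:.4f}  ratio to K=1: {err / errors[1]:.3f}")
```

Going from K=1 to K=4 should roughly halve the error, and K=16 should roughly quarter it. Sixteen times the compute for a 4x reduction in noise is the cost that motivates changing the estimator rather than just enlarging the batch.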

The KL anchor is non-negotiable

Even at K=∞, REINFORCE on a verifiable task has a degenerate failure mode: the policy can lock onto an output that sometimes parses to the right answer by luck and reward-hack the verifier. The fix is the KL term we've already met:

$$L = L_{\text{REINFORCE}} + \beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$$

where $\pi_{\text{ref}}$ is the SFT checkpoint. Without it, an unregularized REINFORCE will happily collapse onto "always emit token X" if X is a digit that occasionally parses correctly. We touched this in lesson 3; the takeaway is that the anchor is part of every algorithm in this folder, not an algorithm-specific feature.
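A small sketch shows how the penalty reacts to collapse. It is a simplification under stated assumptions: it computes the exact KL between two categorical distributions over a single next token, whereas sequence-level training typically estimates the KL from sampled tokens, and how 00_reinforce.py does that is not shown here. The distributions, the `pg_loss` value, and `beta = 0.1` are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Exact KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def anchored_loss(pg_loss, policy_probs, ref_probs, beta):
    """Policy-gradient loss plus beta times the KL to the reference policy."""
    return pg_loss + beta * kl_divergence(policy_probs, ref_probs)

# Hypothetical next-token distributions over a 4-token vocabulary.
ref = [0.25, 0.25, 0.25, 0.25]             # reference (SFT) policy
mild = [0.40, 0.30, 0.20, 0.10]            # policy that drifted a little
collapsed = [0.997, 0.001, 0.001, 0.001]   # "always emit token X"

beta = 0.1      # hypothetical coefficient
pg_loss = 0.45  # hypothetical policy-gradient loss, held fixed for comparison

kl_mild = kl_divergence(mild, ref)
kl_collapsed = kl_divergence(collapsed, ref)
print("KL(ref || ref):      ", kl_divergence(ref, ref))
print("KL(mild || ref):     ", kl_mild)
print("KL(collapsed || ref):", kl_collapsed)
print("anchored loss, mild:     ", anchored_loss(pg_loss, mild, ref, beta))
print("anchored loss, collapsed:", anchored_loss(pg_loss, collapsed, ref, beta))
```

The penalty is zero when the policy equals the reference and grows as probability mass concentrates on one token, so a collapsed policy pays more for the same policy-gradient loss.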

What REINFORCE doesn't have, in one table

Every line in this table is something one of the upcoming algorithms adds. Read it as a roadmap.

| Feature | REINFORCE | Added in lesson |
| --- | --- | --- |
| State-only baseline (cuts variance) | no | 10 (PPO learns one), 11 (GRPO uses group mean), 12 (RLOO uses leave-one-out) |
| Off-policy correction (re-use rollouts) | no | 10 (importance ratio) |
| Trust region (cap step size) | no | 10 (PPO clip) |
| Asymmetric clip (prevent entropy collapse) | no | 13 (DAPO clip-higher) |
| Filter zero-variance groups | no | 13 (DAPO dynamic sampling) |
| Token-level loss aggregation | no | 13 (DAPO) and 14 (Dr.GRPO) |
| Length-bias correction | no | 14 (Dr.GRPO drops /\|y\|) |

Takeaway
REINFORCE is one line: multiply log-prob by reward, add a KL anchor. It works: see the reward EMA climb in 00_reinforce.py's output. It's also noisy. Every algorithm in the next five lessons is a different answer to "where does the variance come from, and how do we kill it?"