REINFORCE: the starting point (Williams, 1992)

Everything from here on is a variance-reduction patch on one line of code. Before you can read the patches, you have to read the line.

Where we are in the series

Lessons 01–08 covered the framework (the six roles around the loss). Lessons 09–14 zoom into the loss itself — the half-dozen algorithms that have actually trained today's frontier models. Same task, same prompt, only the gradient estimator changes.

The policy-gradient theorem, one more time

From lesson 1: we want to maximize J(θ) = 𝔼_{x, y ∼ π_θ}[R(x, y)]. The log-derivative trick gives

∇_θ J(θ) = 𝔼_{x, y ∼ π_θ} [ R(x, y) · ∇_θ log π_θ(y | x) ]
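For reference, the log-derivative step can be written out line by line. This is a sketch under two assumptions that the lesson leaves implicit: R(x, y) does not depend on θ, and the gradient can be exchanged with the sum over responses. The only identity used is ∇_θ π_θ = π_θ · ∇_θ log π_θ.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \mathbb{E}_{x}\Big[\nabla_\theta \sum_{y} \pi_\theta(y \mid x)\, R(x, y)\Big] \\
  &= \mathbb{E}_{x}\Big[\sum_{y} R(x, y)\, \nabla_\theta \pi_\theta(y \mid x)\Big] \\
  &= \mathbb{E}_{x}\Big[\sum_{y} \pi_\theta(y \mid x)\, R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\Big] \\
  &= \mathbb{E}_{x,\, y \sim \pi_\theta}\big[R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\big]
\end{aligned}
```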

For a language model the response is autoregressive: log π_θ(y | x) = Σ_t log π_θ(y_t | x, y_<t). Drop that into the gradient and the REINFORCE estimator on a single rollout y ∼ π_θ(·|x) is

ĝ_θ = R(x, y) · Σ_t ∇_θ log π_θ(y_t | x, y_<t)

and the loss whose gradient is −ĝ is just

L_REINFORCE = − R · Σ_t log π_θ(y_t | x, y_<t)

That is REINFORCE. One scalar reward, multiplied onto the sum of per-token log-probs, with a minus sign so we can use a minimization optimizer.

# From RL/algorithms/00_reinforce.py — reinforce_step
seq_logp = (pol_logp * target_mask).sum(dim=-1)     # (K,) Σ_t log π(y_{i,t} | …)
pg_loss  = -(rewards.detach() * seq_logp).mean()    # − (1/K) Σ_i R_i · seq_logp_i
# plus β · KL anchor; same as every other algorithm in the folder.
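The excerpt above relies on tensors built elsewhere in the file. As a rough, dependency-free illustration of the same estimator, the sketch below uses a toy policy that is an assumption of this example and not the lesson's code: a single softmax over a three-token vocabulary, with every token drawn from it regardless of context. The helper names (`softmax`, `reinforce_loss`, `reinforce_gradient`) are hypothetical. For a softmax over logits, ∇ log π(a) is one-hot(a) minus π, which gives ĝ in closed form.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_loss(logits, tokens, reward):
    """L = -R * sum_t log pi(y_t) for the toy context-free policy."""
    probs = softmax(logits)
    return -reward * sum(math.log(probs[t]) for t in tokens)

def reinforce_gradient(logits, tokens, reward):
    """g_hat = R * sum_t grad log pi(y_t), with grad log pi(a) = onehot(a) - pi."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for tok in tokens:
        for j, p in enumerate(probs):
            grad[j] += reward * ((1.0 if j == tok else 0.0) - p)
    return grad

# Hypothetical rollout: three sampled tokens and a verifier reward of 1.
logits = [0.2, -0.1, 0.4]
tokens = [2, 0, 2]
reward = 1.0

g_hat = reinforce_gradient(logits, tokens, reward)
print("loss :", reinforce_loss(logits, tokens, reward))
print("g_hat:", g_hat)
```

With a reward of 0 the gradient is exactly zero, so a failed rollout teaches the policy nothing under this estimator.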

Why it's noisy — the structural variance problem

Two compounding sources of variance. Both are why nobody trains a frontier model with raw REINFORCE.

1. R is multiplied onto every per-token log-prob. For a binary verifier at a success rate of about 50%, half your rollouts contribute +1 · ∇log π and half contribute 0 · ∇log π, which is the worst-case Bernoulli variance p(1 − p) at p = 0.5. Every token in a rollout gets the same credit, so the gradient says "do more of what you just did, weighted by whether you got lucky": it reinforces good behaviors together with any other behavior that happened to appear in a rollout that scored.
2. There's no baseline. Any function of the prompt b(x) can be subtracted from R without biasing the estimator (we'll prove this in lesson 11). REINFORCE picks b(x) = 0, which for a non-negative reward is far from the variance-minimizing choice. A baseline close to 𝔼_y[R] cuts variance dramatically.
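The baseline claim can be checked exactly on a small example. The sketch below assumes a toy setup that is not from the lesson's code: a one-step softmax policy, uniform over four actions, with a binary reward that accepts two of them (a 50% success rate). Because the action space is tiny, the mean and total variance of (R − b) · ∇log π can be enumerated rather than sampled. The function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]

def estimator_stats(probs, rewards, baseline):
    """Exact mean and total variance of (R - b) * grad log pi(a), a ~ pi."""
    n = len(probs)
    samples = [
        [(rewards[a] - baseline) * g for g in grad_log_prob(probs, a)]
        for a in range(n)
    ]
    mean = [sum(probs[a] * samples[a][j] for a in range(n)) for j in range(n)]
    var = sum(
        probs[a] * sum((samples[a][j] - mean[j]) ** 2 for j in range(n))
        for a in range(n)
    )
    return mean, var

probs = softmax([0.0, 0.0, 0.0, 0.0])   # uniform toy policy
rewards = [1.0, 1.0, 0.0, 0.0]          # binary verifier, 50% success
expected_reward = sum(p * r for p, r in zip(probs, rewards))

mean_0, var_0 = estimator_stats(probs, rewards, baseline=0.0)
mean_b, var_b = estimator_stats(probs, rewards, baseline=expected_reward)

print("mean, b = 0     :", mean_0)
print("mean, b = E[R]  :", mean_b)
print("variance, b = 0   :", var_0)
print("variance, b = E[R]:", var_b)
```

In this toy case both baselines give the same expected gradient, while the total variance falls from 0.3125 to 0.125. How large the saving is on a real model depends on the reward distribution and the policy.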

Interactive · signal-to-noise vs. K

The simplest variance reduction available to REINFORCE: average over K independent rollouts before stepping. The expectation is unchanged; the variance scales as 1/K. Below, watch how the gradient estimator's direction stabilizes as K grows. The "true" gradient is the direction the policy should move in; each K-rollout estimate is what one batch actually sees.

Gradient estimator noise vs. K

Each click samples K rollouts and shows the resulting REINFORCE gradient vector (orange) against the underlying reward direction it should align with (blue). At K=1 the estimate swings wildly; at K=16 it concentrates near the truth. (Toy: the blue arrow is the parameter direction the reward function favors, not literally the expected gradient — but the estimator is consistent with it.)

[Interactive widget: a slider for K (initially 1) with readouts for the latest |ĝ − ∇J|, the average over 50 trials, and the theoretical 1/√K.]

The estimator's standard error scales as σ / √K — quadrupling K halves the noise. That's the entire "average more rollouts" argument; you can't do better than this without changing the estimator itself, which is what every algorithm in the rest of this folder does.
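The σ / √K scaling is easy to reproduce offline. The following is a minimal Monte Carlo sketch on an assumed toy problem (a uniform four-action softmax policy with a binary reward, not the lesson's widget or training code). It measures the root-mean-square distance between the K-rollout average and the exact gradient; all names and the seed are hypothetical choices for this example.

```python
import math
import random

def true_gradient(probs, rewards):
    """Exact E[R * grad log pi(a)] for a one-step softmax policy."""
    n = len(probs)
    return [
        sum(probs[a] * rewards[a] * ((1.0 if j == a else 0.0) - probs[j])
            for a in range(n))
        for j in range(n)
    ]

def sample_gradient(probs, rewards, rng):
    """One REINFORCE sample: draw an action, return R * (onehot - pi)."""
    a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [rewards[a] * ((1.0 if j == a else 0.0) - p)
            for j, p in enumerate(probs)]

def rms_error(k, probs, rewards, trials, rng):
    """RMS distance between the K-rollout average and the exact gradient."""
    target = true_gradient(probs, rewards)
    total = 0.0
    for _ in range(trials):
        avg = [0.0] * len(probs)
        for _ in range(k):
            g = sample_gradient(probs, rewards, rng)
            avg = [s + gi / k for s, gi in zip(avg, g)]
        total += sum((s - t) ** 2 for s, t in zip(avg, target))
    return math.sqrt(total / trials)

rng = random.Random(0)                 # fixed seed for repeatability
probs = [0.25, 0.25, 0.25, 0.25]       # uniform toy policy
rewards = [1.0, 1.0, 0.0, 0.0]         # binary verifier, 50% success

errors = {k: rms_error(k, probs, rewards, trials=2000, rng=rng)
          for k in (1, 4, 16)}
for k, err in errors.items():
    print(f"K={k:2d}  RMS error={err:.4f}  error*sqrt(K)={err * math.sqrt(k):.4f}")
```

If the scaling holds, the last column stays roughly constant across K, which is another way of saying that each halving of the noise costs four times as many rollouts.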

The KL anchor is non-negotiable

Even at K = ∞, REINFORCE on a verifiable task has a degenerate failure mode that averaging cannot fix: the policy can lock onto an output that only sometimes parses to the right answer by luck, and so reward-hack the verifier. The fix is the KL term we've already met:

L = L_REINFORCE + β · KL(π_θ ‖ π_ref)

where π_ref is the SFT checkpoint. Without it, an unregularized REINFORCE will happily collapse onto "always emit token X" if X is a digit that occasionally parses correctly. We touched this in lesson 3; the takeaway is that the anchor is part of every algorithm in this folder, not an algorithm-specific feature.
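To make the penalty concrete, here is a small sketch that computes the exact KL divergence between two categorical distributions and adds it to a policy-gradient loss. It is an illustration under assumptions: the four-token distributions, the value of β, and the function names are hypothetical, and the lesson's code may estimate the KL differently (for example per token from sampled log-probs).

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for categorical distributions; terms with p_i = 0 add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def anchored_loss(pg_loss, policy_probs, ref_probs, beta):
    """L = L_REINFORCE + beta * KL(pi_theta || pi_ref)."""
    return pg_loss + beta * kl_divergence(policy_probs, ref_probs)

ref = [0.25, 0.25, 0.25, 0.25]            # stand-in for the SFT checkpoint
mild = [0.40, 0.30, 0.20, 0.10]           # policy that has drifted a little
collapsed = [0.997, 0.001, 0.001, 0.001]  # "always emit token X"

beta = 0.1                                # assumed coefficient
for name, policy in [("ref", ref), ("mild", mild), ("collapsed", collapsed)]:
    kl = kl_divergence(policy, ref)
    print(f"{name:9s} KL={kl:.4f}  penalty={beta * kl:.4f}")
```

The collapsed policy pays a much larger penalty than the mildly shifted one, and that cost is what the anchor uses to discourage locking onto a single token.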

What REINFORCE doesn't have, in one table

Every line in this table is something one of the upcoming algorithms adds. Read it as a roadmap.

Feature	REINFORCE	Adds it in lesson…
State-only baseline (cuts variance)	—	10 (PPO learns one), 11 (GRPO uses group mean), 12 (RLOO uses leave-one-out)
Off-policy correction (re-use rollouts)	—	10 (importance ratio)
Trust region (cap step size)	—	10 (PPO clip)
Asymmetric clip (prevent entropy collapse)	—	13 (DAPO clip-higher)
Filter zero-variance groups	—	13 (DAPO dynamic sampling)
Token-level loss aggregation	—	13 (DAPO) and 14 (Dr.GRPO)
Length-bias correction	—	14 (Dr.GRPO drops the /|y| normalization)

Takeaway

REINFORCE is one line: multiply log-prob by reward, add a KL anchor. It works — see the reward EMA climb in 00_reinforce.py's output. It's also noisy. Every algorithm in the next five lessons is a different answer to "where does the variance come from, and how do we kill it?".