What is post-training RL?

The minimum viable loop, and the one equation that makes it work.

The setup, in one sentence

Supervised fine-tuning (SFT) teaches a model by showing it correct outputs. RL post-training teaches a model by letting it try, scoring what it produced, and nudging its probability mass toward attempts that scored well.

The distinction matters because for many tasks (math problems, code, multi-step reasoning) we have no labeled "correct chain of thought" to imitate. We only have a way to check whether a final answer is right. That checker (a verifier, a unit-test runner, a reward model) is enough for RL; it is not enough for supervised learning, which needs the target output itself.

Vocabulary, kept compact

π_θ — the policy: the LLM we're training. Given a prompt, it produces a probability distribution over next tokens; sampling from it produces a response.
x — a prompt. For us, e.g. <3+4+5>.
y — a response. The full string the policy generates token-by-token.
r(x, y) — the reward. For verifiable tasks this is binary: 1 if the answer is right, 0 otherwise.
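
To make the binary reward concrete, here is a minimal sketch of a verifier for sum prompts like the one above. The function name `reward`, the plain "3+4+5" prompt format (without the angle brackets), and the rule "the last integer in the response is the answer" are all assumptions made for illustration, not part of any framework.

```javascript
// Hypothetical verifier for sum prompts such as "3+4+5" (illustrative assumption).
// Returns 1 if the last integer in the response equals the true sum, else 0.
function reward(x, y) {
  // Ground truth: add up the terms of the prompt.
  const target = x.split("+").reduce((sum, term) => sum + Number(term), 0);
  // Treat the last integer that appears in the response as the final answer.
  const matches = y.match(/-?\d+/g);
  if (matches === null) return 0; // no number at all: wrong
  const answer = Number(matches[matches.length - 1]);
  return answer === target ? 1 : 0;
}

console.log(reward("3+4+5", "3+4=7, 7+5=12"));    // 1
console.log(reward("3+4+5", "The answer is 13")); // 0
```

Note that the checker never inspects the intermediate reasoning, only the final answer, which is exactly the situation described above.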

The objective

We want to find policy weights θ that maximize the expected reward when we sample from the policy:

J(θ) = 𝔼_{x ∼ D} 𝔼_{y ∼ π_θ(·|x)} [ r(x, y) ]

Read aloud: "average reward, over prompts drawn from our task distribution and responses drawn from the policy." If we knew ∇_θ J, we'd just do gradient ascent: θ ← θ + η · ∇_θ J. So the question is: how do we compute that gradient?
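
As a rough illustration of that double expectation, the sketch below estimates J(θ) by plain Monte Carlo: draw a prompt, draw a response, average the rewards. Everything here is a hypothetical stand-in: `estimateJ`, the two-prompt dataset, and the fake "policy" that answers correctly 70% of the time are assumptions chosen so the true value of J is known to be 0.7.

```javascript
// Monte Carlo estimate of J(θ): average reward over sampled (prompt, response) pairs.
function estimateJ(samplePrompt, sampleResponse, reward, n) {
  let total = 0;
  for (let i = 0; i < n; i++) {
    const x = samplePrompt();      // x ~ D
    const y = sampleResponse(x);   // y ~ π_θ(·|x)
    total += reward(x, y);         // r(x, y)
  }
  return total / n;
}

// Hypothetical stand-ins (assumptions for illustration only).
const prompts = ["1+1", "2+3"];
const correctAnswer = x => String(x.split("+").reduce((s, t) => s + Number(t), 0));
const samplePrompt = () => prompts[Math.floor(Math.random() * prompts.length)];
// A fake "policy": right 70% of the time, otherwise it answers "0".
const sampleResponse = x => (Math.random() < 0.7 ? correctAnswer(x) : "0");
const reward = (x, y) => (y === correctAnswer(x) ? 1 : 0);

const jHat = estimateJ(samplePrompt, sampleResponse, reward, 50000);
console.log("estimated J:", jHat.toFixed(3)); // close to 0.7 by construction
```

Estimating J this way is easy; the hard part, taken up next, is estimating its gradient.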

The one equation

We cannot just move the gradient inside the expectation: y is sampled from π_θ, so the distribution we're averaging over depends on the very parameters we're differentiating. The way out is a single identity, the log-derivative trick:

∇_θ 𝔼_{y ∼ π_θ} [ r(y) ] = 𝔼_{y ∼ π_θ} [ r(y) · ∇_θ log π_θ(y) ]

The derivation is one line: write the expectation as a sum, differentiate, multiply and divide by π_θ(y), use ∇log π = ∇π / π, write back as an expectation. The consequence is huge: you can estimate the gradient by sampling. Sample y from the policy, multiply log π_θ(y) by its observed reward, take a gradient. That is REINFORCE in its simplest form.
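
Spelled out, the steps just listed look like this. The sketch assumes y ranges over a discrete set (so the expectation is a sum), that π_θ(y) > 0 wherever we divide by it, and that the reward r(y) does not depend on θ.

```latex
\begin{align*}
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta}[r(y)]
  &= \nabla_\theta \sum_y \pi_\theta(y)\, r(y)
     && \text{expectation as a sum} \\
  &= \sum_y r(y)\, \nabla_\theta \pi_\theta(y)
     && \text{$r(y)$ does not depend on $\theta$} \\
  &= \sum_y \pi_\theta(y)\, r(y)\, \frac{\nabla_\theta \pi_\theta(y)}{\pi_\theta(y)}
     && \text{multiply and divide by $\pi_\theta(y)$} \\
  &= \sum_y \pi_\theta(y)\, r(y)\, \nabla_\theta \log \pi_\theta(y)
     && \text{$\nabla \log \pi = \nabla \pi / \pi$} \\
  &= \mathbb{E}_{y \sim \pi_\theta}\big[\, r(y)\, \nabla_\theta \log \pi_\theta(y) \,\big]
     && \text{back to an expectation}
\end{align*}
```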

Intuition: push log-probability up in proportion to reward. The reward acts as a per-sample learning rate: high-reward samples move more than low-reward ones. (If rewards are strictly positive, as in the bandit below, every sampled action gets pushed up, just by different amounts. With a binary reward, a sample that scores 0 produces no update at all. Subtracting a baseline from the reward, which we'll do in lesson 4, is what makes below-average actions get pushed down.)

Why this is harder than supervised learning

In SFT, the training signal is fixed by the data: every position has one labeled target token, so the gradient for a given example is deterministic. In RL, the gradient is a noisy estimate: you sample, you might be lucky, you might be unlucky, and the estimate is the reward-weighted average of your luck. Every algorithm we'll meet in lesson 4 is a way to reduce that variance.
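
To see the noise directly, the sketch below compares single-sample gradient estimates with the exact gradient, using the five-action reward table from the toy bandit later in this lesson and a uniform softmax policy. The exact gradient formula ∂J/∂θ_i = π_i (r_i − J) follows from summing r_a (e_a − π) over actions weighted by π_a; the helper names (`sample`, `estimate`) and the sample count are assumptions for illustration.

```javascript
const rewards = [0.1, 0.3, 0.9, 0.4, 0.2]; // reward table from the toy bandit
const pi = [0.2, 0.2, 0.2, 0.2, 0.2];      // uniform softmax policy (all logits 0)

// Inverse-CDF sampling from a categorical distribution.
function sample(p) {
  let u = Math.random();
  for (let i = 0; i < p.length; i++) {
    u -= p[i];
    if (u < 0) return i;
  }
  return p.length - 1; // guard against floating-point round-off
}

// Exact gradient for a softmax bandit: ∂J/∂θ_i = π_i (r_i − J).
const J = pi.reduce((s, p, i) => s + p * rewards[i], 0);
const exact = pi.map((p, i) => p * (rewards[i] - J));

// Single-sample REINFORCE estimate for sampled action a: r_a (e_a − π).
function estimate(a) {
  return pi.map((p, i) => rewards[a] * ((i === a ? 1 : 0) - p));
}

// Accumulate the mean and mean square of the estimate, per component.
const N = 100000;
const mean = new Array(pi.length).fill(0);
const meanSq = new Array(pi.length).fill(0);
for (let n = 0; n < N; n++) {
  const g = estimate(sample(pi));
  g.forEach((v, i) => { mean[i] += v / N; meanSq[i] += (v * v) / N; });
}
const std = mean.map((m, i) => Math.sqrt(Math.max(0, meanSq[i] - m * m)));

console.log("exact gradient :", exact.map(v => v.toFixed(3)));
console.log("sample mean    :", mean.map(v => v.toFixed(3)));
console.log("per-sample std :", std.map(v => v.toFixed(3)));
```

At the uniform policy the exact component for action C is 0.2 · (0.9 − 0.38) = 0.104, whereas a single sample yields 0.72 when C is drawn and a small negative number otherwise. The average is right, but any one sample is far from it.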

The minimum viable loop

Strip away every optimization, every algorithm, every system concern. The loop is:

repeat:
    y ~ π_θ(·|x)                        # sample a response from the policy    (ROLLOUT)
    r = verify(x, y)                    # score it                             (ENV / REWARD)
    θ ← θ + η · r · ∇_θ log π_θ(y|x)    # update toward sampled high-reward y  (ALGORITHM + TRAINER)

Every additional role in the framework — the frozen reference, the weight-sync wire, the algorithm plug-in, the controller — exists to make this three-line loop scale to a 7-billion-parameter policy and a thousand rollouts per step without the loop breaking. We'll meet each in turn.

Interactive · the loop on a 5-action toy

To build intuition before we tackle token-level LLM policies, let's run the loop on the smallest possible policy: a distribution over five actions. Action C has the highest true reward (r=0.9); others are lower. The policy starts uniform. Hit Sample & Update repeatedly — watch the probability mass concentrate on the high-reward action.

Toy policy: 5 actions, 1 best

Each step samples one action from π_θ, observes its reward, and applies one REINFORCE update. Step size η is the bottom slider — turn it up, watch variance kick you around.

The JS that runs this widget:

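// True (hidden) reward per action; C (index 2) is best.
const rewards = [0.1, 0.3, 0.9, 0.4, 0.2];

// θ are unconstrained logits; π = softmax(θ).
let theta = [0, 0, 0, 0, 0];

// Numerically stable softmax: subtract the max logit before exponentiating.
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// Draw an index from the categorical distribution p (inverse-CDF sampling).
// Not shown in the original listing; this is an assumed implementation.
function sample(p) {
  let u = Math.random();
  for (let i = 0; i < p.length; i++) {
    u -= p[i];
    if (u < 0) return i;
  }
  return p.length - 1; // guard against floating-point round-off
}

// One REINFORCE update with step size eta.
function step(eta) {
  const pi = softmax(theta);
  const a  = sample(pi);            // y ~ π_θ
  const r  = rewards[a];            // r(y)
  // ∇_θ log π_θ(a) = e_a − π    (standard softmax-policy gradient)
  for (let i = 0; i < theta.length; i++) {
    const grad_i = (i === a ? 1 : 0) - pi[i];
    theta[i] += eta * r * grad_i;
  }
}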
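
The gradient formula used in the update, ∇ log π_a(θ) = e_a − π, can be sanity-checked numerically. The sketch below compares it against central finite differences of log softmax; the helper names (`logProb`, `analyticGrad`, `numericGrad`), the test logits, and the step size h are arbitrary choices for illustration.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// log π_a(θ) for a softmax policy.
function logProb(theta, a) {
  return Math.log(softmax(theta)[a]);
}

// Analytic gradient used in the update: e_a − π.
function analyticGrad(theta, a) {
  const pi = softmax(theta);
  return pi.map((p, i) => (i === a ? 1 : 0) - p);
}

// Central finite differences: (f(θ + h·e_i) − f(θ − h·e_i)) / (2h).
function numericGrad(theta, a, h = 1e-5) {
  return theta.map((_, i) => {
    const plus = theta.slice();
    const minus = theta.slice();
    plus[i] += h;
    minus[i] -= h;
    return (logProb(plus, a) - logProb(minus, a)) / (2 * h);
  });
}

const testTheta = [0.3, -1.2, 0.8, 0.0, 0.5]; // arbitrary logits
const action = 2;
const analytic = analyticGrad(testTheta, action);
const numeric = numericGrad(testTheta, action);
const maxErr = Math.max(...analytic.map((g, i) => Math.abs(g - numeric[i])));
console.log("max |analytic − numeric| =", maxErr);
```

The two gradients should agree to many decimal places, and the components of e_a − π always sum to zero, so each update only moves probability mass between actions.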

Where the framework goes from here

The bandit above is a one-step world: one action, one reward, one update. The LLM setting differs in three ways, each of which forces a new piece of the framework into existence:

Sequential. A response is many tokens, sampled autoregressively. The "action" is the whole sequence. Lesson 2 (rollout) is about this.
Credit assignment is hard. A sparse, terminal-only reward leaves you trying to assign credit across hundreds of tokens. The reference policy + KL anchor (lesson 3) and the advantage estimator (lesson 4) are the response.
The policy is huge. Sampling and training have opposite hardware profiles. Roles 5–7 — trainer, weight-sync, controller — exist to keep both fast at once.
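
For the first of these points, the log-derivative identity carries over unchanged because an autoregressive policy factorizes over tokens. As a sketch, writing T for the response length, y_t for the t-th token, and y_{<t} for the tokens before it (notation introduced here, not used elsewhere in the lesson):

```latex
\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})
\quad\Longrightarrow\quad
\nabla_\theta \log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})
```

So the gradient of the whole-sequence log-probability is just the sum of per-token gradients, each weighted by the same terminal reward in the simplest form of the update.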

Takeaway

RL = sample, score, multiply log-probability by reward, take a gradient. Everything else is variance reduction or systems engineering. Hold that in your head; we're going to add complexity one layer at a time.