What is post-training RL?
The minimum viable loop, and the one equation that makes it work.
The setup, in one sentence
Supervised fine-tuning (SFT) teaches a model by showing it correct outputs. RL post-training teaches a model by letting it try, scoring what it produced, and nudging its probability mass toward attempts that scored well.
The distinction matters because for many tasks — math problems, code, multi-step reasoning — we have no labeled "correct chain of thought" to imitate. We only have a way to check if a final answer is right. That checker (a verifier, a unit-test runner, a reward model) is enough for RL; it is not enough for supervised learning.
Four symbols carry the rest of the lesson:
- πθ — the policy: the LLM we're training. Given a prompt, it produces a probability distribution over next tokens; sampling from it produces a response.
- x — a prompt. For us, e.g. <3+4+5>.
- y — a response. The full string the policy generates token-by-token.
- r(x, y) — the reward. For verifiable tasks this is binary: 1 if the answer is right, 0 otherwise.
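As a concrete illustration of a binary, verifiable reward, here is a minimal Python sketch for arithmetic prompts like the one above. The function name `reward`, the assumed prompt format (integers joined by `+`, optionally wrapped in angle brackets), and the convention that the answer is the last integer in the response are all assumptions made for this example, not part of any particular framework.

```python
import re

def reward(x: str, y: str) -> float:
    """Binary reward r(x, y): 1.0 if the last integer in y equals the sum in x.

    Hypothetical conventions: x looks like "3+4+5" or "<3+4+5>", and the
    model's final answer is the last integer that appears in y.
    """
    expression = x.strip().strip("<>")
    target = sum(int(term) for term in expression.split("+"))
    numbers = re.findall(r"-?\d+", y)
    if not numbers:
        return 0.0  # a response with no parseable answer counts as wrong
    return 1.0 if int(numbers[-1]) == target else 0.0

print(reward("<3+4+5>", "3+4=7, 7+5=12. The answer is 12"))
print(reward("<3+4+5>", "The answer is 13"))
```

Note that the checker only inspects the final answer. It says nothing about whether the intermediate steps were sensible, which is exactly the situation where no labeled chain of thought is available to imitate.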
The objective
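We want to find policy weights θ that maximize the expected reward when prompts x are drawn from the task distribution 𝒟 and responses y are sampled from the policy:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big]$$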
Read aloud: "average reward, over prompts drawn from our task distribution and responses drawn from the policy." If we knew ∇θ J, we'd just do gradient ascent: θ ← θ + η · ∇θ J. So the question is: how do we compute that gradient?
The one equation
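Direct differentiation fails: y is sampled from πθ, so the distribution we're averaging over depends on the thing we're differentiating. The way out is a single identity, the log-derivative trick:

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big]$$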
The derivation is one line: write the expectation as a sum, differentiate, multiply and divide by πθ(y), use ∇log π = ∇π / π, write back as an expectation. The consequence is huge: you can estimate the gradient by sampling. Sample y from the policy, multiply log πθ(y) by its observed reward, take a gradient. That is REINFORCE in its simplest form.
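Written out for a single fixed prompt x, the steps just described look as follows. This is a sketch under three assumptions: the response space is discrete (so the expectation is a sum), the reward does not depend on θ, and πθ(y|x) > 0 wherever we divide by it. The symbol J_x, the expected reward for that one prompt, is notation introduced here for illustration.

```latex
\begin{align*}
\nabla_\theta J_x(\theta)
  &= \nabla_\theta \sum_{y} \pi_\theta(y \mid x)\, r(x, y)
     && \text{expectation as a sum} \\
  &= \sum_{y} r(x, y)\, \nabla_\theta \pi_\theta(y \mid x)
     && \text{differentiate; } r \text{ does not depend on } \theta \\
  &= \sum_{y} \pi_\theta(y \mid x)\, r(x, y)\,
     \frac{\nabla_\theta \pi_\theta(y \mid x)}{\pi_\theta(y \mid x)}
     && \text{multiply and divide by } \pi_\theta(y \mid x) \\
  &= \sum_{y} \pi_\theta(y \mid x)\, r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)
     && \text{since } \nabla \log \pi = \nabla \pi / \pi \\
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
     \big[\, r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big]
     && \text{back to an expectation}
\end{align*}
```

Averaging both sides over prompts drawn from the task distribution gives the gradient of the full objective.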
Intuition: push log-probability up in proportion to reward. The reward acts as a per-sample learning-rate — high-reward samples move more than low-reward ones. (If rewards are strictly positive — as in the bandit below — every sampled action gets pushed up, just by different amounts. Subtracting a baseline from the reward, which we'll do in lesson 4, is what makes below-average actions get pushed down.)
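The parenthetical about baselines can be checked numerically on a tiny softmax policy. The sketch below assumes a policy parameterized directly by logits, for which the gradient of log π(a) with respect to the logits is onehot(a) minus the probability vector. The three-action setup, the reward of 0.2, the baseline of 0.5, and the helper names are all made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: onehot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def logit_step(logits, action, reward, baseline=0.0, lr=0.1):
    """Change in each logit from one REINFORCE update on a single sample."""
    g = grad_log_prob(logits, action)
    return [lr * (reward - baseline) * gi for gi in g]

logits = [0.0, 0.0, 0.0]   # uniform policy over three actions
sampled_action = 0          # pretend this action was sampled
r = 0.2                     # small but strictly positive reward (hypothetical)

no_baseline = logit_step(logits, sampled_action, r)
with_baseline = logit_step(logits, sampled_action, r, baseline=0.5)

print(no_baseline[0] > 0)    # True: a positive reward pushes the sampled action up
print(with_baseline[0] < 0)  # True: a below-baseline reward pushes it down
```

Without a baseline, the sampled action's logit rises even though its reward is poor; it just rises less than a better action's would. Subtracting a baseline flips the sign of the update for below-baseline samples.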
The minimum viable loop
Strip away every optimization, every algorithm, every system concern. The loop is:
repeat:
    y ~ πθ(·|x)                      # sample a response from the policy (ROLLOUT)
    r = verify(x, y)                 # score it (ENV / REWARD)
    θ ← θ + η · r · ∇θ log πθ(y|x)   # update toward sampled high-reward y (ALGORITHM + TRAINER)
Every additional role in the framework — the frozen reference, the weight-sync wire, the algorithm plug-in, the controller — exists to make this three-line loop scale to a 7-billion-parameter policy and a thousand rollouts per step without the loop breaking. We'll meet each in turn.
Interactive · the loop on a 5-action toy
To build intuition before we tackle token-level LLM policies, let's run the loop on the smallest possible policy: a distribution over five actions. Action C has the highest true reward (r=0.9); others are lower. The policy starts uniform. Hit Sample & Update repeatedly — watch the probability mass concentrate on the high-reward action.
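For readers without the interactive widget, the same experiment can be sketched in a few lines of Python. Only action C's reward of 0.9 and the uniform start come from the text. The other four reward values, the learning rate, the step count, the seed, and the choice of deterministic rewards with a softmax-over-logits policy are assumptions picked for illustration.

```python
import math
import random

ACTIONS = ["A", "B", "C", "D", "E"]
# Only C = 0.9 is given in the text; the other values are made up for this sketch.
TRUE_REWARD = [0.2, 0.5, 0.9, 0.4, 0.1]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_bandit(steps=5000, lr=0.1, seed=0):
    """Run the three-line loop on a 5-action bandit and return the final policy."""
    rng = random.Random(seed)
    logits = [0.0] * len(ACTIONS)  # all-zero logits give a uniform starting policy
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(len(ACTIONS)), weights=probs)[0]  # rollout: sample an action
        r = TRUE_REWARD[a]                                      # reward: score it
        # Update: the gradient of log pi(a) w.r.t. the logits is onehot(a) - probs.
        for i in range(len(logits)):
            indicator = 1.0 if i == a else 0.0
            logits[i] += lr * r * (indicator - probs[i])
    return softmax(logits)

final_probs = run_bandit()
for name, p in zip(ACTIONS, final_probs):
    print(f"{name}: {p:.3f}")
```

Because every reward here is positive, each sampled action is pushed up when it is drawn. Action C still wins because it is pushed up by the largest amount, and the more often it is sampled the more often it is reinforced.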
Where the framework goes from here
The bandit above is a one-step world: one action, one reward, one update. The LLM setting differs in three ways, each of which forces a new piece of the framework into existence:
- Sequential. A response is many tokens, sampled autoregressively. The "action" is the whole sequence. Lesson 2 (rollout) is about this.
- Reward shaping is hard. Sparse, terminal-only reward leaves you trying to credit-assign over hundreds of tokens. The reference policy + KL anchor (lesson 3) and the advantage estimator (lesson 4) are the response.
- The policy is huge. Sampling and training have opposite hardware profiles. Roles 5–7 — trainer, weight-sync, controller — exist to keep both fast at once.
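The "Sequential" point has a direct consequence for the update rule: because the response is sampled autoregressively, its log-probability factorizes over tokens. As a sketch, with T denoting the response length and y_t its t-th token (notation introduced here for illustration):

```latex
\begin{align*}
\pi_\theta(y \mid x) &= \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t}) \\
\log \pi_\theta(y \mid x) &= \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t}) \\
r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)
  &= r(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})
\end{align*}
```

With a terminal-only reward, every token in the response receives the same scalar weight r(x, y), which is one way to see why credit assignment over hundreds of tokens is hard.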