
Why RL is shaping this way today — the synthesis

You met the three forces briefly in lesson 00's map. Now that you've walked the 24 lessons in between, this is the unhurried version: each force in detail, the historical timeline that produced today's field, and the live simulator that lets you recover named recipes from extreme slider settings.

If you skipped to here

Read lesson 00 first — it's a 90-second map that defines the policy, the loop, and the three forces in plain language. This lesson assumes that vocabulary and walks each force in depth.

The one diagram

Every modern RL post-training pipeline is the intersection of three pressures. Each pressure pushes in a direction; the design you end up with is wherever the three meet.

[Figure: a triangle of three pressures. SIGNAL pressure ("where does reward come from?"): humans → RMs → verifiers. COST pressure ("what fits on a cluster?"): drop critic, drop RM, async. ESTIMATOR pressure ("low variance, low bias"): baselines, clips, KL anchor. Centre label: 2026 RL = GRPO + verifier + disagg.]

Read it like this: SIGNAL is the question "how do we know if the output is good?" — a moving target as tasks got harder. ESTIMATOR is the statistical question "given some signal, how do we turn it into a gradient without poisoning the optimizer?" — the variance-and-bias story you walked through in lessons 09–14. COST is the hardware question "what can we actually run on 8× H100?" — memory, bandwidth, scheduling. Every algorithm and every system pattern in the curriculum is a point in that triangle.

One nuance: the three forces aren't independent — they form a causal chain. Better signal (a verifier) unlocks cheaper estimators (drop the critic → GRPO), which in turn unlocks simpler topologies (fewer model copies to ship around). Picture three arrows: signal → estimator → cost. When a paper claims a new SOTA, the question to ask is "which of the three arrows did they move, and what did that let the next one do?"

Force 1 · Signal pressure — where does reward come from?

Why this even matters: RL needs a reward function. Supervised fine-tuning doesn't — you label "the right answer" and minimize cross-entropy. RL is what you reach for when labelling the right answer is harder than checking it. The history of post-training RL is the history of how the field answered the "where does reward come from" question.

| Era | Reward source | Why it had to change |
|---|---|---|
| 2017–2022: RLHF (InstructGPT) | A reward model trained on ~30k human pairwise preferences | Humans are slow (30–120 s per pair) and expensive. Cap on data size means cap on signal quality. The RM is also a learned proxy, so it can be reward-hacked. Lesson 15. |
| 2023: DPO & the preference-loss family | The same preferences, but consumed as a supervised loss (no rollouts) | RLHF with PPO needs four model copies (policy + ref + RM + critic) and a rollout loop. DPO collapses that to two model copies (policy + ref) with one forward through each per training pair: cheaper to run, but you also lose the on-policy exploration. Lesson 16. |
| 2024: RLVR (verifiable rewards) | An executable verifier: math equality, code unit tests, regex match | For math/code, you don't need a human or a learned proxy; the reward is a function. Free, fast (~ms), and much harder to hack than an RM (though not impossible; see lesson 18's format-exploit demo). This unlocks GRPO-style training at scale. |
| 2025: Mixed (R1, Tülu 3, Qwen-3) | Verifier where you can; RM/judge where you can't; sometimes PRM for dense credit | One source isn't enough for a real product. Reasoning gets verifier-driven RL; "be helpful, harmless, honest" gets RM-driven RL; readability comes from rejection-sampled SFT. Lesson 23. |
| 2026: Verifier-everywhere + agentic environments | Verifiers extended to web, tool-use, code-editing tasks (SWE-bench) | The frontier is "make more tasks verifiable" rather than "scale human preferences". Verifier engineering is becoming a discipline of its own. Lesson 8, Lesson 18. |
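To make "the reward is a function" concrete, here is a minimal sketch of a math-equality verifier of the kind the 2024 row describes. The `\boxed{...}` answer convention and the function names are assumptions for illustration, not something this lesson prescribes.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text inside the last \\boxed{...}, or None if there is none.

    Assumption: the model is asked to wrap its final answer in \\boxed{}.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer, no reward
    try:
        # Numeric comparison when both sides parse as numbers.
        return 1.0 if abs(float(answer) - float(reference)) < 1e-9 else 0.0
    except ValueError:
        # Otherwise fall back to an exact string match.
        return 1.0 if answer == reference.strip() else 0.0


print(math_reward("The total is \\boxed{42}", "42"))
print(math_reward("I think it is 42", "42"))
```

A checker this small is also where format exploits come from: it rewards whatever ends up inside the last box, so anything the parser accepts counts as an answer.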

The mental model to walk away with: RLHF didn't fail; it got factored. The expensive part (humans) was replaced where possible with verifiers; the inexpensive but biased part (the RM) was kept where verifiers don't exist; and the parts of RL that always worked (KL anchor, policy gradient) survived through it all.

The "why GRPO won" answer in one sentence
Verifiable rewards (math, code) became the dominant training signal in 2024–2025, and once you have a verifier the value head in PPO becomes a memory-hungry liability: the mean reward of the K rollouts in a group gives you a baseline without a second model. GRPO is "PPO minus the critic" once you can afford K rollouts per prompt.
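A minimal sketch of that group baseline, assuming binary verifier rewards and K = 4 rollouts per prompt. The function name and the sample rewards are hypothetical.

```python
from statistics import fmean


def group_advantages(rewards_per_prompt):
    """Subtract each prompt's mean reward from its own K rollouts.

    rewards_per_prompt: list of lists, one inner list of K rewards per prompt.
    """
    advantages = []
    for group in rewards_per_prompt:
        baseline = fmean(group)  # the group mean stands in for a value head
        advantages.append([reward - baseline for reward in group])
    return advantages


# Hypothetical verifier outputs (1.0 = pass, 0.0 = fail) for two prompts.
rewards = [
    [1.0, 0.0, 0.0, 1.0],  # mixed group: carries learning signal
    [1.0, 1.0, 1.0, 1.0],  # every rollout passed: all advantages are zero
]
print(group_advantages(rewards))
```

Note the second group: when every rollout gets the same reward, the baseline cancels it exactly and that prompt contributes no gradient, which is one reason K and prompt difficulty matter.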

Force 2 · Estimator pressure — variance and bias, told as a story

This is the lesson 09–14 story, retold as one continuous narrative now that you've met each patch on its own. Strip away the acronyms and there's one core question:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\big[\, R(y) \cdot \nabla_\theta \log \pi_\theta(y) \,\big]$$

That's the policy-gradient theorem. In English: "to push the policy toward high-reward outputs, sample some outputs, and increase their log-probability in proportion to how much reward they got." The whole zoo of algorithms is patches on this one expression. Each patch fixes one specific way it goes wrong in practice.
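To see the expression in action, here is a toy sketch: a softmax policy over three possible outputs, where the gradient is computed once by enumerating every output and once by sampling. The three-output setup, the function names, and the `baseline` argument are assumptions chosen so the example fits on a page.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards, baseline=0.0):
    """Exact E_y[(R(y) - b) * d log pi(y) / d logit_j], enumerating every y."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, (p_y, r_y) in enumerate(zip(probs, rewards)):
        for j in range(len(logits)):
            # For a softmax policy: d log pi(y) / d logit_j = 1[y == j] - pi(j).
            score = (1.0 if y == j else 0.0) - probs[j]
            grad[j] += p_y * (r_y - baseline) * score
    return grad


def sampled_gradient(logits, rewards, num_samples, rng, baseline=0.0):
    """Monte Carlo estimate of the same quantity from sampled outputs."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    outputs = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    for y in outputs:
        for j in range(len(logits)):
            score = (1.0 if y == j else 0.0) - probs[j]
            grad[j] += (rewards[y] - baseline) * score / num_samples
    return grad


logits = [0.0, 0.0, 0.0]   # uniform policy over three outputs
rewards = [1.0, 0.0, 0.0]  # only output 0 is rewarded
print(exact_gradient(logits, rewards))
print(exact_gradient(logits, rewards, baseline=0.5))
print(sampled_gradient(logits, rewards, 20_000, random.Random(0)))
```

The exact gradient pushes up the logit of the rewarded output and pushes down the others, and it is identical with or without the constant baseline. Only the noise of the sampled estimate depends on the baseline.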

| Patch | What it fixes | Algorithm where it first appears |
|---|---|---|
| Subtract a baseline | R(y) by itself is huge and noisy. Subtracting a baseline that doesn't depend on y leaves the gradient unchanged in expectation but drops variance by orders of magnitude. | Anything past raw REINFORCE |
| Learn the baseline | A good baseline is the expected reward E[R(y)], but we don't know it. Train a value head V_φ(s) to estimate it. | PPO (lesson 10) |
| Use the group as a baseline | Value heads are expensive (second model). For verifiable tasks, sample K rollouts per prompt and use their mean as the baseline. Free. | GRPO (lesson 11) |
| Use leave-one-out | The group mean includes y_i in its own baseline, which shrinks the expected gradient by a factor of (K−1)/K at finite K. Leave-one-out removes that bias. | RLOO (lesson 12) |
| Clip the ratio | Reuse old rollouts (importance sampling), but ratios can blow up. Clipping bounds the damage. | PPO clip (lesson 10) |
| Anchor with KL | The policy can wander off into reward-hacking territory. Penalize KL divergence from a frozen reference. | RLHF + everything since |
| Asymmetric clip | Symmetric clipping kills exploration. Let positives clip higher than negatives. | DAPO (lesson 13) |
| Token-level loss | Per-rollout aggregation underweights long correct outputs. Sum over all tokens, divide by the total number of unmasked tokens. | DAPO + Dr.GRPO (lessons 13–14) |
| Drop the /std | Std-normalization amplifies low-variance prompts (easiest/hardest), which is a bias. | Dr.GRPO (lesson 14) |
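Three of the rows above differ only in how a group's rewards become advantages, so they are easy to compare side by side. This is a sketch under assumptions: population standard deviation and the `eps` value are illustrative choices, and the function names are hypothetical.

```python
from statistics import fmean, pstdev


def group_mean_advantages(rewards):
    """Each reward minus the mean of the whole group (itself included)."""
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


def leave_one_out_advantages(rewards):
    """Each reward minus the mean of the other K - 1 rewards."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def std_normalized_advantages(rewards, eps=1e-6):
    """Group-mean advantages divided by the group's standard deviation."""
    baseline = fmean(rewards)
    scale = pstdev(rewards) + eps  # eps avoids division by zero
    return [(r - baseline) / scale for r in rewards]


hard = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]      # 1 of 8 rollouts passes
balanced = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # 4 of 8 rollouts pass

print(group_mean_advantages(hard)[0], leave_one_out_advantages(hard)[0])
print(std_normalized_advantages(hard)[0], std_normalized_advantages(balanced)[0])
```

Two things to look for. The group-mean advantage is always the leave-one-out advantage scaled by (K−1)/K. And dividing by the standard deviation scales up the low-variance `hard` group more than the `balanced` one, which is the effect the last row describes.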

If you internalize one thing about the algorithm parade: every algorithm is "policy gradient + a list of patches". They differ in which patches they keep. Read any new paper as "which boxes do they check?".
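As an illustration of "which boxes do they check?", the sketch below treats the clip range and the loss aggregation as two independent switches on the same per-token objective. The dictionary layout, function names, and clip values are assumptions for illustration, not any framework's API.

```python
import math


def clipped_token_objective(logp_new, logp_old, advantage,
                            clip_low=0.2, clip_high=0.2):
    """PPO-style clipped surrogate for one token (a quantity to maximize).

    Setting clip_high > clip_low gives an asymmetric clip.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def sequence_mean_loss(batch, **clip):
    """Average within each rollout first, then across rollouts."""
    per_seq = []
    for seq in batch:
        tokens = [clipped_token_objective(n, o, seq["advantage"], **clip)
                  for n, o in zip(seq["logp_new"], seq["logp_old"])]
        per_seq.append(sum(tokens) / len(tokens))
    return -sum(per_seq) / len(per_seq)


def token_mean_loss(batch, **clip):
    """Sum over every token in the batch, divide by the total token count."""
    total, count = 0.0, 0
    for seq in batch:
        for n, o in zip(seq["logp_new"], seq["logp_old"]):
            total += clipped_token_objective(n, o, seq["advantage"], **clip)
            count += 1
    return -total / count


# Hypothetical batch: a long correct rollout and a short incorrect one.
batch = [
    {"logp_new": [-1.0, -1.0, -1.0], "logp_old": [-1.0, -1.0, -1.0], "advantage": 1.0},
    {"logp_new": [-1.0], "logp_old": [-1.0], "advantage": -1.0},
]
print(sequence_mean_loss(batch), token_mean_loss(batch))
print(clipped_token_objective(math.log(1.5), 0.0, 1.0))
print(clipped_token_objective(math.log(1.5), 0.0, 1.0, clip_high=0.28))
```

With per-rollout averaging the long correct rollout and the short incorrect one cancel; with token-level averaging the long rollout's three tokens outweigh the short rollout's one. Raising `clip_high` lets a positive-advantage token keep more of its ratio before the clip takes over.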

Force 3 · Cost pressure — what actually fits on the cluster

The third force is the one that makes RL post-training a systems problem rather than a math problem. To turn the gradient above into a trained model, you need:

Five components, three different hardware profiles, and they all have to compose into one training loop. The arithmetic from lesson 22:

training: 18 bytes / param  |  inference: 2 bytes / param + KV cache  |  KV / token ≈ 130 KB at 7B with GQA

That arithmetic is what drives the design. A 70B model in training takes ~1.26 TB of GPU memory; in inference it takes ~140 GB plus KV cache. You can't put both copies on the same GPUs without making one wait for the other. You also can't broadcast 140 GB of weights every step without crushing the network. So the cluster topology has to:

  1. Separate rollout GPUs from trainer GPUs (disaggregation).
  2. Use the most aggressive inference tricks on the rollout side (paged KV, continuous batching, prefix caching).
  3. Use the most aggressive training tricks on the trainer side (FSDP, sequence parallel, activation checkpointing).
  4. Build a fast bridge between them (NCCL broadcast + dtype cast + reshard).
  5. Hide that bridge under the next rollout step (async / overlapped sync).
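The memory figures behind that list are quick to reproduce. In the sketch below, the 18 and 2 bytes per parameter come from the arithmetic above; the layer count, KV-head count, and head dimension are an assumed 7B-class GQA configuration, picked only to show how a figure near 130 KB per token can arise.

```python
def training_bytes(num_params, bytes_per_param=18):
    """Training footprint at the lesson's 18 bytes per parameter."""
    return num_params * bytes_per_param


def inference_weight_bytes(num_params, bytes_per_param=2):
    """Weights only, at 2 bytes per parameter; KV cache comes on top."""
    return num_params * bytes_per_param


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache bytes for one token across all layers."""
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


params_70b = 70_000_000_000
print(training_bytes(params_70b) / 1e12, "TB for training")
print(inference_weight_bytes(params_70b) / 1e9, "GB of inference weights")

# Assumed GQA shape: 32 layers, 8 KV heads, head dimension 128.
print(kv_bytes_per_token(32, 8, 128) / 1e3, "KB of KV cache per token")
```

Multiplying the per-token KV figure by context length and by the number of concurrent rollouts shows how quickly the cache, not the weights, becomes the rollout-side constraint.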

Everything you read about verl, OpenRLHF, SLIME, NeMo-RL, TRL is a different point on the spectrum of "how aggressively did we hide the bridge?". The five frameworks differ in where each role runs and how data moves between them — not what the roles are.

The three forces meet — and produce the 2026 reference recipe

If you put the three forces together honestly, the recipe almost falls out by itself:

| Pressure | Wants you to use |
|---|---|
| Signal (verifier where possible, RM elsewhere) | RLVR for math/code; RM-driven RL for general assistance; mix them in stages. |
| Estimator (variance/bias clean) | GRPO or Dr.GRPO with PPO clip + KL anchor. Token-level loss. K ≥ 4 rollouts per prompt. |
| Cost (separate roles) | Disaggregated topology: vLLM/SGLang for rollout, FSDP for trainer, NCCL broadcast for weight sync. |

And that's the recipe. R1, Tülu 3, Qwen-2.5/3, and nearly every open-source reasoning model in 2025 land inside this box. The recipes differ in data mixing, cold-start SFT, rejection sampling between stages, and RM design for the non-verifiable parts, not in the loss or the topology.
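One way to hold the recipe in your head is as a small config object, one field per recommendation. This is a hypothetical sketch: the class, its field names, and the default of 8 rollouts are made up for illustration and do not correspond to any framework's configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRecipe:
    # Estimator pressure
    algorithm: str = "GRPO"            # or "Dr.GRPO"
    rollouts_per_prompt: int = 8       # illustrative default; the recipe asks for K >= 4
    token_level_loss: bool = True
    use_ppo_clip: bool = True
    use_kl_anchor: bool = True
    # Cost pressure
    rollout_engine: str = "vLLM"       # or "SGLang"
    trainer_sharding: str = "FSDP"
    weight_sync: str = "NCCL broadcast"

    def __post_init__(self):
        if self.rollouts_per_prompt < 4:
            raise ValueError("the reference recipe uses K >= 4 rollouts per prompt")

    def reward_source(self, task_is_verifiable: bool) -> str:
        """Signal pressure: verifier where possible, reward model elsewhere."""
        return "verifier" if task_is_verifiable else "reward model"


recipe = ReferenceRecipe()
print(recipe.reward_source(task_is_verifiable=True))
print(recipe.reward_source(task_is_verifiable=False))
```

The things the recipes actually disagree on (data mixing, cold-start SFT, rejection sampling, RM design) are deliberately absent from the object.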

Interactive · feel the three forces

Use the sliders below to set each pressure independently, and the simulator reports which recipe you'd land on. The mapping is intentionally simple. It's not a real model; it's a decision diagram with knobs, so the decision boundaries are visible.

Three-forces simulator
[Interactive widget: move the sliders and the recommended-recipe row updates live, reporting four fields: reward source, algorithm, topology, and closest named recipe. Try the extremes; they recover named recipes from the literature.]

What this means if you're new to RL

If you opened lesson 01 and made it here, the curriculum probably feels like one long answer to a question you never quite knew you were asking. To summarize what the question is:

The question every RL lesson is answering

"How do we improve a pretrained language model on a task where we can check the answer but can't easily label it — while spending no more compute than we have to, and without the model gaming the checker?"

Every algorithm is a different way of computing the gradient. Every system pattern is a different way of running the loop. Every recipe is a different way of staging the signal.

If you remember nothing else, remember this: signal × estimator × cost. When you see a new paper, ask which of the three it's moving on. It is almost always moving on just one.

Where to go from here

  1. Re-read with the three forces in mind. Lessons 09–14 are the estimator story; lessons 15–18 are the signal story; lessons 19–23 are the cost story. The triangle is your map.
  2. Go to lesson 27 to see what the curriculum doesn't cover yet — the candidate concepts that would extend each of the three forces.
  3. Go to lesson 28 for the systems-perspective answer to "how do you make this fast?": throughput bottlenecks, optimizations, and how to identify them in practice.