RL Post-Training, From First Principles
A linearized tour of the system and the algorithms behind modern reasoning models — built so you understand why each piece exists before you see the code.
This series of thirty-two interactive lessons unwraps modern RL post-training from scratch. Part I (lessons 01–08) covers the framework: the six roles that surround the loss in any real post-training pipeline. Part II (lessons 09–14) covers the algorithms: REINFORCE through Dr.GRPO, each one a one-line patch on its predecessor. Part III (lessons 15–23) covers production: the RLHF / DPO lineage, environments and verifiers, system topology, inference engines, memory math, and the famous recipes (R1, Tülu 3, Qwen, o-series) that combine them. Part IV (lessons 24–25) covers the engineer's perspective: what an RL infra engineer actually owns, and the six kernel surfaces they ship against. Part V (lessons 26–28) is a synthesis: the three forces driving the field, the concepts the curriculum doesn't yet cover, and the bottleneck / optimization / diagnosis playbook for the framework. Each lesson has one interactive widget so you can grab a knob and feel the consequence.
RL/framework/ or RL/algorithms/ and say what it is and why.
The system you're learning
Every modern RL post-training system (verl, OpenRLHF, TRL, NeMo-RL, SLIME) reduces to the same six-role pipeline. Hover a role to see its job; you'll meet each one in turn.
Part I · The framework (lessons 01–08 · the six roles around the loss)
old_logp must be captured at sampling time.
response_mask = 0 and what breaks if they don't.
Part II · The algorithms (lessons 09–14 · what changes when you swap the loss)
Six algorithms, one shared verifiable task. Each is a one-line patch on its predecessor — read them in order and you can recover any modern reasoning-RL recipe from REINFORCE by adding or removing features one at a time.
                 variance                  no critic
REINFORCE ─────────────────────▶ PPO ─────────────────────▶ GRPO
               (V_φ + clip)               (group mean)       │
                                                             │
                                     unbiased baseline       │
                                RLOO ◀───────────────────────┤
                                                             │
                                     stability fixes         ▼
                                DAPO ◀──────── clip-higher,
                                               dynamic sampling,
                                               token-level loss,
                                               overlong shaping
                                     bias fixes
                             Dr.GRPO ◀───────── no /std, no /|y|
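To make the "one-line patch" idea in the diagram concrete, here is a minimal sketch of the two places where these algorithms differ most visibly: how the advantage is computed from a group of rewards, and how the policy ratio is clipped. It assumes NumPy, one scalar reward per sampled response, and per-token log-probabilities for a single response. All function names are hypothetical and are not taken from the files in RL/algorithms/. The clip ranges are illustrative defaults, not values stated in the lessons.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style: subtract the group mean, then divide by the group std."""
    r = np.asarray(rewards, dtype=float)
    # Assumption: population std (ddof=0); implementations differ on this.
    return (r - r.mean()) / (r.std() + eps)

def dr_grpo_advantages(rewards):
    """Dr.GRPO-style: same group-mean baseline, but no division by std."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def rloo_advantages(rewards):
    """RLOO-style: the baseline for sample i is the mean of the other samples."""
    r = np.asarray(rewards, dtype=float)
    g = r.size
    baseline = (r.sum() - r) / (g - 1)  # leave-one-out mean
    return r - baseline

def clipped_surrogate(logp, old_logp, adv, mask, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped loss averaged over unmasked response tokens.

    Setting eps_high > eps_low gives a "clip-higher" variant in the DAPO spirit.
    """
    ratio = np.exp(logp - old_logp)  # old_logp comes from sampling time
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = np.minimum(ratio * adv, clipped * adv)
    return -(per_token * mask).sum() / mask.sum()

rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. a verifier's pass/fail for 4 samples
print(grpo_advantages(rewards))
print(dr_grpo_advantages(rewards))
print(rloo_advantages(rewards))
```

Each function differs from its neighbor by roughly one line, which is the point of reading the algorithms in order. The leave-one-out baseline is the group-mean baseline rescaled by G/(G−1), so the RLOO and Dr.GRPO advantages differ only by that constant factor.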
Part III · Production (lessons 15–23 · from one box to a cluster · +18a data · +22a/b/c throughput)
The algorithms in Part II all assume a verifier and an in-process trainer. Part III is the rest of the picture: where the reward signal comes from when there is no verifier (RLHF, DPO), how to build verifiers that don't get hacked, how the cluster is wired, what the inference engine is doing under rollout.generate(), what fits on an H100, and how the famous recipes (R1, Tülu 3, Qwen, o-series) compose all of the above.
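As a rough illustration of the "what fits on an H100" question, the sketch below estimates trainer-side memory under common mixed-precision Adam assumptions: bf16 weights and gradients, an fp32 master copy, and two fp32 Adam moments, for about 16 bytes per parameter. The byte counts, the 80 GB capacity, and the function names are assumptions made for this example, not figures from the lessons. Activations, the KV cache, and any reference or reward model are deliberately left out.

```python
# Assumed per-parameter footprint for mixed-precision Adam training.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_grads": 2,
    "fp32_master_weights": 4,
    "fp32_adam_moments": 8,  # first and second moment, 4 bytes each
}

def trainer_state_gb(num_params: float) -> float:
    """Weights + grads + optimizer state in GB (1 GB = 1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9

def fits_on_device(num_params: float, capacity_gb: float = 80.0) -> bool:
    """True if the trainer state alone fits in the assumed device memory."""
    return trainer_state_gb(num_params) <= capacity_gb

for n in (1.5e9, 7e9):
    print(f"{n / 1e9:.1f}B params: {trainer_state_gb(n):.0f} GB, "
          f"fits on one 80 GB device: {fits_on_device(n)}")
```

Under these assumptions a 7B-parameter policy needs about 112 GB for trainer state alone, which is why sharding or offloading enters the picture before activations are even counted.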
Part IV · The engineer (lessons 24–25 · what an RL infra engineer owns)
Two lessons that step out of the architecture and into the role. Lesson 24 maps what an RL infra engineer is responsible for — the five layers, the four communication paths, the correctness traps that distinguish a senior. Lesson 25 catalogs the six kernel surfaces those engineers ship against, with the ROI ordering of which kernel to write first.
Part V · Synthesis (lessons 26–28 · stepping back from the trees)
The first 25 lessons build the system bottom-up. Part V steps back and asks three meta-questions: why the field landed on its current shape, what's missing from the curriculum, and how to make the framework fast from a bottleneck-and-optimization perspective. Read these last if you want the bird's-eye view; read them first if you're new and want a map before diving in.
How to use this
- Linearly. Each lesson assumes the previous one. The widgets are calibrated so the surprise of lesson n is visible only after lesson n−1.
- Touch every knob. Every interactive widget has at least one configuration that breaks training. Find it. The bugs are the lesson.
- Open the code. Each lesson links the corresponding file under rl_framework/. The lessons explain why; the code is what.
RL/framework/. Part II's algorithms live in RL/algorithms/ — each lesson 09–14 corresponds to a single Python file you can run standalone. The full pipeline (SFT → CoT → DPO → RLVR) is one level up in gpt_mini/.