RL Post-Training, From First Principles

A linearized tour of the system and the algorithms behind modern reasoning models — built so you understand why each piece exists before you see the code.

This series of thirty-two interactive lessons builds modern RL post-training up from scratch: 28 numbered lessons plus four lettered supplements (18a, 22a, 22b, 22c). Part I (lessons 01–08) covers the framework: the six roles that surround the loss in any real post-training pipeline. Part II (lessons 09–14) covers the algorithms: REINFORCE through Dr.GRPO, each one a one-line patch on its predecessor. Part III (lessons 15–23, plus the supplements) covers production: the RLHF / DPO lineage, environments and verifiers, system topology, inference engines, memory math, and the famous recipes (R1, Tülu 3, Qwen, o-series) that combine them. Part IV (lessons 24–25) covers the engineer's perspective: what an RL infra engineer actually owns, and the six kernel surfaces they ship against. Part V (lessons 26–28) is a synthesis: the three forces driving the field, the concepts the curriculum doesn't yet cover, and the bottleneck / optimization / diagnosis playbook for the framework. Each lesson has one interactive widget so you can grab a knob and feel the consequence.

Who this is for

You can read Python and you know what a neural network is, but RL post-training (PPO, GRPO, RLHF, RLVR) is new territory. By the end you'll be able to point at any line in RL/framework/ or RL/algorithms/ and say what it is and why.

New to RL? Start here

Read 00 · Orientation first — a 3-minute map of what RL post-training is, the three forces that shape every design choice, and how the 28 lessons that follow fit together. Then dive into lesson 01.

The system you're learning

Every modern RL post-training system (verl, OpenRLHF, TRL, NeMo-RL, SLIME) reduces to the same six-role pipeline. Hover a role to see its job; you'll meet each one in turn.

Part I · The framework (lessons 01–08 · the six roles around the loss)

What is post-training RL?

Why supervised fine-tuning isn't enough. The minimal loop: sample → score → update. The math of the policy gradient, with intuition first.

Rollout — sampling and π_old

What an autoregressive rollout actually is. Why we sample K trajectories per prompt. Why old_logp must be captured at sampling time.

Reward & Reference — verifier + KL anchor

Verifiable rewards (RLVR) vs. reward models (RLHF). Why we need a frozen reference and why the KL term keeps the policy honest.

Algorithm — the plugin interface

The two-method contract (compute_advantages + compute_loss) that every algorithm in Part II implements. A preview of the advantage-assignment widget; the math lives in lessons 09–14.
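
As a rough illustration of what such a two-method contract could look like, here is a minimal sketch that uses plain Python lists instead of tensors. Only the method names compute_advantages and compute_loss come from the lesson; the signatures, the Algorithm and MeanBaseline classes, and trainer_step are hypothetical.

```python
from typing import List, Protocol, Sequence


class Algorithm(Protocol):
    """Hypothetical plugin contract; only the two method names come from the lesson."""

    def compute_advantages(self, rewards: Sequence[float]) -> List[float]: ...

    def compute_loss(self, logp: Sequence[float], advantages: Sequence[float]) -> float: ...


class MeanBaseline:
    """Illustrative plugin: batch-mean baseline with a REINFORCE-style loss."""

    def compute_advantages(self, rewards):
        mean = sum(rewards) / len(rewards)
        return [r - mean for r in rewards]

    def compute_loss(self, logp, advantages):
        # Minimizing -A * log pi raises the probability of above-average samples.
        return -sum(a * lp for a, lp in zip(advantages, logp)) / len(logp)


def trainer_step(algo: Algorithm, rewards, logp) -> float:
    """The caller depends only on the contract, never on the concrete algorithm."""
    advantages = algo.compute_advantages(rewards)
    return algo.compute_loss(logp, advantages)


loss = trainer_step(MeanBaseline(), [1.0, 0.0, 0.0, 1.0], [-1.0, -2.0, -3.0, -5.0])
```

Swapping algorithms then means passing a different object to trainer_step; nothing else in the loop changes.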

Trainer — where gradients flow

One forward, one backward, one step. The trainer is algorithm-agnostic; the loss is a callable. Putting old_logp, ref_logp, advantage together into the loss.

Weight sync — closing the loop

Why the trainer's fresh weights must reach the rollout engine every step. What happens when sync is stale. Three deployment patterns.

Controller — the orchestrator

The seven-step loop in one place. Walk through one full training iteration end to end, with each role lighting up as it runs.

Agentic RL — multi-turn + tool masking

When the model uses tools, the trajectory has tokens the model did not generate. Why those tokens get response_mask = 0 and what breaks if they don't.
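
A small sketch of the masking idea, assuming the loss is a per-token average. The name response_mask comes from the lesson; masked_mean and the numbers are made up for illustration.

```python
def masked_mean(values, mask):
    """Average only the positions where mask == 1."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / count if count else 0.0


# Hypothetical trajectory: two model tokens, two tool-output tokens, one model token.
per_token_loss = [0.2, 0.4, 5.0, 7.0, 0.6]
response_mask = [1, 1, 0, 0, 1]  # 0 marks tokens the model did not generate

masked = masked_mean(per_token_loss, response_mask)
unmasked = masked_mean(per_token_loss, [1] * len(per_token_loss))
```

In this toy case the unmasked average is dominated by the tool-output tokens, so the gradient would mostly push the policy to predict text it never produced.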

Part II · The algorithms (lessons 09–14 · what changes when you swap the loss)

Six algorithms, one shared verifiable task. Each is a one-line patch on its predecessor — read them in order and you can recover any modern reasoning-RL recipe from REINFORCE by adding or removing features one at a time.

                    variance                   no critic
REINFORCE ─────────────────────▶  PPO  ─────────────────────▶  GRPO
                      (V_φ + clip)                  (group mean)      │
                                                                       │
                                                  unbiased baseline    │
                                          RLOO ◀───────────────────────┤
                                                                       │
                                                  stability fixes      ▼
                                                  DAPO ◀──────── clip-higher,
                                                                dynamic sampling,
                                                                token-level loss,
                                                                overlong shaping

                                                  bias fixes
                                                  Dr.GRPO ◀───────── no /std, no /|y|

REINFORCE — the starting point

The policy-gradient theorem in one derivation. Why R · ∇log π is structurally noisy. Every algorithm in this part is a variance-reduction patch on this line.
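
To make the noise concrete, the sketch below enumerates a two-action softmax policy and computes the exact mean and variance of the single-sample estimator (R − b) · ∇log π for the first parameter. The bandit setup, the reward values, and the baseline b are assumptions chosen for illustration.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(pi, a):
    # d/d theta_k of log softmax(theta)[a] = 1[a == k] - pi[k]
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(pi))]


def estimator_stats(theta, rewards, baseline=0.0):
    """Exact mean and variance of (R - b) * d log pi / d theta_0 over all actions."""
    pi = softmax(theta)
    samples = [(rewards[a] - baseline) * grad_log_pi(pi, a)[0] for a in range(len(pi))]
    mean = sum(p * s for p, s in zip(pi, samples))
    var = sum(p * (s - mean) ** 2 for p, s in zip(pi, samples))
    return mean, var


theta = [0.0, 0.0]
rewards = [11.0, 10.0]  # hypothetical: large offset, small difference
mean_raw, var_raw = estimator_stats(theta, rewards)
mean_base, var_base = estimator_stats(theta, rewards, baseline=10.5)
```

Both estimators have the same mean, so the baseline does not change the expected gradient, but in this example subtracting it removes the variance entirely.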

PPO — the classical RLHF workhorse

Three patches: a learned value head as baseline, importance-sampling for rollout re-use, and a clipped pessimistic surrogate. With the canonical clip-table widget.
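
The clipped pessimistic surrogate can be sketched for a single token as follows. The function name is hypothetical and eps=0.2 is an assumed, commonly used clip range rather than a value taken from the lesson.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO objective (to be maximized): min of raw and clipped terms."""
    ratio = math.exp(logp_new - logp_old)  # importance-sampling ratio
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min makes the objective pessimistic: the policy gets no
    # extra credit for moving the ratio past the clip range.
    return min(ratio * advantage, clipped * advantage)


# Ratio 1.5 with a positive advantage is capped at 1.2 * A.
up_good = ppo_clip_term(math.log(1.5), 0.0, 1.0)
# Ratio 1.5 with a negative advantage is not capped, so the penalty stays in full.
up_bad = ppo_clip_term(math.log(1.5), 0.0, -1.0)
```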

GRPO — drop the critic

Use the group mean of K rollouts as a baseline instead of a learned value head. Memory drops by ~50%; everything else is PPO. The subtle O(1/K) bias.

RLOO — leave-one-out, unbiased

Use the mean of the other K−1 rollouts instead of all K. The algebra: RLOO ≈ GRPO × K/(K−1). Why Cohere also drops clip and std.
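
The K/(K−1) relation can be checked numerically. This sketch compares the leave-one-out advantage with the plain group-mean advantage (no std division); the function names and reward values are illustrative.

```python
def group_mean_advantages(rewards):
    """Baseline = mean of all K rewards in the group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def rloo_advantages(rewards):
    """Baseline for sample i = mean of the other K-1 rewards."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0, 1.0]  # hypothetical 0/1 verifier rewards
k = len(rewards)
group = group_mean_advantages(rewards)
loo = rloo_advantages(rewards)
scaled = [g * k / (k - 1) for g in group]  # matches loo element by element
```

Without the std divisor the two differ only by the constant factor K/(K−1), since r − (S − r)/(K−1) = K/(K−1) · (r − S/K) for S the sum of the group's rewards.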

DAPO — four practical fixes

Clip-higher (entropy collapse), dynamic sampling (degenerate groups), token-level loss (length bias), overlong soft penalty (false negatives). ~5 lines each.
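
Two of these fixes are easy to sketch in a few lines. The helper names and numbers below are hypothetical; the point is only how the aggregation choice changes the weight of long responses, and which groups carry no signal.

```python
def sample_level_loss(token_losses):
    """Mean over samples of each sample's per-token mean."""
    per_sample = [sum(t) / len(t) for t in token_losses]
    return sum(per_sample) / len(per_sample)


def token_level_loss(token_losses):
    """Mean over every token in the batch, regardless of which sample it is in."""
    total = sum(sum(t) for t in token_losses)
    count = sum(len(t) for t in token_losses)
    return total / count


def keep_group(rewards):
    """Dynamic-sampling filter: drop groups whose rewards are all identical."""
    return max(rewards) != min(rewards)


# One short response (2 tokens) and one long response (8 tokens).
batch = [[1.0, 1.0], [0.5] * 8]
per_sample = sample_level_loss(batch)  # each response counts equally
per_token = token_level_loss(batch)    # each token counts equally
```

Under sample-level averaging each token of the long response carries a quarter of the weight of a token in the short one; token-level averaging weights them equally. A group with identical rewards has zero advantage for every sample, so filtering it out avoids spending a batch slot on no gradient.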

Dr.GRPO — debiasing

Two divisors in GRPO secretly bias the gradient. Delete them. The result is the honest REINFORCE-with-group-baseline-and-PPO-clip estimator the math actually asks for.
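
One of the two divisors is the per-group standard deviation (the "no /std" in the diagram above). The sketch below, with made-up reward groups and hypothetical helper names, compares how much total advantage magnitude a rarely solved prompt receives relative to a half-solved one, with and without that division. The population std and the eps value are assumptions.

```python
def group_advantages(rewards, divide_by_std, eps=1e-6):
    k = len(rewards)
    mean = sum(rewards) / k
    centered = [r - mean for r in rewards]
    if not divide_by_std:
        return centered
    std = (sum(c * c for c in centered) / k) ** 0.5  # population std (assumption)
    return [c / (std + eps) for c in centered]


def total_magnitude(advantages):
    return sum(abs(a) for a in advantages)


hard_prompt = [1.0] + [0.0] * 7        # 1 of 8 rollouts correct
medium_prompt = [1.0] * 4 + [0.0] * 4  # 4 of 8 rollouts correct

ratio_with_std = (total_magnitude(group_advantages(hard_prompt, True))
                  / total_magnitude(group_advantages(medium_prompt, True)))
ratio_without_std = (total_magnitude(group_advantages(hard_prompt, False))
                     / total_magnitude(group_advantages(medium_prompt, False)))
```

In this example dividing by std raises the relative weight of the low-variance group, which is the kind of prompt-dependent reweighting the lesson describes as bias.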

Part III · Production (lessons 15–23 · from one box to a cluster · +18a data · +22a/b/c throughput)

The algorithms in Part II all assume a verifier and an in-process trainer. Part III is the rest of the picture: where the reward signal comes from when there is no verifier (RLHF, DPO), how to build verifiers that don't get hacked, how the cluster is wired, what the inference engine is doing under rollout.generate(), what fits on an H100, and how the famous recipes (R1, Tülu 3, Qwen, o-series) compose all of the above.

RLHF — the original recipe

SFT → Bradley–Terry reward model → PPO. Why three stages exist; the BT log-likelihood derivation with the live widget you can train; reward-hacking and the KL anchor's two distinct jobs.
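
Under Bradley–Terry, the probability that the chosen response beats the rejected one is the sigmoid of their reward difference, so the reward model minimizes the negative log of that sigmoid. A minimal, numerically stable sketch (the function name is hypothetical):

```python
import math


def bt_nll(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), written to avoid overflow."""
    d = r_chosen - r_rejected
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


tie = bt_nll(0.0, 0.0)      # equal rewards: loss is log 2
correct = bt_nll(2.0, 0.0)  # chosen scored higher: small loss
wrong = bt_nll(0.0, 2.0)    # chosen scored lower: large loss
```

Only the difference of the two rewards enters the loss, so a Bradley–Terry reward model is identified only up to an additive constant per prompt.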

DPO — RL without RL

The closed-form derivation: rewrite the optimal RL policy in terms of the reward, plug into Bradley–Terry, watch Z(x) cancel. Then the family — IPO, KTO, ORPO, SimPO — and when each beats DPO.
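
The steps named above can be written out as follows. The symbols β (KL strength), π_ref (frozen reference), and y_w / y_l (chosen / rejected response) are the standard notation and are assumed here, since this page does not define them.

```latex
\begin{align}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big(r(x,y)/\beta\big) \\
r(x,y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  + \beta \log Z(x) \\
P(y_w \succ y_l \mid x) &= \sigma\big(r(x,y_w) - r(x,y_l)\big)
  = \sigma\!\Big(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big) \\
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
\end{align}
```

The β log Z(x) term appears in both rewards and cancels in the difference, which is why the loss needs neither a reward model nor the partition function.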

PRM, ORM, & search

Outcome vs process reward models. How a PRM enables Best-of-N rerank, beam search, and MCTS over reasoning chains. Live comparison of greedy, BoN-ORM, BoN-PRM, and PRM-beam at fixed budget.

Environments & verifiers

The five archetypes — math, code, tool-use, web, LLM-as-judge — and what gets hacked in each. Goodhart's law has an RL flavor; the verifier is the task. Live reward-hacking demo.

18a · Data pipelines & curation

In RL, data isn't a target — it's a gradient-signal substrate. The pass-rate curve p(1−p) as the unifying axis; an eight-stage pipeline (source → license → verifier-check → dedup/decon → stratify → mix → online filter → recycle); the four data-driven failure modes. Live signal-density simulator.

System topology

Colocated, disaggregated, fully async. Rollout and training have opposite hardware profiles; how you wire them on the cluster decides everything. Live throughput simulator: pick R, T, sync.

KV cache & PagedAttention

Why decode is memory-bandwidth-bound. KV cache sizing under MHA / GQA / MQA. PagedAttention as virtual memory for KV blocks; RadixAttention's prefix tree. Live KV footprint + bandwidth sizer.
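
KV cache size follows from a simple product: two vectors (key and value) per layer, per KV head, per token. The sketch below uses a hypothetical 32-layer model with 32 query heads of dimension 128 and 2-byte elements; the function names are illustrative.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return per_token * seq_len * batch


# Same model shape, different attention variants (only the KV head count changes).
mha = kv_bytes_per_token(32, 32, 128)  # one KV head per query head
gqa = kv_bytes_per_token(32, 8, 128)   # 8 KV heads shared across query groups
mqa = kv_bytes_per_token(32, 1, 128)   # a single shared KV head

# GQA variant, 8 sequences of 4096 tokens each.
gqa_total = kv_cache_bytes(32, 8, 128, seq_len=4096, batch=8)
```

For this shape MHA needs 512 KiB per token, GQA with 8 KV heads needs 128 KiB, and MQA needs 16 KiB; the GQA batch above comes to 4 GiB.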

Scheduling tricks

Continuous batching, prefix caching, chunked prefill, speculative decoding. Each one independently a 1.5–3× throughput win on a paged-storage substrate. Live stack-the-tricks throughput simulator.

Memory & throughput math

Train = 18 bytes/param including optimizer state. Inference = 2 bytes/param + KV cache. Activation memory at long sequence length L. The "four copies" cost of an RL job. Live envelope sizer for arbitrary model + cluster shape.
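
As back-of-envelope arithmetic, the two per-parameter figures above turn into the following. The 7B parameter count and the 4 GB KV cache are hypothetical inputs, activations are ignored, and the function names are illustrative.

```python
GB = 10 ** 9


def training_bytes(n_params, bytes_per_param=18):
    """Weights, gradients, and optimizer state, using the 18 bytes/param figure."""
    return n_params * bytes_per_param


def inference_bytes(n_params, kv_cache_bytes=0, bytes_per_param=2):
    """Weights at 2 bytes/param plus whatever the KV cache needs."""
    return n_params * bytes_per_param + kv_cache_bytes


n_params = 7 * 10 ** 9  # hypothetical 7B-parameter model
train_gb = training_bytes(n_params) / GB
infer_gb = inference_bytes(n_params, kv_cache_bytes=4 * GB) / GB
```

Under these assumptions training state alone comes to 126 GB, so it would have to be sharded across devices, while the inference copy with a 4 GB cache comes to 18 GB.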

22a · The throughput equation

Derive τ_step from the controller loop: five terms (rollout, verifier, ref, train, sync) composed by topology. Per-role roofline; the "where is my wall-clock going" decision tree; an interactive per-role wall-clock model. The foundation for the two optimization lessons that follow.

22b · Long-tail rollouts — max-of-K, packing, dynamic K

τ_R is the max over K·B trajectories, not the mean. Why decode lengths are log-normal; the straggler-tax formula; three patches in compose order (sequence packing, length cap with shaped penalty, dynamic K); FP8 rollout and the log-prob mismatch trap. Live straggler-tax simulator.
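
A quick way to see the max-versus-mean gap is to draw decode lengths from a log-normal distribution and compare the longest trajectory with the average one. The ratio below is only an illustrative proxy and is not the lesson's straggler-tax formula; the distribution parameters, K, and B are arbitrary assumptions.

```python
import random


def max_over_mean(lengths):
    """How much longer the slowest trajectory is than the average one."""
    return max(lengths) / (sum(lengths) / len(lengths))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
K, B = 8, 32            # hypothetical rollouts per prompt and prompts per batch
lengths = [rng.lognormvariate(6.0, 0.8) for _ in range(K * B)]

ratio = max_over_mean(lengths)
```

If the rollout phase only finishes when the longest of the K·B trajectories does, a ratio well above 1 means most decode slots sit idle while waiting for the tail.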

22c · Async pipelining & weight sync

Collapse τ_step from sum-of-terms to max-of-terms by overlapping rollout, train, and sync. Versioned weights + IS correction; the "freshness wall" (Δ ≈ 4–8 trainer steps); the four sync ops (all-gather, cast, broadcast, reshard); five optimizations in compose order; pipeline-parallel × RL bubbles. Live Gantt-style topology simulator.
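
The sum-versus-max contrast can be sketched with five per-role timings. The numbers are invented, and the overlapped case is an idealized bound that assumes every role runs fully in parallel with no dependencies or stalls.

```python
def step_time_sequential(terms):
    """Roles run one after another: the step costs the sum of its terms."""
    return sum(terms.values())


def step_time_overlapped(terms):
    """Idealized full overlap: the step costs only its slowest term."""
    return max(terms.values())


# Hypothetical per-step timings in seconds for the five roles.
terms = {"rollout": 60.0, "verifier": 5.0, "ref": 8.0, "train": 20.0, "sync": 7.0}

sequential = step_time_sequential(terms)
overlapped = step_time_overlapped(terms)
bottleneck = max(terms, key=terms.get)
```

With these numbers the step drops from 100 s to 60 s, and any further gain has to come from the bottleneck role itself, here rollout.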

Famous recipes & failure modes

InstructGPT, R1-Zero, R1, Tülu 3, Qwen-2.5, the o-series sketch — each factored into (data, algorithm, reward, system, scale). A failure-mode taxonomy mapping every symptom to the lesson that explains it.

Part IV · The engineer (lessons 24–25 · what an RL infra engineer owns)

Two lessons that step out of the architecture and into the role. Lesson 24 maps what an RL infra engineer is responsible for — the five layers, the four communication paths, the correctness traps that distinguish a senior. Lesson 25 catalogs the six kernel surfaces those engineers ship against, with the ROI ordering of which kernel to write first.

RL infra engineer — the role

What the job actually is. Five layers (framework / environment / performance / correctness / operations), four communication paths, three reference architectures (colocated / disaggregated / async), the correctness traps that distinguish a senior, and the whiteboard question that defines the role.

Kernels for RL — six surfaces

Rollout / inference, log-prob matching, training (policy update), weight sync, memory & scheduling, agentic primitives. The first kernel you write (prefix-aware paged attention), the silent bug surface (log-prob mismatch), and the ROI hierarchy of optimizations.

Part V · Synthesis (lessons 26–28 · stepping back from the trees)

The first 25 lessons build the system bottom-up. Part V steps back and asks three meta-questions: why the field landed on its current shape, what's missing from the curriculum, and how to make the framework fast from a bottleneck-and-optimization perspective. Read these last if you want the bird's-eye view; read them first if you're new and want a map before diving in.

Why RL is shaping this way today

Three forces — signal, estimator, cost — push the field's design space. Every algorithm and every system pattern is a point in that triangle. Live three-forces simulator that recovers named recipes (R1, RLHF, Tülu 3) from extreme slider settings.

Missing concepts — what isn't yet covered

17 concepts the curriculum names but doesn't develop, including PRM training, RM internals, RLAIF, GAE (λ-returns), KL estimator choice, entropy regularization, off-policy correction, FP8, MoE, KV offload, checkpointing, and multi-LoRA. Each is mapped to where it would slot in.

Bottlenecks, optimizations, & how to find them

Where the wall-clock goes per role (rollout 60–80%, train 10–26%, ref/sync/algo ~10%), the per-role optimization menu, and the 6-step diagnostic playbook (wall-clock attribution → nvidia-smi → torch.profiler → memory peak → Nsight → ρ-histogram). Live optimization-budget allocator.

How to use this

Linearly. Each lesson assumes the previous one. The widgets are calibrated so the surprise of lesson n is visible only after lesson n−1.
Touch every knob. Every interactive widget has at least one configuration that breaks training. Find it. The bugs are the lesson.
Open the code. Each lesson links the corresponding file under RL/framework/ or RL/algorithms/. The lessons explain why; the code is what.

Companion code

Part I's framework lives in RL/framework/. Part II's algorithms live in RL/algorithms/ — each lesson 09–14 corresponds to a single Python file you can run standalone. The full pipeline (SFT → CoT → DPO → RLVR) is one level up in gpt_mini/.