Orientation — the map before the trees
If you're new to RL, the curriculum below can look like a parade of acronyms. This page is the map before the trees: what RL post-training actually is, the three forces that shape every design choice, and how the 28 lessons that follow fit together. Read once, then dive into lesson 01.
What RL post-training actually is
A pretrained LLM gives you a policy πθ: feed it a prompt, it samples a response. Supervised fine-tuning (SFT) teaches it by imitation: "here's a prompt, here's the right response, copy it". RL is what you reach for when you can check whether a response is good but can't easily label the perfect one.
The loop is brutally simple:
    repeat:
        sample some responses from π_θ           # rollout
        score each response                      # reward / verifier
        push up the log-prob of high scorers     # gradient step
        keep π_θ close to the original π_ref     # KL anchor
Every algorithm in this curriculum is a variation on those four lines. Every system pattern is a different way of running that loop on a real cluster.
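To make the four lines concrete, here is a minimal sketch of the loop on a toy problem. It is not how any lesson implements it. The assumptions: the "policy" is a softmax over four hypothetical candidate answers rather than an LLM, the verifier is an exact string match, the baseline is the mean reward of the batch, and the names `RESPONSES`, `verifier`, and `train` plus the values of `k`, `lr`, and `beta` are made up for illustration.

```python
import math
import random

# Hypothetical candidate answers to the prompt "2 + 2 = ?".
RESPONSES = ["3", "4", "5", "22"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def verifier(response):
    # Toy verifier: reward 1.0 for the correct answer, 0.0 otherwise.
    return 1.0 if response == "4" else 0.0


def kl(p, q):
    # KL(p || q) between two categorical distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def train(steps=300, k=8, lr=0.5, beta=0.05, seed=0):
    rng = random.Random(seed)
    ref_logits = [0.0] * len(RESPONSES)  # frozen pi_ref (uniform here)
    logits = list(ref_logits)            # pi_theta starts at pi_ref
    q = softmax(ref_logits)
    n = len(logits)
    for _ in range(steps):
        p = softmax(logits)
        # 1) rollout: sample k responses from pi_theta
        samples = rng.choices(range(n), weights=p, k=k)
        # 2) score each response
        rewards = [verifier(RESPONSES[a]) for a in samples]
        baseline = sum(rewards) / k
        # 3) gradient step: raise log-prob of above-average samples.
        #    For a softmax, d log p(a) / d logit_j = 1[j == a] - p_j.
        grad = [0.0] * n
        for a, r in zip(samples, rewards):
            adv = r - baseline
            for j in range(n):
                grad[j] += adv * ((1.0 if j == a else 0.0) - p[j]) / k
        # 4) KL anchor: subtract beta * d KL(pi_theta || pi_ref) / d logit_j,
        #    which for a softmax is p_j * (log(p_j / q_j) - KL).
        d = kl(p, q)
        for j in range(n):
            grad[j] -= beta * p[j] * (math.log(p[j] / q[j]) - d)
        logits = [x + lr * g for x, g in zip(logits, grad)]
    final = softmax(logits)
    return final, kl(final, q)


probs, drift = train()
for resp, prob in zip(RESPONSES, probs):
    print(f"{resp!r}: {prob:.3f}")
print(f"KL(pi_theta || pi_ref) = {drift:.3f}")
```

Raising `beta` in this sketch pulls the final distribution back toward the uniform reference; setting it to zero removes the anchor entirely.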
The three forces — why the field has the shape it has
What each part of the curriculum teaches
| Part | Lessons | Question it answers |
|---|---|---|
| I · Framework | 01–08 | What are the six roles that surround the loss? (Rollout, reference, algorithm, trainer, weight-sync, controller — and how agentic adds masking.) |
| II · Algorithms | 09–14 | What changes when you swap the loss? Each algorithm is a one-line patch on its predecessor: REINFORCE → PPO → GRPO → RLOO → DAPO → Dr.GRPO. |
| III · Production | 15–23 | Where does reward come from when there's no verifier? How is the cluster wired? What does the rollout engine actually do? Which recipes do labs publish, and how do they compose? |
| IV · Engineer | 24–25 | What does an RL infra engineer actually own day-to-day, and which kernel surfaces do they ship against? |
| V · Synthesis | 26–28 | Why does the field look like this in 2026? What's missing from the curriculum? Where do throughput bottlenecks land and how do you diagnose them? |
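The "one-line patch" framing in the Part II row can be previewed by looking at how some of these methods turn the K rewards for one prompt into advantages. The sketch below is a simplification under stated assumptions: the function names are hypothetical, GRPO is shown with a population standard deviation and a small `eps` (implementations differ on both), and PPO and DAPO are left out because PPO uses a learned critic and DAPO's changes, as commonly described, sit mainly in clipping and sampling rather than in the baseline.

```python
from statistics import mean, pstdev


def reinforce_adv(rewards):
    # Plain REINFORCE with no baseline: the reward is the weight.
    return list(rewards)


def rloo_adv(rewards):
    # Leave-one-out: each sample's baseline is the mean of the other K-1.
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def grpo_adv(rewards, eps=1e-6):
    # Group-relative: subtract the group mean, divide by the group std.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_adv(rewards):
    # Dr.GRPO-style: keep the group mean, drop the std normalization.
    mu = mean(rewards)
    return [r - mu for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]  # K = 4 verifier scores for one prompt
for name, fn in [("REINFORCE", reinforce_adv), ("RLOO", rloo_adv),
                 ("GRPO", grpo_adv), ("Dr.GRPO", dr_grpo_adv)]:
    print(name, [round(a, 3) for a in fn(rewards)])
```

One thing the sketch makes visible: the leave-one-out advantage equals the mean-subtracted advantage scaled by K/(K−1), so the two baselines differ only by a constant factor for a fixed group size.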
Two ways to read this
- Linear (~4.5 hours). Lesson 01 onward. Each lesson assumes the previous; the widgets are calibrated so the surprise of lesson n is visible only after lesson n−1. This is the recommended path if you're new.
- Targeted. Use the table above. Working on rollout? Read 02 + 20. Picking an algorithm? Read 04 + 11 + 14. Sizing a cluster? Read 19 + 22 + 28.
The minimal acronym glossary (for the first three lessons)
- πθ — the policy you're training. πref — the frozen SFT checkpoint you anchor to.
- Rollout — sampling from πθ. K rollouts — K independent samples per prompt.
- RLHF — reward from human preferences via a learned reward model.
- RLVR — reward from a verifier (math equality, code tests). The 2024+ default.
- KL anchor — penalty for πθ drifting far from πref. It limits how far the policy can move toward reward hacking; it does not rule it out.
- GRPO / DAPO / Dr.GRPO — the post-PPO algorithm family. All are variants of "PPO without the critic, with a group-relative baseline".
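Two of the glossary entries can be illustrated in a few lines. The sketch below assumes a hypothetical `math_verifier` that treats two answers as equal when they parse to the same rational number, and a hypothetical `kl_penalized_reward` that folds the anchor into the reward using the single-sample estimate log πθ(y) − log πref(y). The value of `beta` and the log-probabilities in the usage lines are invented for illustration; real verifiers and KL estimators vary.

```python
from fractions import Fraction


def math_verifier(response: str, reference: str) -> float:
    """RLVR-style reward: 1.0 if both strings parse to the same number."""
    try:
        same = Fraction(response.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Unparseable or malformed answers earn no reward.
        return 0.0
    return 1.0 if same else 0.0


def kl_penalized_reward(reward, logp_theta, logp_ref, beta=0.1):
    """Subtract beta times a one-sample KL estimate from the reward."""
    return reward - beta * (logp_theta - logp_ref)


# "0.5" and "1/2" are the same number, so the verifier accepts the answer.
r = math_verifier("0.5", "1/2")
# Made-up sequence log-probs: the policy likes this response more than
# the reference does, so the penalty lowers the reward slightly.
shaped = kl_penalized_reward(r, logp_theta=-2.0, logp_ref=-3.0, beta=0.1)
print(r, shaped)
```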
Everything else is defined as you meet it. Onward.