Orientation — the map before the trees

If you're new to RL, the curriculum below can look like a parade of acronyms. This page is the map before the trees: what RL post-training actually is, the three forces that shape every design choice, and how the 28 lessons that follow fit together. Read once, then dive into lesson 01.

What RL post-training actually is

A pretrained LLM gives you a policy π_θ: feed it a prompt, it samples a response. Supervised fine-tuning (SFT) teaches it by imitation: "here's a prompt, here's the right response, copy it". RL is what you reach for when you can check whether a response is good but can't easily label the perfect one.

The loop is brutally simple:

repeat:
    sample some responses from π_θ        # rollout
    score each response                   # reward / verifier
    push up the log-prob of high scorers  # gradient step
    keep π_θ close to the original π_ref  # KL anchor

Every algorithm in this curriculum is a variation on those four lines. Every system pattern is a different way of running that loop on a real cluster.
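
To make the four lines concrete, here is a minimal toy sketch, not any lesson's actual implementation. It assumes a policy that is just a softmax over four fixed candidate responses to one prompt, a verifier that rewards the string "4", a batch-mean baseline, and an exact KL(π_θ ‖ π_ref) penalty. The names `softmax`, `verify`, and `train`, and every hyperparameter value, are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verify(response):
    """Toy verifier for the prompt '2 + 2 = ?': reward 1 for '4', else 0."""
    return 1.0 if response == "4" else 0.0


def train(steps=300, k=8, lr=0.5, beta=0.05, seed=0):
    rng = random.Random(seed)
    responses = ["3", "4", "5", "22"]      # the only outputs this toy policy can emit
    n = len(responses)
    ref_logits = [0.0] * n                 # frozen pi_ref (uniform, by assumption)
    logits = list(ref_logits)              # pi_theta starts as a copy of pi_ref

    for _ in range(steps):
        probs = softmax(logits)
        ref_probs = softmax(ref_logits)

        # 1. rollout: sample K responses from pi_theta
        samples = rng.choices(range(n), weights=probs, k=k)

        # 2. score each response with the verifier
        rewards = [verify(responses[i]) for i in samples]
        baseline = sum(rewards) / k        # batch-mean baseline (an assumption)

        # 3. push up the log-prob of above-baseline samples.
        #    For a softmax policy, d log pi(i) / d logit_j = 1[i == j] - probs[j].
        grad = [0.0] * n
        for i, r in zip(samples, rewards):
            adv = r - baseline
            for j in range(n):
                grad[j] += adv * ((1.0 if i == j else 0.0) - probs[j]) / k

        # 4. KL anchor: subtract beta * d KL(pi_theta || pi_ref) / d logit_j,
        #    which for a softmax is probs[j] * (log(probs[j] / ref_probs[j]) - KL).
        kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs))
        for j in range(n):
            grad[j] -= beta * probs[j] * (math.log(probs[j] / ref_probs[j]) - kl)

        logits = [z + lr * g for z, g in zip(logits, grad)]

    return dict(zip(responses, softmax(logits)))


if __name__ == "__main__":
    print(train())
```

A real LLM replaces the four-entry logit table with a network and the exact KL with a sampled estimate, but the shape of the loop is the same: sample, score, step, anchor.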

The three forces — why the field has the shape it has

The three forces are signal, estimator, and cost.

The one sentence to take with you

Every algorithm and system pattern you'll meet is a point in signal × estimator × cost. When you see a new paper, ask which of the three axes it moves along. The answer is almost always exactly one.

What each part of the curriculum teaches

Part	Lessons	Question it answers
I · Framework	01–08	What are the six roles that surround the loss? (Rollout, reference, algorithm, trainer, weight-sync, controller — and how agentic adds masking.)
II · Algorithms	09–14	What changes when you swap the loss? Each algorithm is a one-line patch on its predecessor: REINFORCE → PPO → GRPO → RLOO → DAPO → Dr.GRPO.
III · Production	15–23	Where does reward come from when there's no verifier? How is the cluster wired? What does the rollout engine actually do? Which recipes do labs publish, and how do they compose?
IV · Engineer	24–25	What does an RL infra engineer actually own day-to-day, and which kernel surfaces do they ship against?
V · Synthesis	26–28	Why does the field look like this in 2026? What's missing from the curriculum? Where do throughput bottlenecks land and how do you diagnose them?

Two ways to read this

Linear (~4.5 hours). Lesson 01 onward. Each lesson assumes the previous; the widgets are calibrated so the surprise of lesson n is visible only after lesson n−1. This is the recommended path if you're new.
Targeted. Use the table above. Working on rollout? Read 02 + 20. Picking an algorithm? Read 04 + 11 + 14. Sizing a cluster? Read 19 + 22 + 28.

The minimal acronym glossary (for the first three lessons)

π_θ — the policy you're training. π_ref — the frozen SFT checkpoint you anchor to.
Rollout — sampling from π_θ. K rollouts — K independent samples per prompt.
RLHF — reward from human preferences via a learned reward model.
RLVR — reward from a verifier (math equality, code tests). The 2024+ default.
KL anchor — penalty for π_θ drifting far from π_ref. Limits how far the policy can drift to exploit flaws in the reward (reward hacking).
GRPO / DAPO / Dr.GRPO — the post-PPO algorithm family. All are variants of "PPO without the critic, with a group-relative baseline".
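
As a rough illustration of what "group-relative baseline" means, the sketch below scores each of the K rollouts for one prompt against the mean reward of its own group, so no learned critic is needed. The function name `group_relative_advantages`, the `normalize_std` flag, the `eps` value, and the use of the population standard deviation are all assumptions made for this example; the lessons define the exact variants.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=True, eps=1e-6):
    """Advantage of each rollout relative to its own group of K rollouts.

    rewards: the K scalar rewards for K rollouts of the same prompt.
    normalize_std: if True, also divide by the group's standard deviation.
    eps: small constant that avoids division by zero when all rewards match.
    """
    baseline = mean(rewards)                      # the group mean acts as the baseline
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    scale = pstdev(rewards) + eps                 # population std (an assumption)
    return [c / scale for c in centered]


if __name__ == "__main__":
    # Two correct and two incorrect rollouts for one prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Every rollout correct: the group carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

When every rollout in a group gets the same reward, all advantages are zero and that prompt contributes nothing to the gradient. Setting `normalize_std=False` gives mean-only centering, which is how Dr.GRPO is usually described; treat that mapping as a pointer to lesson 14 rather than a definition.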

Everything else is defined as you meet it. Onward.