
Orientation — the map before the trees

If you're new to RL, the curriculum below can look like a parade of acronyms. This page is the map before the trees: what RL post-training actually is, the three forces that shape every design choice, and how the 28 lessons that follow fit together. Read once, then dive into lesson 01.

What RL post-training actually is

A pretrained LLM gives you a policy π_θ: feed it a prompt, it samples a response. Supervised fine-tuning (SFT) teaches it by imitation: "here's a prompt, here's the right response, copy it." RL is what you reach for when you can check whether a response is good but can't easily label the perfect one.

The loop is brutally simple:

repeat:
    sample some responses from π_θ        # rollout
    score each response                   # reward / verifier
    push up the log-prob of high scorers  # gradient step
    keep π_θ close to the original π_ref  # KL anchor

Every algorithm in this curriculum is a variation on those four lines. Every system pattern is a different way of running that loop on a real cluster.
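To make the four lines concrete, here is a minimal sketch on a toy problem. It is not any lesson's implementation: the "policy" is assumed to be a softmax over four canned responses, the verifier is a hypothetical exact-match check, and the update is plain REINFORCE with a batch-mean baseline plus an analytic KL penalty. All names (`run_toy_loop`, `good`, `beta`) are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def run_toy_loop(num_responses=4, good=2, steps=200, batch=16,
                 lr=0.5, beta=0.05, seed=0):
    """Toy assumption: pi_theta is a softmax over a few canned responses."""
    rng = random.Random(seed)
    theta = [0.0] * num_responses   # logits of pi_theta
    ref = softmax(theta)            # frozen pi_ref (the initial policy)
    for _ in range(steps):
        probs = softmax(theta)
        # 1. rollout: sample a batch of responses from pi_theta
        samples = rng.choices(range(num_responses), weights=probs, k=batch)
        # 2. score: a toy verifier that accepts exactly one response
        rewards = [1.0 if s == good else 0.0 for s in samples]
        baseline = sum(rewards) / batch
        # 3. gradient step: push up log-prob of above-baseline samples
        #    (for a softmax, d log p(s) / d theta_j = 1[j == s] - p_j)
        grad = [0.0] * num_responses
        for s, r in zip(samples, rewards):
            for j in range(num_responses):
                dlogp = (1.0 if j == s else 0.0) - probs[j]
                grad[j] += (r - baseline) * dlogp / batch
        # 4. KL anchor: subtract beta * d KL(pi_theta || pi_ref) / d theta_j
        kl = kl_divergence(probs, ref)
        for j in range(num_responses):
            grad[j] -= beta * probs[j] * (math.log(probs[j] / ref[j]) - kl)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta), ref


final_probs, ref_probs = run_toy_loop()
```

After training, most of the probability mass should sit on the accepted response, while `beta` controls how far `final_probs` is allowed to drift from `ref_probs`. A real system replaces each numbered step with a much heavier component, but the shape of the loop stays the same.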

The three forces — why the field has the shape it has

[Figure: the three forces]

  SIGNAL: "How do we know it's good?" Humans → RMs → verifiers. (Lessons 03 · 15 · 17 · 18)
  ESTIMATOR: "Low-variance, low-bias gradient?" Baselines, clip, KL anchor. (Lessons 04 · 09–14)
  COST: "What fits on the cluster?" Drop critic · disagg · async. (Lessons 05 · 06 · 19–25 · 28)

The three forces in causal order: better SIGNAL (a verifier) unlocks cheaper ESTIMATORS (drop the critic → GRPO), which unlock simpler COST (fewer model copies, smaller cluster). Every paper moves one of the three. Read each one by asking which.

The one sentence to take with you
Every algorithm and system pattern you'll meet is a point in signal × estimator × cost. When you see a new paper, ask which of the three it moves. The answer is almost always exactly one.
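As an illustration of how a better signal can make the estimator cheaper, the sketch below computes group-relative advantages in the style of GRPO: each response is compared against the other responses sampled for the same prompt, so no learned critic is needed. The function name, the `eps` guard, and the use of the population standard deviation are assumptions for this example; implementations differ on those details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each response relative to its own sampling group.

    `rewards` holds verifier scores for several responses to ONE prompt.
    The group mean plays the role a learned critic would otherwise play.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to the same prompt, scored 0/1 by a verifier.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response gets the same score, the group carries no learning signal.
uniform_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct responses in `mixed_group` get positive advantages and incorrect ones negative, while `uniform_group` is all zeros. The trade is explicit: several rollouts per prompt (more sampling cost) in exchange for dropping a whole critic model (less memory and training cost).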

What each part of the curriculum teaches

Part | Lessons | Question it answers
I · Framework | 01–08 | What are the six roles that surround the loss? (Rollout, reference, algorithm, trainer, weight-sync, controller, and how agentic RL adds masking.)
II · Algorithms | 09–14 | What changes when you swap the loss? Each algorithm is a one-line patch on its predecessor: REINFORCE → PPO → GRPO → RLOO → DAPO → Dr. GRPO.
III · Production | 15–23 | Where does reward come from when there's no verifier? How is the cluster wired? What does the rollout engine actually do? Which recipes do labs publish, and how do they compose?
IV · Engineer | 24–25 | What does an RL infra engineer actually own day-to-day, and which kernel surfaces do they ship against?
V · Synthesis | 26–28 | Why does the field look like this in 2026? What's missing from the curriculum? Where do throughput bottlenecks land and how do you diagnose them?
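To give a feel for what a "one-line patch" can look like, here is a hedged sketch of the first step in the Part II chain, REINFORCE → PPO, for a single sample. `ratio` stands for π_θ(response) / π_old(response) and `advantage` for the centered score; both function names and the default `clip_eps=0.2` are illustrative choices, not taken from the lessons.

```python
def reinforce_surrogate(ratio, advantage):
    """Importance-weighted policy-gradient objective for one sample."""
    return ratio * advantage


def ppo_clip_surrogate(ratio, advantage, clip_eps=0.2):
    """Same objective, patched with PPO's clipped ratio.

    Taking the min of the clipped and unclipped terms removes the incentive
    to move the ratio outside [1 - clip_eps, 1 + clip_eps] in the direction
    the advantage favors.
    """
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


# A good response whose probability has already grown by 50%:
unclipped = reinforce_surrogate(1.5, 1.0)   # keeps rewarding further growth
clipped = ppo_clip_surrogate(1.5, 1.0)      # capped at 1 + clip_eps
```

The two functions differ only in the return expression, which is the sense in which one algorithm patches another. Later steps in the chain change other single pieces, such as how the advantage itself is computed.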

Two ways to read this

  1. Linear (~4.5 hours). Lesson 01 onward. Each lesson assumes the previous; the widgets are calibrated so the surprise of lesson n is visible only after lesson n−1. This is the recommended path if you're new.
  2. Targeted. Use the table above. Working on rollout? Read 02 + 20. Picking an algorithm? Read 04 + 11 + 14. Sizing a cluster? Read 19 + 22 + 28.

The minimal acronym glossary (for the first three lessons)

Everything else is defined as you meet it. Onward.