A linearized tour of reinforcement learning — from the Markov decision process to reasoning-model post-training and beyond. Built so each idea is the smallest possible patch on the one before it.
This series of thirty-two interactive lessons builds RL from scratch. Foundations (01–05) sets up the Markov decision process and the one fork that organizes the whole field: you can learn values or learn the policy — and they keep reconverging. Advanced (06–18) scales that fork with deep networks, makes policy-gradient updates trustworthy (the lineage that runs REINFORCE → Actor-Critic → TRPO → PPO), reuses it for language models (PPO/DPO/GRPO/RLHF), then pushes to the frontiers (imitation, inverse RL, continuous control, offline RL). Applications (19–31) maps each domain — recommendation, robotics, finance, scheduling, NLP, vision, platforms — back to the core method it reuses. Each lesson has one interactive widget so you can grab a knob and feel the consequence.
Who this is for
You can read Python and you know what a neural network is, but RL is new territory. By the end you'll be able to read any modern RL paper, place it on one map (value × policy × model), and say which limitation of which earlier method it is trying to fix.
New here? Start with the map
Read 00 · Orientation first — a 5-minute map of what RL is, the value-vs-policy fork that runs through every lesson, and how the 31 lessons fit together. Then start lesson 01.
Sibling course — the systems side
This is the theory course: classical RL → modern RL. Its sibling, RL Post-Training, From First Principles, is the systems course: how PPO/GRPO/RLHF actually run on a GPU cluster. Lessons 11–13 here hand off to it. Read this one first; read that one when you want to build the loop.
The one fork you're learning
Every RL method is a way to solve a Markov decision process. There are two routes — learn how good things are (value), or learn what to do (policy) — and the most powerful methods combine them. Hold this picture; every lesson is a node on it.
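To make the two routes concrete, here is a minimal sketch in Python. The 2-state, 2-action MDP, the names NEXT, REWARD, GAMMA, value_route and policy_route, and the uniform start distribution are all invented for illustration; none of them comes from the lessons. The value route runs value iteration and reads the policy off greedily. The policy route does exact gradient ascent on softmax logits, and it needs a value estimate to form that gradient, which is one small instance of the two routes reconverging.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, invented for this sketch.
# NEXT[s, a] is the (deterministic) next state, REWARD[s, a] the reward.
NEXT = np.array([[0, 1], [0, 1]])
REWARD = np.array([[0.0, 1.0], [0.0, 2.0]])
GAMMA = 0.9


def q_values(V):
    """One-step lookahead: Q[s, a] = r(s, a) + gamma * V[next state]."""
    return REWARD + GAMMA * V[NEXT]


def value_route(n_iters=200):
    """Learn how good things are, then act greedily."""
    V = np.zeros(2)
    for _ in range(n_iters):
        V = q_values(V).max(axis=1)      # Bellman optimality backup
    return q_values(V).argmax(axis=1)    # greedy policy read off the values


def policy_route(n_iters=200, lr=0.5):
    """Learn what to do directly: exact gradient ascent on softmax logits."""
    theta = np.zeros((2, 2))
    start = np.array([0.5, 0.5])         # assumed uniform start-state distribution
    for _ in range(n_iters):
        pi = np.exp(theta - theta.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
        # State-to-state transition matrix and expected reward under pi.
        P_pi = np.zeros((2, 2))
        for s in range(2):
            for a in range(2):
                P_pi[s, NEXT[s, a]] += pi[s, a]
        r_pi = (pi * REWARD).sum(axis=1)
        # Policy evaluation, used here only to form the exact gradient.
        V = np.linalg.solve(np.eye(2) - GAMMA * P_pi, r_pi)
        advantage = q_values(V) - V[:, None]
        # Discounted state-visitation weights under pi.
        visit = np.linalg.solve((np.eye(2) - GAMMA * P_pi).T, start)
        theta += lr * visit[:, None] * pi * advantage
    return theta.argmax(axis=1)


print(value_route(), policy_route())
```

On this toy problem both routes settle on action 1 in both states. The agreement is the point: the two routes are different ways of solving the same MDP.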
The sentence to take with you
Value-based methods learn how good a state or action is and read off the policy by greedy choice; policy-based methods learn what to do directly. Actor–Critic uses the value (the critic) to lower the variance of the policy update (the actor). Almost every modern algorithm (TRPO, PPO, SAC) is an Actor–Critic with one safety device bolted on; GRPO follows the same recipe but swaps the learned critic for a group-average baseline.
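The variance claim can be checked exactly on a small case. The sketch below assumes a hypothetical 3-armed bandit with a fixed softmax policy PI and deterministic rewards REWARD (both made up for illustration). It enumerates every action, so the mean and variance of the score-function gradient estimate are computed in closed form rather than sampled. The baseline used is the exact value PI @ REWARD; in an actual Actor–Critic a learned critic would supply an estimate of it.

```python
import numpy as np

# Hypothetical 3-armed bandit: action probabilities and deterministic rewards.
PI = np.array([0.2, 0.3, 0.5])
REWARD = np.array([10.0, 11.0, 12.0])


def grad_stats(baseline):
    """Exact mean and total variance of the estimator
    grad log pi(a) * (r(a) - baseline), taken over a ~ PI."""
    # For softmax logits, row a of `score` is grad log pi(a) = onehot(a) - PI.
    score = np.eye(3) - PI
    samples = score * (REWARD - baseline)[:, None]   # one gradient sample per action
    mean = PI @ samples                              # expected gradient
    total_var = PI @ ((samples - mean) ** 2).sum(axis=1)
    return mean, total_var


mean_plain, var_plain = grad_stats(0.0)              # actor only, no baseline
mean_base, var_base = grad_stats(PI @ REWARD)        # value of the policy as baseline

print("same expected gradient:", np.allclose(mean_plain, mean_base))
print("variance without baseline:", var_plain)
print("variance with baseline:", var_base)
```

Subtracting the baseline leaves the expected gradient unchanged, because the score function averages to zero under the policy. What shrinks is the spread of individual samples, since each one is now weighted by how much better or worse the reward was than expected instead of by the raw reward.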
Part I · Foundations (01–05 · the MDP and the two ways to solve it)
Part II · Advanced (06–18 · scaling, trust regions, the LLM era, the frontiers)
Foundations assumed small problems and a reward that simply appears. Part II scales the fork with deep networks, makes the policy update trustworthy (the REINFORCE → Actor-Critic → TRPO → PPO lineage), reuses that machinery for language models, then pushes past the assumptions: where does the reward come from, what about continuous actions, and what if you can't interact at all?
Read linearly. Each lesson assumes the previous one. The widgets are calibrated so that the surprise of lesson n is visible only once you have done lesson n−1.
Touch every knob. Every widget has at least one setting that breaks learning. Find it — the bug is the lesson.
Place every paper on the map. When you finish, you should be able to take any RL paper and say which of value / policy / model it moves, and which earlier limitation it fixes.