Orientation — the map before the trees
Eighty-seven lessons, one chain. Read this once; it tells you what the chain is, why it is ordered the way it is, and where you are at any moment.
Reinforcement learning is one idea pursued at three altitudes. The algorithm altitude asks: given an agent that acts and a signal that scores it, how do you improve the policy? The systems altitude asks: when the agent is a large language model and the loop runs across a GPU cluster, how do you build the machine that trains it? The application altitude asks: given a real problem — a recommender, a robot, a power grid — how do you turn it into an RL problem at all, and what makes it hard? This series walks all three, bottom-up, so each altitude rests on the one below.
The single thread
Underneath every lesson is the Markov decision process: a set of states, a set of actions, a reward function, and a transition function. Everything else is a response to one difficulty: you only ever see samples, never the true expectation, so every method is a way to estimate a gradient or a value from noisy experience and act on it without diverging. Hold on to that and the whole series reads as variations on a theme.
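The gap between samples and the true expectation can be seen in a few lines. The sketch below is not taken from any lesson in the series: it assumes a made-up one-state MDP under a fixed policy, where each step ends the episode with reward 1 with probability 0.5 and otherwise pays 0 and continues. The names `step`, `sample_return`, and `monte_carlo_value`, the discount `GAMMA = 0.9`, and the episode counts are all illustrative choices. Because the model is this small, the exact value follows from the Bellman equation and can be compared with the Monte Carlo estimate.

```python
import random

GAMMA = 0.9  # discount factor (assumed for this toy example)


def step(rng):
    """Sample one transition from the single state; returns (reward, done).

    Hypothetical dynamics: with probability 0.5 the episode ends with
    reward 1, otherwise the agent receives 0 and stays in the same state.
    """
    if rng.random() < 0.5:
        return 1.0, True
    return 0.0, False


def sample_return(rng):
    """Roll out one episode and return its discounted return."""
    total, discount = 0.0, 1.0
    while True:
        reward, done = step(rng)
        total += discount * reward
        if done:
            return total
        discount *= GAMMA


def monte_carlo_value(n_episodes, seed=0):
    """Estimate the state value as the average of sampled returns."""
    rng = random.Random(seed)
    return sum(sample_return(rng) for _ in range(n_episodes)) / n_episodes


# Exact value from the Bellman equation V = 0.5 * 1 + 0.5 * GAMMA * V.
TRUE_VALUE = 0.5 / (1.0 - 0.5 * GAMMA)

if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        estimate = monte_carlo_value(n)
        error = abs(estimate - TRUE_VALUE)
        print(f"{n:>7} episodes: estimate={estimate:.4f}  error={error:.4f}")
```

With few episodes the estimate is typically well off the exact value, and it tightens as the sample count grows. In a real problem the exact value is not available, which is why the rest of the chain is about estimating well from limited, noisy experience.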
The four parts
- Part I · Foundations (01–21) — the theory and algorithms, MDP → value/policy methods → planning & exploration → policy-gradient lineage → the LLM-era map → imitation, continuous control, and offline RL.
- Part II · Post-Training Systems (22–53) — the engineering of reasoning-model training: the system roles, the algorithm family (REINFORCE → Dr.GRPO), recipes, environments and data, cluster topology, and the throughput math.
- Part III · Applications — concepts (54–66) — recommendation, robotics, finance, scheduling, NLP, and vision, each mapped back to the method it reuses; plus platforms and the outlook.
- Part IV · Applications — engineering (67–86) — twenty applied domains run through the five-step loop, from games and driving to energy, networks, medicine, and epidemics.
- Closing (87) — the through-line in one paragraph and a self-test.
How to read it
- Linearly. Part I's policy gradient is Part II's REINFORCE; Part II's GRPO is the algorithm Part III's NLP and Part IV's domains assume; the trust region from lesson 13 reappears as PPO's clip everywhere after. Skipping forward means importing terms you haven't earned.
- By altitude, if you must. Here for the algorithms? Read Part I and stop. Building infrastructure? Skim Part I's policy branch, then live in Part II. Designing for a specific domain? Find it in Part IV and backtrack to the Part I lessons it leans on.
- Decompose every method. Value or policy? On-policy or off-policy? What is the reward signal, what is the variance-reduction trick, and what is the binding constraint on cost? Most methods in the literature amount to one choice along each of these axes.
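As one concrete instance of the first reading note, here is a minimal sketch of the clipped surrogate objective commonly associated with PPO, the form in which the trust-region idea reappears. It is an illustration under stated assumptions, not the series' own implementation: the function name `ppo_clip_objective`, the plain-list inputs, and the default `clip_eps=0.2` are choices made for this example only.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective, to be maximized.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current policy and under the policy that collected the data.
    advantages: advantage estimates for the same actions.
    clip_eps: half-width of the allowed ratio band (assumed default).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms, so
        # there is no extra reward for moving the ratio outside the band.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * adv.
    print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))
    # Ratio 0.5 with a negative advantage: the term is held at 0.8 * adv.
    print(ppo_clip_objective([math.log(0.5)], [0.0], [-1.0]))
```

Run through the checklist in the last bullet, this objective belongs to a policy method that is on-policy up to a few reuse epochs, its learning signal is the advantage estimate, and the clip keeps each update close to the data-collecting policy.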