Reinforcement Learning, From First Principles
One linearized track from the Markov decision process to the systems that train today's reasoning models and the domains that deploy them. Every lesson assumes only the ones before it; every concept is justified from scratch; every trade-off comes with quantitative reasoning.
The series runs in four parts: the foundations (the theory and algorithms), the post-training systems (the engineering that turns those algorithms into reasoning-model training at cluster scale), and two parts on applications — first the domains as concepts mapped back to the core method, then twenty applied domains as production engineering. Read it straight through, or start at the orientation for the map.
Part I · Foundations (lessons 01–21)
The theory, from the Markov decision process to offline RL. The environment contract and reward design; the value branch (Q-learning → DQN) and the policy branch (policy gradient → Actor–Critic); planning and exploration; the policy-gradient lineage up to TRPO/PPO; the LLM-era algorithms (PPO, DPO, GRPO, RLHF) as a map; and the frontier — imitation/inverse RL, continuous control, and offline RL (BCQ, CQL).
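As a taste of where the value branch starts, here is a minimal sketch of an environment contract and a tabular Q-learning loop. The `ChainEnv` toy problem, the `reset`/`step` method names, and every hyperparameter value are assumptions made for illustration, not the series' own code.

```python
import random


class ChainEnv:
    """Hypothetical 5-state chain: start in state 0, reward 1.0 on reaching state 4."""

    n_states = 5
    n_actions = 2  # 0 = move left, 1 = move right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        delta = 1 if action == 1 else -1
        # Clamp to the ends of the chain.
        self.state = min(max(self.state + delta, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1,
               max_steps=100, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative values)."""
    rng = random.Random(seed)
    q = [[0.0] * env.n_actions for _ in range(env.n_states)]
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(env.n_actions)
            else:
                # Greedy action, breaking ties at random.
                best = max(q[s])
                a = rng.choice([i for i, v in enumerate(q[s]) if v == best])
            s_next, r, done = env.step(a)
            # Bootstrapped target: r + gamma * max_a' Q(s', a'), or just r at termination.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
            if done:
                break
    return q


q_table = q_learning(ChainEnv())
# Greedy action in each non-terminal state.
greedy_policy = [max(range(ChainEnv.n_actions), key=lambda a: q_table[s][a])
                 for s in range(ChainEnv.n_states - 1)]
```

The environment only exposes `reset` and `step`; the learner never sees the transition rule directly, which is the separation the environment contract is meant to enforce.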
Part II · Post-Training Systems (lessons 22–53)
The engineering behind modern reasoning-model training, built bottom-up. The system roles (rollout, reward/reference, trainer, weight sync, controller, agentic); the algorithm family REINFORCE → PPO → GRPO → RLOO → DAPO → Dr. GRPO; RLHF/DPO recipes, PRM/search, environments and data pipelines; cluster topology, the KV cache, scheduling and throughput math; famous recipes, the infra-engineer role, and bottleneck diagnosis.
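To give a sense of how the members of that algorithm family differ, here is a hedged sketch of the group-relative advantage as GRPO is commonly described: several responses are sampled for one prompt, and each reward is centered on the group mean and divided by the group standard deviation. The function name, the `eps` constant, and the `normalize_std` switch are assumptions for illustration. Turning the switch off gives the mean-only centering usually attributed to Dr. GRPO, which as commonly described also changes length normalization (not shown here).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6, normalize_std=True):
    """Advantages for one prompt's group of sampled responses (illustrative sketch)."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize_std:
        # Mean-centering only, with no division by the group spread.
        return centered
    # Population standard deviation; eps guards against a zero-variance group.
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
adv_normalized = group_relative_advantages(rewards)
adv_centered = group_relative_advantages(rewards, normalize_std=False)
```

Because the baseline is computed from the group itself, no learned value network is needed; that is the main systems-level contrast with PPO's critic.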
Part III · Applications — concepts (lessons 54–66)
Each application mapped back to the core method it reuses: recommendation, robotics, finance, resource scheduling, NLP, and vision — then the platforms and tools, and the frontier outlook.
Part IV · Applications — engineering (lessons 67–86)
Twenty applied domains, each run through one loop — formulate the MDP, diagnose the binding difficulty, engineer the mechanism that removes it, guard it in production. Games and autonomous driving, trading and recommendation, robotics and UAVs, energy grids and networks, manufacturing and scheduling, medicine and epidemics.
Closing (lesson 87)
The through-line in one paragraph, and an interactive self-test spanning the whole series.