Reinforcement Learning: From MDPs to Frontiers

A linearized tour of reinforcement learning — from the Markov decision process to reasoning-model post-training and beyond. Built so each idea is the smallest possible patch on the one before it.

This series of thirty-two interactive lessons builds RL from scratch. Foundations (01–05) sets up the Markov decision process and the one fork that organizes the whole field: you can learn values or learn the policy — and they keep reconverging. Advanced (06–18) scales that fork with deep networks, makes policy-gradient updates trustworthy (the lineage that runs REINFORCE → Actor-Critic → TRPO → PPO), reuses it for language models (PPO/DPO/GRPO/RLHF), then pushes to the frontiers (imitation, inverse RL, continuous control, offline RL). Applications (19–31) maps each domain — recommendation, robotics, finance, scheduling, NLP, vision, platforms — back to the core method it reuses. Each lesson has one interactive widget so you can grab a knob and feel the consequence.

Who this is for

You can read Python and you know what a neural network is, but RL is new territory. By the end you'll be able to read any modern RL paper, place it on one map (value × policy × model), and say which limitation of which earlier method it is trying to fix.

New here? Start with the map

Read 00 · Orientation first — a 5-minute map of what RL is, the value-vs-policy fork that runs through every lesson, and how the 31 lessons fit together. Then start lesson 01.

Sibling course — the systems side

This is the theory course: classical RL → modern RL. Its sibling, RL Post-Training, From First Principles, is the systems course: how PPO/GRPO/RLHF actually run on a GPU cluster. Lessons 11–13 here hand off to it. Read this one first; read that one when you want to build the loop.

The one fork you're learning

Every RL method is a way to solve a Markov decision process. There are two routes — learn how good things are (value), or learn what to do (policy) — and the most powerful methods combine them. Hold this picture; every lesson is a node on it.

The sentence to take with you

Value-based methods learn how good each state or action is and read off the policy by greedy choice; policy-based methods learn what to do directly. Actor–Critic uses the value (the critic) to lower the variance of the policy update (the actor). Almost every modern algorithm (TRPO, PPO, SAC) is an Actor–Critic with one safety device bolted on; GRPO keeps the safety device but swaps the learned critic for a group-mean baseline.

Part I · Foundations (01–05 · the MDP and the two ways to solve it)

RL overview — the MDP & the agent–environment loop

State, action, reward, transition, discount. Return and value. The Markov property. The goal — an optimal policy — and the fork that splits the rest of the course: learn values, or learn the policy.
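As a small illustration of "return", the sketch below computes the discounted return of one finite episode. The reward list and the discount value are made-up numbers, not taken from the lesson.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum over t of gamma**t * rewards[t] for one finite episode."""
    g = 0.0
    # Walk backwards so each step applies the recursion G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # illustrative episode (assumption)
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The value of a state is the expectation of this quantity over episodes that start there and follow the policy.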

Value-based — from Q-learning to DQN

Bellman optimality removes the policy via a max. TD learning, off-policy bootstrap, the greedy policy. Then the deep version: DQN = Q-learning + neural net + replay + target net, and the "deadly triad" (function approximation, bootstrapping, off-policy data) that makes it hard.
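One way to picture the TD update with a max-based bootstrap is the tabular sketch below. The state names, action names, and hyperparameters are hypothetical; the lesson itself may present it differently.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning step in place and return the TD error."""
    # The max over next actions evaluates the greedy policy, no matter how
    # action `a` was actually chosen. That is what makes the update off-policy.
    bootstrap = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    td_error = r + gamma * bootstrap - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)          # unseen (state, action) pairs start at 0
actions = ["left", "right"]     # hypothetical action set
Q[("s1", "right")] = 1.0
delta = q_learning_update(Q, "s0", "right", r=0.5, s_next="s1",
                          actions=actions, alpha=0.5, gamma=0.9)
```

DQN keeps the same target but replaces the table with a network, samples transitions from a replay buffer, and computes the bootstrap term with a slowly updated target network.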

Policy-based — from policy gradient to Actor–Critic

When argmax over actions breaks, parameterize the policy. The policy-gradient theorem; REINFORCE; why it's noisy; subtract a baseline; use the value as the baseline → Actor–Critic, where the two branches reconverge.
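A minimal sketch of the REINFORCE gradient for a softmax policy over a few discrete actions, with an optional baseline. The single-state setup and the function names are assumptions chosen for brevity.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_grad(logits, action, ret, baseline=0.0):
    """Single-sample gradient estimate of expected return w.r.t. the logits."""
    probs = softmax(logits)
    # Score function of a softmax policy:
    # d log pi(action) / d logit_k = 1[k == action] - pi_k
    return [(ret - baseline) * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]


print(reinforce_grad([0.0, 0.0], action=0, ret=2.0))                # [1.0, -1.0]
print(reinforce_grad([0.0, 0.0], action=0, ret=2.0, baseline=2.0))  # [0.0, -0.0]
```

Subtracting the baseline rescales each sample without changing the expected gradient; when the baseline is a learned value estimate, this becomes the Actor–Critic update.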

Model & planning — from dynamic programming to MCTS

If you know the MDP you can plan, not sample. Policy/value iteration as the ground truth TD approximates; the curse of dimensionality; Monte Carlo Tree Search and UCT; how AlphaZero fuses value, policy, and search.

Exploration vs exploitation — bandits to Thompson sampling

Where does the data come from? Strip the MDP to one state — the multi-armed bandit. Regret, ε-greedy, UCB optimism (the same bonus as UCT), and posterior sampling. Contextual bandits as the bridge to recommendation.
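To make "optimism" concrete, here is a sketch of UCB1 on a Bernoulli bandit. The arm probabilities, step count, and seed are arbitrary illustrative values.

```python
import math
import random


def ucb1_choose(counts, means, t, c=2.0):
    """Pick the arm with the highest mean plus exploration bonus."""
    # Play every arm once before trusting any estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [means[i] + math.sqrt(c * math.log(t) / counts[i])
              for i in range(len(counts))]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(probs, steps, seed=0):
    """Run UCB1 on Bernoulli arms and return per-arm pull counts and means."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    means = [0.0] * len(probs)
    for t in range(1, steps + 1):
        arm = ucb1_choose(counts, means, t)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


counts, means = run_bandit([0.2, 0.8], steps=2000)
```

The bonus shrinks as an arm is pulled more often, so rarely tried arms keep getting a second look. UCT applies the same kind of bonus at each node of a search tree.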

Part II · Advanced (06–18 · scaling, trust regions, the LLM era, the frontiers)

Foundations assumed small problems and a reward that simply appears. Part II scales the fork with deep networks, makes the policy update trustworthy (the REINFORCE → Actor-Critic → TRPO → PPO lineage), reuses that machinery for language models, then pushes past the assumptions: where does the reward come from, what about continuous actions, and what if you can't interact at all?

Deep RL — from DQN to A3C

Scale both branches with neural nets. Double/dueling/prioritized DQN in one line each; then parallel on-policy actors (A3C/A2C) where decorrelated experience replaces the replay buffer.

Policy gradient, rigorously

Prove the policy-gradient theorem. The score function, reward-to-go, why baselines don't bias the gradient, and the variance decomposition that makes variance the villain of the next three lessons.

Advantage functions — Actor-Critic, GAE, the road to TRPO

The advantage A = Q − V, the TD-error as a one-step estimate, n-step returns, and GAE(λ) as the bias–variance dial. Why raw gradient steps can be catastrophic — and why we need a trust region.
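The bias–variance dial can be sketched in a few lines. This version assumes a single trajectory and a `values` list with one extra entry for the state after the last step (0 if the episode terminated); those conventions are assumptions, not necessarily the lesson's.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    `values` must have len(rewards) + 1 entries; the last one is the
    bootstrap value of the final state.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: the lowest-variance, highest-bias advantage estimate.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5, 0.0]
print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))  # [1.0, 0.5]
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))  # [1.5, 0.5]
```

With `lam=0` each advantage is just the TD error; with `lam=1` it is the Monte Carlo return minus the value estimate. Values in between trade bias for variance.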

Importance sampling — on-policy vs off-policy

Reuse data from an old policy by reweighting with the ratio ρ = π/π_old. On- vs off-policy made precise; why a large ratio explodes the variance; why that forces π to stay near π_old.
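The variance claim can be checked exactly on a tiny discrete example. The sketch below computes the mean and variance of the reweighted estimator ρ·f when samples come from π_old; the distributions are made up for illustration.

```python
def is_mean_and_variance(pi, pi_old, f):
    """Exact mean and variance of rho(x) * f(x) with x drawn from pi_old.

    `pi`, `pi_old`, and `f` are lists indexed by the same discrete outcomes.
    """
    weighted = [(p / q) * fx for p, q, fx in zip(pi, pi_old, f)]
    mean = sum(q * w for q, w in zip(pi_old, weighted))
    second_moment = sum(q * w * w for q, w in zip(pi_old, weighted))
    return mean, second_moment - mean ** 2


f = [1.0, 1.0]
print(is_mean_and_variance([0.5, 0.5], [0.5, 0.5], f))  # same policy: variance 0
print(is_mean_and_variance([0.5, 0.5], [0.9, 0.1], f))  # far policy: variance about 1.78
```

The mean stays correct in both cases, but once π_old rarely takes an action that π favors, the ratio on that action is large and dominates the variance. Keeping π near π_old keeps the ratios near 1.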

TRPO — natural gradient, the KL trust region → PPO

The surrogate objective, the performance-difference lemma, the total-variation→KL bound, and the natural gradient. Then why PPO replaced TRPO's hard constraint with a cheap clipped ratio.
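The clipped ratio is short enough to write out for a single sample. This is a sketch of the per-sample clipped surrogate term from PPO; the default `eps=0.2` is a commonly used value and an assumption here.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)            # rho = pi / pi_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip rho to [1 - eps, 1 + eps]
    # Taking the min removes any incentive to push rho outside the clip
    # range in the direction that would increase the objective.
    return min(ratio * advantage, clipped * advantage)


logp_old = 0.0
print(ppo_clip_term(math.log(1.5), logp_old, advantage=1.0))   # capped near 1.2
print(ppo_clip_term(math.log(1.5), logp_old, advantage=-1.0))  # not capped: near -1.5
```

The asymmetry is the point: gains from moving far from π_old are capped, while losses are still counted in full.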

The LLM era (Part 1) — PPO, DPO, GRPO: the map

When the "actions" are tokens, the same TRPO/PPO machinery trains a language model. PPO's clipped objective as cheap TRPO; the KL-to-reference anchor; and a map of DPO and GRPO. Hands off to the systems course.

The LLM era (Part 2) — DPO & GRPO, derived

DPO's closed form: the optimal KL-regularized policy makes the reward a log-ratio, plug into Bradley–Terry, watch the partition function cancel — RL without RL. GRPO: drop the critic, use the group mean as the baseline.
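Both results reduce to a few lines once the sequence log-probabilities are available. The sketch below assumes they are already summed over tokens and passed in as plain floats. Dividing by the group standard deviation follows the GRPO paper and goes slightly beyond the group-mean baseline described above.

```python
import math
from statistics import mean, pstdev


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair of sequence log-probabilities."""
    # Implicit reward is beta * log(pi / pi_ref). The partition function is the
    # same for both responses to a prompt, so it cancels in the difference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Bradley-Terry negative log-likelihood: -log sigmoid(margin).
    return math.log1p(math.exp(-margin))


def grpo_advantages(group_rewards, eps=1e-8):
    """Critic-free advantages: center on the group mean, scale by group std."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


print(dpo_loss(-10.0, -10.0, -10.0, -10.0))  # no preference yet: log 2
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

In both cases no value network is trained: DPO folds the reward into the policy's own log-ratio, and GRPO replaces the critic with statistics of several sampled responses to the same prompt.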

RLHF — the post-training workflow

The end-to-end recipe: SFT → reward model (Bradley–Terry) → PPO with a KL anchor. Reward hacking and the anchor's two jobs; RLHF vs RLVR vs RLAIF; where DPO short-circuits the loop.
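The two scalar quantities at the heart of the recipe can be sketched as follows. The function names and the `beta` default are hypothetical, and the per-sample log-ratio stands in as a single-sample estimate of the KL term.

```python
import math


def bradley_terry_loss(score_chosen, score_rejected):
    """Reward-model loss for one pair: -log P(chosen beats rejected)."""
    # Bradley-Terry: P(chosen > rejected) = sigmoid(score_chosen - score_rejected)
    return math.log1p(math.exp(-(score_chosen - score_rejected)))


def kl_anchored_reward(rm_score, logp_policy, logp_ref, beta=0.05):
    """Reward-model score minus a penalty for drifting from the reference."""
    return rm_score - beta * (logp_policy - logp_ref)


print(bradley_terry_loss(0.0, 0.0))                           # log 2: model is undecided
print(kl_anchored_reward(1.0, logp_policy=-1.0, logp_ref=-2.0, beta=0.5))  # 0.5
```

The penalty grows when the policy assigns much more probability to a response than the reference does, which is how the anchor limits reward hacking.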

Frontier — imitation learning & inverse RL

What if there's no reward, only expert demonstrations? Behavioral cloning and its compounding error; DAgger; inverse RL recovers the reward; GAIL as a GAN where the discriminator is the learned reward.

Frontier — discrete to continuous control

When actions are torque vectors, argmax breaks. The deterministic policy gradient → DDPG → TD3's twin critics (fixing overestimation) → SAC's maximum-entropy RL and the reparameterization trick.

Frontier — offline (batch) RL

Learning from a fixed dataset with no new interaction. The core failure — distributional shift: bootstrapping queries Q at out-of-distribution actions and the error compounds. Three families of fixes.

Offline RL — BCQ

Constrain the policy to the data. A generative model of the behavior policy proposes only in-support actions; the policy maximizes Q over those. The policy-constraint pole of offline RL.
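For discrete actions, the "only in-support actions" idea can be approximated by thresholding the estimated behavior probabilities. This is a simplified sketch in the spirit of discrete BCQ rather than the generative-model version described above; `tau` and the example numbers are assumptions.

```python
def bcq_select(q_values, behavior_probs, tau=0.3):
    """Pick the highest-Q action among those the behavior policy plausibly took."""
    top = max(behavior_probs)
    # Keep an action only if its probability is at least a fraction tau
    # of the most likely action's probability under the behavior policy.
    allowed = [a for a, p in enumerate(behavior_probs) if p / top > tau]
    return max(allowed, key=lambda a: q_values[a])


q_values = [1.0, 5.0, 2.0]           # action 1 looks best but is nearly unseen
behavior_probs = [0.60, 0.01, 0.39]
print(bcq_select(q_values, behavior_probs))  # 2
```

Action 1 has the highest Q-value, but that estimate rests on almost no data, so the filter removes it before the argmax.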

Offline RL — CQL

Constrain the value instead. Push Q down on out-of-distribution actions so the policy is never fooled into them — a conservative lower bound. BCQ vs CQL: the two poles of offline RL.
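For a discrete action space, the conservative term is often written as a log-sum-exp over all actions minus the Q-value of the action in the dataset. The sketch below shows only that regularizer for a single state, under that assumption; a full CQL loss adds it to the usual TD loss with a weight.

```python
import math


def cql_penalty(q_values, data_action):
    """Conservative regularizer for one state: logsumexp(Q) - Q(data action)."""
    m = max(q_values)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[data_action]


print(cql_penalty([0.0, 0.0], data_action=0))   # log 2
print(cql_penalty([0.0, 10.0], data_action=0))  # large: an unseen action has high Q
```

Minimizing this term pushes Q down on actions that look good but are not in the data, and up on the actions that are.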

Part III · Applications (19–31 · each domain mapped back to the method it reuses)

The same handful of ideas, applied. Every lesson opens by naming which core lesson it reuses and closes by placing the domain back on the map.

Recommendation systems (Part 1) — personalized recommendation

Recommendation as a contextual bandit (reuses lesson 05): user context is the state, the item is the action, the click is the reward. Cold-start exploration and the feedback-loop trap.

Recommendation & advertising (Part 2)

From one item to a slate and a session (full MDP): long-term value vs greedy CTR, ad bidding as a constrained MDP, and off-policy evaluation because you can't A/B everything (reuses lessons 07, 09).

Robot control (Part 1) — manipulators

Robot control as a continuous-action MDP (reuses lesson 15). Reward shaping, and the first look at the simulation gap.

Robot control (Part 2) — sim-to-real & sample efficiency

Domain randomization, model-based RL for sample efficiency, and safe exploration (reuses lessons 04, 15, 16).

Robot control (Part 3) — autonomous driving

Why pure online RL is unsafe on the road → imitation + offline RL + inverse RL for the reward; multi-agent interaction (reuses lessons 14, 16–18).

Financial trading (Part 1) — stock trading

Trading as a POMDP: prices are a partial observation of the market state, buy/hold/sell is the action, and PnL minus cost is the reward. Non-stationarity, transaction costs, and backtest overfitting (reuses lessons 01, 05).

Financial trading (Part 2) — portfolio optimization

Portfolio weights as a continuous action; risk-adjusted reward; why offline and conservative methods matter when the money is real (reuses lessons 15, 16–18).

Resource scheduling — cloud computing to logistics

Job scheduling, bin-packing, and vehicle routing as sequential decisions over huge discrete action spaces; policy gradient for combinatorial optimization (reuses lesson 07).

NLP (Part 1) — machine translation

Text generation as a token-action MDP. Why cross-entropy ≠ the eval metric → REINFORCE on a sequence-level reward like BLEU; exposure bias (reuses lesson 07).

NLP (Part 2) — dialogue systems

Dialogue as multi-turn RL; RLHF for assistants. This is where this course meets the systems course — the bridge to RL post-training (reuses lessons 11–13).

Computer vision — detection to image generation

RL for non-differentiable decisions: hard attention and detection; aligning image generators with a reward (DDPO); the GAN↔RL connection from lesson 14.

Platforms & tools — from OpenAI Gym to Ray

The Gym reset/step API that turns lesson 01's loop into code; vectorized envs; RLlib/Ray as the A3C idea productionized; Stable-Baselines3, CleanRL.
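The loop looks roughly like the sketch below. To stay self-contained it uses a hypothetical toy environment, `CoinFlipEnv`, instead of a real library environment. The method signatures follow the five-value `step` convention of Gymnasium, the maintained fork of Gym; older Gym versions returned four values with a single `done` flag.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the coin shown in the observation."""

    def __init__(self, horizon=5):
        self.horizon = horizon

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin, {}  # (observation, info)

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        terminated = False                # no natural terminal state here
        truncated = self.t >= self.horizon  # time limit ends the episode
        return self.coin, reward, terminated, truncated, {}


def run_episode(env, policy, seed=0):
    """The agent-environment loop: observe, act, receive reward, repeat."""
    obs, info = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


print(run_episode(CoinFlipEnv(horizon=5), policy=lambda obs: obs))  # 5.0
```

Any environment exposing the same `reset` and `step` methods can be dropped into `run_episode` unchanged, which is the reason the API became a common interface for RL libraries.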

Frontiers & outlook — from AGI to human-AI collaboration

RL in reasoning models and agents; the open problems (sample efficiency, reward specification, safety, generalization, multi-agent); world models; human-AI collaboration.

Finale

Closing & self-test

A one-paragraph recap of the whole arc, what to read next, and an interactive ten-question self-test scored in your browser.

How to use this

Linearly. Each lesson assumes the previous one. The widgets are calibrated so the surprise of lesson n is visible only after lesson n−1.
Touch every knob. Every widget has at least one setting that breaks learning. Find it — the bug is the lesson.
Place every paper on the map. When you finish, you should be able to take any RL paper and say which of value / policy / model it moves, and which earlier limitation it fixes.