Closing & self-test
Thirty-one lessons, one story. We close by telling that story in a single breath, pointing you to the systems course that picks up where we stop, and handing you a ten-question self-test to find the spots worth a second read.
The whole arc, as one story
Everything you learned descends from one move: switch from instructive feedback (the right label) to evaluative feedback (a scalar reward). That single change forces a world — the Markov decision process — and a goal: a policy π that maximizes expected discounted return. From there the course is the elaboration of one fork and its reunifications.
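As a reminder of what "discounted return" means in practice, here is a minimal sketch for a single finite episode. The function name `discounted_return` and the sample rewards are illustrative assumptions, not code from the course.

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_0 = sum over t of gamma**t * r_t for one finite episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical three-step episode: 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1.5
```

The policy's objective is the expectation of this quantity over trajectories; the sketch computes it for one sampled trajectory only.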
To solve an MDP you can learn a value ("how good is this state/action?") and act greedily, or learn the policy directly ("what should I do?"). The two branches keep reconverging — first in Actor–Critic, where a value function critiques a policy that learns. Read the rest as: scale the fork (deep RL), make its update quiet (variance reduction), make its update safe (trust regions), reuse it for language (the LLM era), source the reward when none is given (imitation / IRL / RLHF), drop discrete actions (continuous control), drop the simulator entirely (offline RL), and finally apply it.
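The fork can be made concrete for a single state. The sketch below assumes a small discrete action set; the Q-values and logits are made-up numbers, and the helper names are hypothetical.

```python
import math
import random


def greedy_action(q_values):
    """Value branch: the policy is implicit, an argmax over learned Q(s, a)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_policy(logits):
    """Policy branch: an explicit distribution pi(a | s) built from learned logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw an action index according to pi(a | s)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


q = [0.1, 0.7, 0.2]        # hypothetical Q(s, .) for one state
logits = [0.0, 1.0, 0.0]   # hypothetical policy logits for the same state

print(greedy_action(q))    # 1
print(sample_action(softmax_policy(logits)))  # a random index in {0, 1, 2}
```

In the value branch, what gets learned is `q`; in the policy branch, what gets learned is `logits`. Actor–Critic keeps both.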
| # | Stage of the story | The one idea | Lesson(s) |
|---|---|---|---|
| 1 | The world & the goal | MDP, return, Bellman; the value × policy fork | 01 |
| 2 | Value branch | Bellman optimality → Q-learning → DQN; the deadly triad | 02, 06 |
| 3 | Policy branch | policy gradient ∇J = 𝔼[G·∇log π]; baseline → advantage | 03, 07 |
| 4 | The reunion | Actor–Critic: policy that learns + value that critiques | 03, 08 |
| 5 | Know the model | dynamic programming → MCTS; model-based planning | 04 |
| 6 | Where data comes from | explore vs exploit; bandits, UCB, Thompson | 05 |
| 7 | Scale it | deep RL, parallel actors (A3C/A2C) | 06 |
| 8 | Quiet the update | variance reduction: reward-to-go, baselines, GAE(λ) | 07, 08 |
| 9 | Reuse the data | importance sampling, the ratio ρ = πθ/πold | 09 |
| 10 | Make the update safe | trust regions: TRPO's KL constraint → PPO's clip | 10 |
| 11 | The LLM era | PPO/DPO/GRPO; RLHF's KL-to-reference anchor | 11, 12, 13 |
| 12 | Source the reward | behavioral cloning → DAgger → inverse RL / GAIL | 14 |
| 13 | Continuous actions | DDPG → TD3 → SAC (deterministic / max-entropy) | 15 |
| 14 | Remove the simulator | offline RL: distributional shift; BCQ vs CQL | 16, 17, 18 |
| 15 | Apply it | recsys, robotics, finance, scheduling, NLP, vision, platforms | 19–31 |
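Rows 9 and 10 of the table fit in a few lines. The sketch below computes the per-sample clipped surrogate with the ratio ρ = πθ/πold, working from log-probabilities. The function name `ppo_clip_term` is hypothetical and `eps=0.2` is an assumed, commonly used clip range; a real implementation would average this term over a batch and maximize it.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)            # rho = pi_theta / pi_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # keep rho inside the trust region
    return min(ratio * advantage, clipped * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
print(ppo_clip_term(math.log(1.5), 0.0, advantage=1.0))

# Ratio 1.5 with a negative advantage: the min keeps the unclipped, more
# pessimistic value, so the penalty passes through untouched.
print(ppo_clip_term(math.log(1.5), 0.0, advantage=-1.0))
```

The outer `min` is what makes the bound one-sided: clipping removes the incentive to move further in a direction that already looks good, but never hides a move that looks bad.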
When you meet a new method, do not memorize its acronym. Locate it on three axes and you have already understood most of it:
- Value? — does it learn V or Q (a critic), and is the policy implicit (greedy) or explicit?
- Policy? — does it learn πθ directly, and how does it keep the update safe (baseline, advantage, KL/clip trust region)?
- Model? — does it know or learn the dynamics P, R and plan, or is it model-free?
DQN = value, no model, implicit policy. REINFORCE = policy, no value, no model. PPO = both (clipped policy + value critic), no model. AlphaZero = all three. GRPO = policy + a group-mean value surrogate, no critic network. DPO = a policy trained as if there were a reward, with the RL loop folded into a closed form. Same map, different corners.
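One way to make the map tangible is to encode the three axes as flags. This is only an illustrative data structure (the `Method` class and its field names are assumptions), and it covers just the four methods whose corners are unambiguous; GRPO and DPO are left out because their value-like component does not reduce to a yes/no flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    learns_value: bool   # learns V or Q (a critic)
    learns_policy: bool  # learns an explicit pi_theta
    uses_model: bool     # knows or learns the dynamics and plans

    def corner(self):
        """Describe which axes are switched on for this method."""
        axes = [
            ("value", self.learns_value),
            ("policy", self.learns_policy),
            ("model", self.uses_model),
        ]
        return " + ".join(label for label, on in axes if on) or "none"


METHODS = [
    Method("DQN", learns_value=True, learns_policy=False, uses_model=False),
    Method("REINFORCE", learns_value=False, learns_policy=True, uses_model=False),
    Method("PPO", learns_value=True, learns_policy=True, uses_model=False),
    Method("AlphaZero", learns_value=True, learns_policy=True, uses_model=True),
]

for m in METHODS:
    print(f"{m.name}: {m.corner()}")
```

Placing a new method then amounts to filling in three booleans and noting where it needs an asterisk.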
This was the theory course: it ends exactly where the engineering begins. The sibling course, RL Post-Training, From First Principles, picks up PPO / GRPO / DPO / RLHF as things you actually run on a GPU cluster: rollout engines, the frozen reference and KL anchor as code, weight-sync wiring, kernels, and the controller that keeps a 7B-parameter policy and a thousand rollouts per step from falling over. If lessons 11–13 left you wanting the implementation, that is the door. Theory here; engineering there.
Interactive · the closing self-test
Ten questions span the whole course: MDP & Bellman, value vs policy, the deadly triad, baselines & variance, GAE's λ, the importance ratio, TRPO's KL constraint and PPO's clip, DPO's closed form, GRPO's group baseline, RLHF's KL anchor, offline-RL distributional shift, BCQ vs CQL, and an application mapping. Pick an answer to lock it in and reveal the explanation, then advance. Your score and a short "review these" list appear at the end. Nothing leaves your browser.