Closing & self-test

Thirty-one lessons, one story. We close by telling that story in a single breath, pointing you at the systems course that picks up where we stop, and handing you a ten-question self-test to find the spots worth a second read.

The whole arc, as one story

Everything you learned descends from one move: switch from instructive feedback (the right label) to evaluative feedback (a scalar reward). That single change forces a world — the Markov decision process — and a goal: a policy π that maximizes expected discounted return. From there the course is the elaboration of one fork and its reunifications.
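
The goal stated above, expected discounted return, can be made concrete in a few lines. This is a minimal sketch, assuming a finite list of rewards and a discount factor `gamma` in [0, 1]; the name `discountedReturn` is illustrative and does not come from the course code.

```javascript
// Discounted return G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
// Computed backwards with the recursion G_t = r_t + gamma * G_{t+1}.
function discountedReturn(rewards, gamma) {
  let g = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    g = rewards[t] + gamma * g;
  }
  return g;
}

// Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
console.log(discountedReturn([1, 1, 1], 0.5)); // 1.75
```

A policy is judged by the expectation of this quantity over the trajectories it generates; the function above scores a single trajectory.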

To solve an MDP you can learn a value ("how good is this state/action?") and act greedily, or learn the policy directly ("what should I do?"). The two branches keep reconverging — first in Actor–Critic, where a value function critiques a policy that learns. Read the rest as: scale the fork (deep RL), make its update quiet (variance reduction), make its update safe (trust regions), reuse it for language (the LLM era), source the reward when none is given (imitation / IRL / RLHF), drop discrete actions (continuous control), drop the simulator entirely (offline RL), and finally apply it.
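
The two branches of the fork differ in what gets learned and how an action is read out of it. The sketch below is illustrative only: it assumes a single state with a small discrete action set, and the names `greedyAction`, `softmaxPolicy`, and `sampleAction` are hypothetical helpers, not part of the course code.

```javascript
// Value branch: given action-value estimates Q(s, a) for one state, act greedily.
function greedyAction(qValues) {
  let best = 0;
  for (let a = 1; a < qValues.length; a++) {
    if (qValues[a] > qValues[best]) best = a;
  }
  return best;
}

// Policy branch: turn learned action preferences into probabilities pi(a | s).
// Subtracting the max before exponentiating keeps the softmax numerically stable.
function softmaxPolicy(preferences) {
  const m = Math.max(...preferences);
  const exps = preferences.map((h) => Math.exp(h - m));
  const z = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / z);
}

// Sample an action index from a probability vector; `u` is a uniform draw in [0, 1).
function sampleAction(probs, u = Math.random()) {
  let cumulative = 0;
  for (let a = 0; a < probs.length; a++) {
    cumulative += probs[a];
    if (u < cumulative) return a;
  }
  return probs.length - 1; // guard against floating-point round-off
}
```

In the value branch the policy is implicit in the argmax; in the policy branch the probabilities themselves are what training adjusts.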

| # | Stage of the story | The one idea | Lesson(s) |
|---|---|---|---|
| 1 | The world & the goal | MDP, return, Bellman; the value × policy fork | 01 |
| 2 | Value branch | Bellman optimality → Q-learning → DQN; the deadly triad | 02, 06 |
| 3 | Policy branch | policy gradient ∇J = 𝔼[G·∇log π]; baseline → advantage | 03, 07 |
| 4 | The reunion | Actor–Critic: policy that learns + value that critiques | 03, 08 |
| 5 | Know the model | dynamic programming → MCTS; model-based planning | 04 |
| 6 | Where data comes from | explore vs exploit; bandits, UCB, Thompson | 05 |
| 7 | Scale it | deep RL, parallel actors (A3C/A2C) | 06 |
| 8 | Quiet the update | variance reduction: reward-to-go, baselines, GAE(λ) | 07, 08 |
| 9 | Reuse the data | importance sampling, the ratio ρ = π_θ / π_θ_old | 09 |
| 10 | Make the update safe | trust regions: TRPO's KL constraint → PPO's clip | 10 |
| 11 | The LLM era | PPO/DPO/GRPO; RLHF's KL-to-reference anchor | 11, 12, 13 |
| 12 | Source the reward | behavioral cloning → DAgger → inverse RL / GAIL | 14 |
| 13 | Continuous actions | DDPG → TD3 → SAC (deterministic / max-entropy) | 15 |
| 14 | Remove the simulator | offline RL: distributional shift; BCQ vs CQL | 16, 17, 18 |
| 15 | Apply it | recsys, robotics, finance, scheduling, NLP, vision, platforms | 19–31 |

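
Rows 9 and 10 of the table compress into two small functions. This is a hedged sketch for a single sample, not a training loop: the function names are hypothetical, and the clip width `eps = 0.2` is an assumed default, not a value given in this lesson.

```javascript
// Importance ratio rho = pi_theta(a|s) / pi_theta_old(a|s), computed from log-probabilities.
function importanceRatio(logpNew, logpOld) {
  return Math.exp(logpNew - logpOld);
}

// PPO's clipped surrogate for one sample:
//   min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)
// Taking the min removes any incentive to push rho outside the clip range
// in the direction that would increase the objective.
function ppoClippedObjective(rho, advantage, eps = 0.2) {
  const clipped = Math.min(Math.max(rho, 1 - eps), 1 + eps);
  return Math.min(rho * advantage, clipped * advantage);
}
```

With a positive advantage and `rho = 1.5`, the objective is capped at `1.2 * A`; with a negative advantage and the same ratio, the unclipped term `1.5 * A` is the smaller one and is kept, so a harmful move is still penalized in full.
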
The skill to keep — place any paper on the value × policy × model map

When you meet a new method, do not memorize its acronym. Locate it on three axes and you have already understood most of it:

DQN = value, no model, implicit policy. REINFORCE = policy, no value, no model. PPO = both (clipped policy + value critic), no model. AlphaZero = all three. GRPO = policy + a group-mean value surrogate, no critic network. DPO = a policy trained as if there were a reward, with the RL loop folded into a closed form. Same map, different corners.
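
One way to hold the map in your head is as a small lookup table. The sketch below encodes only the placements named in the paragraph above; the object layout and the `describe` helper are hypothetical. DPO is left out because its folded-in reward does not reduce cleanly to three flags.

```javascript
// Each method as a point on three axes: does it learn a value, a policy, a model?
// DQN's policy is implicit (greedy over Q), so its policy flag is false here.
// 'surrogate' marks GRPO's group-mean baseline, which stands in for a value
// function without a critic network.
const METHOD_MAP = {
  DQN:       { value: true,        policy: false, model: false },
  REINFORCE: { value: false,       policy: true,  model: false },
  PPO:       { value: true,        policy: true,  model: false },
  AlphaZero: { value: true,        policy: true,  model: true  },
  GRPO:      { value: 'surrogate', policy: true,  model: false },
};

// Summarize which axes a method occupies.
function describe(name) {
  const m = METHOD_MAP[name];
  if (!m) return `${name}: not on the map yet`;
  const axes = ['value', 'policy', 'model'].filter((k) => m[k]);
  return `${name}: ${axes.join(' + ')}`;
}

console.log(describe('AlphaZero')); // "AlphaZero: value + policy + model"
```

Placing a new paper then amounts to filling in one more row and noticing which existing row it most resembles.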

What to read next — the systems course

This was the theory course: it ends exactly where the engineering begins. The sibling course, RL Post-Training, From First Principles, picks up PPO / GRPO / DPO / RLHF as things you actually run on a GPU cluster: rollout engines, the frozen reference and KL anchor as code, weight-sync wiring, kernels, and the controller that keeps a 7B-parameter policy and a thousand rollouts per step from falling over. If lessons 11–13 left you wanting the implementation, that is the door. Theory here; engineering there.

Interactive · the closing self-test

Ten questions spanning the whole course — MDP & Bellman, value vs policy, the deadly triad, baselines & variance, GAE's λ, the importance ratio, TRPO/PPO's clip, DPO's closed form, GRPO's group baseline, RLHF's KL anchor, offline-RL distributional shift, BCQ vs CQL, and an application mapping. Pick an answer to lock it in and reveal the explanation; then advance. Your score and a short "review these" list appear at the end. Nothing leaves your browser.

Self-test · 10 questions
One correct option each. Selecting an answer reveals why it is right (or why the others are wrong). Use Next to advance; Restart to try again.

The quiz engine (≈30 lines)
// Each question: stem, 4 options, index of correct, explanation, topic + lesson tag.
const QUESTIONS = [ /* { topic, lesson, stem, options:[...], correct, why } ... */ ];

// optionsEl, markButtons, showExplain and showFinal are not shown in this excerpt.
let i = 0, score = 0, answered = 0, locked = false;

// Draw one button per option for question q.
function renderOptions(q){
  locked = false;                // a freshly rendered question accepts one answer
  optionsEl.innerHTML = '';
  q.options.forEach((text, idx) => {
    const b = document.createElement('button');
    b.textContent = text;
    b.onclick = () => choose(idx);
    optionsEl.appendChild(b);
  });
}
// Lock in the chosen option, update the score, and reveal the explanation.
function choose(idx){
  if (locked) return;            // one answer per question
  locked = true; answered++;
  const q = QUESTIONS[i];
  const ok = (idx === q.correct);
  if (ok) score++;
  // colour the buttons, reveal the explanation, enable Next
  markButtons(q.correct, idx);
  showExplain(ok, q);
  if (i === QUESTIONS.length - 1) showFinal();
}
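
The "review these" list mentioned earlier could be derived from the `topic` and `lesson` tags on each question. The excerpt above does not record per-question results, so this sketch assumes a hypothetical `results` array (one boolean per question) and a hypothetical helper `reviewList`; it is one possible approach, not the page's actual implementation.

```javascript
// Build a de-duplicated list of { lesson, topic } pairs for missed questions.
// Assumption: results[k] is true when question k was answered correctly.
function reviewList(questions, results) {
  const seen = new Set();
  const items = [];
  questions.forEach((q, k) => {
    if (results[k]) return;                 // answered correctly, nothing to review
    const key = `${q.lesson}|${q.topic}`;
    if (seen.has(key)) return;              // list each lesson/topic pair once
    seen.add(key);
    items.push({ lesson: q.lesson, topic: q.topic });
  });
  return items;
}

// Example with three made-up questions; the second and third were missed.
const sampleQuestions = [
  { topic: 'MDP & Bellman', lesson: '01' },
  { topic: 'GAE lambda', lesson: '08' },
  { topic: 'GAE lambda', lesson: '08' },
];
console.log(reviewList(sampleQuestions, [true, false, false]));
// [ { lesson: '08', topic: 'GAE lambda' } ]
```

Keeping this as a pure function of the questions and results makes it easy to check without a browser.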
Takeaway · the one sentence that survives the whole course
Reinforcement learning is learning to act from evaluative feedback inside an MDP; every method is a different answer to "learn a value, a policy, or both — and how do you keep that learning low-variance, trustworthy, and fed with a reward?" Carry the value × policy × model map, and no new acronym will ever be more than a point on it. Thank you for finishing — the systems course is the natural next step.
