Closing & self-test

Thirty-one lessons, one story. We close by telling that story in a single breath, pointing you at the systems course that picks up where we stop, and handing you a ten-question self-test to find the spots worth a second read.

The whole arc, as one story

Everything you learned descends from one move: switch from instructive feedback (the right label) to evaluative feedback (a scalar reward). That single change forces a world — the Markov decision process — and a goal: a policy π that maximizes expected discounted return. From there the course is the elaboration of one fork and its reunifications.
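As a reminder of what "discounted return" means concretely, here is a minimal sketch for one finite episode. The function name, `rewards`, and `gamma` are illustrative choices, not names taken from the course.

```javascript
// Discounted return G = r_0 + γ·r_1 + γ²·r_2 + ... for one finite episode.
// `rewards` is the list of per-step rewards; `gamma` is the discount factor in [0, 1].
function discountedReturn(rewards, gamma) {
  let g = 0;
  // Walk backwards so each step applies the recursion G_t = r_t + γ·G_{t+1}.
  for (let t = rewards.length - 1; t >= 0; t--) {
    g = rewards[t] + gamma * g;
  }
  return g;
}

console.log(discountedReturn([1, 1, 1], 0.5)); // 1 + 0.5 + 0.25 = 1.75
```

The policy's objective is the expectation of this quantity over the episodes the policy generates; the sketch computes it for a single sampled episode only.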

To solve an MDP you can learn a value ("how good is this state/action?") and act greedily, or learn the policy directly ("what should I do?"). The two branches keep reconverging — first in Actor–Critic, where a value function critiques a policy that learns. Read the rest as: scale the fork (deep RL), make its update quiet (variance reduction), make its update safe (trust regions), reuse it for language (the LLM era), source the reward when none is given (imitation / IRL / RLHF), drop discrete actions (continuous control), drop the simulator entirely (offline RL), and finally apply it.
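The fork can be seen in how an action gets chosen. The sketch below is a toy illustration under assumed inputs: `qValues` stands for learned action values in one state, `probs` for an explicit policy π(·|s), and `u` for a uniform random number supplied by the caller so the call is reproducible. None of these names come from the course.

```javascript
// Value branch: the policy is implicit. Act greedily on the learned action values.
function greedyAction(qValues) {
  let best = 0;
  for (let a = 1; a < qValues.length; a++) {
    if (qValues[a] > qValues[best]) best = a;
  }
  return best;
}

// Policy branch: the policy is explicit. Sample an action from π(·|s).
// `u` is a uniform random number in [0, 1).
function sampleAction(probs, u) {
  let cumulative = 0;
  for (let a = 0; a < probs.length; a++) {
    cumulative += probs[a];
    if (u < cumulative) return a;
  }
  return probs.length - 1; // guard against floating-point round-off
}

console.log(greedyAction([0.1, 0.7, 0.2]));       // 1
console.log(sampleAction([0.1, 0.7, 0.2], 0.85)); // 2
```

In an Actor–Critic method both pieces coexist: the actor samples as in `sampleAction`, while the critic's value estimates shape the actor's update rather than picking the action directly.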

#	Stage of the story	The one idea	Lesson(s)
1	The world & the goal	MDP, return, Bellman; the value × policy fork	01
2	Value branch	Bellman optimality → Q-learning → DQN; the deadly triad	02, 06
3	Policy branch	policy gradient ∇J = 𝔼[G·∇log π]; baseline → advantage	03, 07
4	The reunion	Actor–Critic: policy that learns + value that critiques	03, 08
5	Know the model	dynamic programming → MCTS; model-based planning	04
6	Where data comes from	explore vs exploit; bandits, UCB, Thompson	05
7	Scale it	deep RL, parallel actors (A3C/A2C)	06
8	Quiet the update	variance reduction: reward-to-go, baselines, GAE(λ)	07, 08
9	Reuse the data	importance sampling, the ratio ρ = π_θ/π_old	09
10	Make the update safe	trust regions: TRPO's KL constraint → PPO's clip	10
11	The LLM era	PPO/DPO/GRPO; RLHF's KL-to-reference anchor	11, 12, 13
12	Source the reward	behavioral cloning → DAgger → inverse RL / GAIL	14
13	Continuous actions	DDPG → TD3 → SAC (deterministic / max-entropy)	15
14	Remove the simulator	offline RL: distributional shift; BCQ vs CQL	16, 17, 18
15	Apply it	recsys, robotics, finance, scheduling, NLP, vision, platforms	19–31
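Rows 9 and 10 combine into PPO's clipped surrogate: the importance ratio ρ from row 9 is clipped to stay near 1. A minimal per-sample sketch follows, assuming the standard form min(ρ·A, clip(ρ, 1−ε, 1+ε)·A); the function names and the default ε = 0.2 are illustrative assumptions, not values stated in the course.

```javascript
// Clamp x into the interval [lo, hi].
function clip(x, lo, hi) {
  return Math.min(Math.max(x, lo), hi);
}

// Per-sample PPO clipped surrogate: min(ρ·A, clip(ρ, 1-ε, 1+ε)·A).
// `ratio` is ρ = π_θ(a|s) / π_old(a|s); `advantage` is the advantage estimate A.
// The default epsilon of 0.2 is an assumed, commonly used value.
function ppoClippedObjective(ratio, advantage, epsilon = 0.2) {
  const unclipped = ratio * advantage;
  const clipped = clip(ratio, 1 - epsilon, 1 + epsilon) * advantage;
  return Math.min(unclipped, clipped);
}

console.log(ppoClippedObjective(2, 1));  // 1.2 (gain is capped once ρ exceeds 1+ε)
console.log(ppoClippedObjective(2, -1)); // -2  (a bad action is penalized in full)
```

Taking the minimum makes the bound pessimistic: the objective stops rewarding moves away from the old policy once ρ leaves the clip range, but it never hides a loss.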

The skill to keep — place any paper on the value × policy × model map

When you meet a new method, do not memorize its acronym. Locate it on three axes and you have already understood most of it:

Value? — does it learn V or Q (a critic), and is the policy implicit (greedy) or explicit?
Policy? — does it learn π_θ directly, and how does it keep the update safe (baseline, advantage, KL/clip trust region)?
Model? — does it know or learn the dynamics P, R and plan, or is it model-free?

DQN = value, no model, implicit policy. REINFORCE = policy, no value, no model. PPO = both (clipped policy + value critic), no model. AlphaZero = all three. GRPO = policy + a group-mean value surrogate, no critic network. DPO = a policy trained as if there were a reward, with the RL loop folded into a closed form. Same map, different corners.
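The first four placements can be written down as a small lookup table. This is only a sketch of the map as data: the field names are invented for illustration, and GRPO and DPO are left out because their placement (a group-mean surrogate, a folded-in RL loop) does not reduce cleanly to three booleans.

```javascript
// Three-axis placement of four methods named above. Field names are illustrative.
const METHODS = {
  DQN:       { learnsValue: true,  learnsPolicy: false, usesModel: false },
  REINFORCE: { learnsValue: false, learnsPolicy: true,  usesModel: false },
  PPO:       { learnsValue: true,  learnsPolicy: true,  usesModel: false },
  AlphaZero: { learnsValue: true,  learnsPolicy: true,  usesModel: true  },
};

// Return the corner of the map a method occupies, e.g. "value + policy".
function corner(name) {
  const m = METHODS[name];
  if (!m) throw new Error(`unknown method: ${name}`);
  const axes = [];
  if (m.learnsValue) axes.push('value');
  if (m.learnsPolicy) axes.push('policy');
  if (m.usesModel) axes.push('model');
  return axes.join(' + ');
}

console.log(corner('PPO')); // value + policy
```

Placing a new paper then amounts to filling in one more row, which is usually possible after reading its method section.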

What to read next — the systems course

This was the theory course: it ends exactly where the engineering begins. The sibling course, RL Post-Training, From First Principles, picks up PPO / GRPO / DPO / RLHF as things you actually run on a GPU cluster: rollout engines, the frozen reference and KL anchor as code, weight-sync wiring, kernels, and the controller that keeps a 7B-parameter policy and a thousand rollouts per step from falling over. If lessons 10–13 left you wanting the implementation, that is the door. Theory here; engineering there.

Interactive · the closing self-test

Ten questions spanning the whole course: MDP & Bellman, value vs policy, the deadly triad, baselines & variance, GAE's λ, the importance ratio, TRPO's KL constraint and PPO's clip, DPO's closed form, GRPO's group baseline, RLHF's KL anchor, offline-RL distributional shift, BCQ vs CQL, and an application mapping. Pick an answer to lock it in and reveal the explanation; then advance. Your score and a short "review these" list appear at the end. Nothing leaves your browser.

Self-test · 10 questions

One correct option each. Selecting an answer reveals why it is right (or why the others are wrong). Use Next to advance; Restart to try again.

The quiz engine (≈30 lines)

// Each question: stem, 4 options, index of correct, explanation, topic + lesson tag.
const QUESTIONS = [ /* { topic, lesson, stem, options:[...], correct, why } ... */ ];

// optionsEl, markButtons, showExplain and showFinal are defined elsewhere on the page.
let i = 0, score = 0, answered = 0, locked = false;

// Render one button per option for question q.
function renderOptions(q){
  optionsEl.innerHTML = '';
  q.options.forEach((text, idx) => {
    const b = document.createElement('button');
    b.type = 'button';
    b.textContent = text;
    b.addEventListener('click', () => choose(idx));
    optionsEl.appendChild(b);
  });
}

// Lock in an answer, update the tallies, and reveal the explanation.
function choose(idx){
  if (locked) return;            // one answer per question
  locked = true; answered++;
  const q = QUESTIONS[i];
  const ok = (idx === q.correct);
  if (ok) score++;
  // colour the buttons, reveal the explanation, enable Next
  markButtons(q.correct, idx);
  showExplain(ok, q);
  if (i === QUESTIONS.length - 1) showFinal();
}
// Not shown: the Next handler, which must advance i and reset locked before re-rendering.
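The excerpt stops before the final screen. One way the score and the "review these" list could be derived is sketched below. `summarize` and the `chosen` array are hypothetical helpers, not part of the page's engine, and the sample questions are placeholder data that use only the `topic`, `lesson`, and `correct` fields from the question shape above.

```javascript
// Hypothetical helper: `chosen[k]` is the option index picked for question k
// (or null if unanswered). Returns the score and the distinct topics to review.
function summarize(questions, chosen) {
  let score = 0;
  const review = [];
  questions.forEach((q, k) => {
    if (chosen[k] === q.correct) {
      score++;
      return;
    }
    const tag = `${q.topic} (lesson ${q.lesson})`;
    if (!review.includes(tag)) review.push(tag); // list each topic once
  });
  return { score, total: questions.length, review };
}

// Placeholder questions, for illustration only.
const sample = [
  { topic: 'GAE', lesson: '08', correct: 2 },
  { topic: 'PPO clip', lesson: '10', correct: 0 },
  { topic: 'GAE', lesson: '08', correct: 1 },
];
console.log(summarize(sample, [2, 3, 0]));
// score is 1 of 3; review lists "PPO clip (lesson 10)" and "GAE (lesson 08)"
```

Keeping the summary a pure function of the questions and the chosen indices makes it easy to check without a browser, and it involves no network call, consistent with nothing leaving the page.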

Takeaway · the one sentence that survives the whole course

Reinforcement learning is learning to act from evaluative feedback inside an MDP; every method is a different answer to "learn a value, a policy, or both — and how do you keep that learning low-variance, trustworthy, and fed with a reward?" Carry the value × policy × model map, and no new acronym will ever be more than a point on it. Thank you for finishing — the systems course is the natural next step.
