Orientation — the map before the trees
Reinforcement learning can look like a parade of acronyms — Q-learning, DQN, A3C, TRPO, PPO, SAC, GRPO. This page is the map: what RL actually is, the one fork that organizes every method, and how the 31 lessons fit together. Read once, then start lesson 01.
What reinforcement learning actually is
Supervised learning gives you instructive feedback: for each input, the correct label. Reinforcement learning gives you only evaluative feedback: you act, and the world hands back a scalar reward — better or worse, never "here is what you should have done." That single change — evaluative instead of instructive — is the whole subject. It forces you to explore to discover what is good, to assign credit across time when reward is delayed, and to trade off acting well now against learning for later.
The setting is the Markov decision process (lesson 01). An agent in a state takes an action, the environment returns a reward and a next state, and the loop repeats:
    repeat:
        observe state s
        choose action a ~ π(·|s)              # the policy decides
        receive reward r and next state s'
        learn from (s, a, r, s')              # make π better
The agent's goal is a policy π that maximizes the expected return — the discounted sum of future rewards. Everything in this course is a different way to find that policy.
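The loop and the return can be made concrete with a minimal sketch. Everything named here is an illustrative assumption rather than part of the course code: `ChainEnv` is a hypothetical five-state toy MDP, `random_policy` is a stand-in for π, and the reward indexing (first reward undiscounted) is one common convention.

```python
import random


class ChainEnv:
    """Hypothetical toy MDP: states 0..4 on a line; reaching state 4 pays +1 and ends the episode."""

    N_STATES = 5

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; the walls clamp the move.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.N_STATES - 1)
        done = self.state == self.N_STATES - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    """Stand-in for a ~ pi(.|s): ignores the state and picks uniformly."""
    return rng.choice([0, 1])


def discounted_return(rewards, gamma):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., accumulated back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def run_episode(env, policy, rng, max_steps=200):
    """Run the observe / act / receive loop once and collect the rewards."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = env.step(action)
        rewards.append(reward)
        # A learning agent would update pi from (state, action, reward, next_state) here.
        state = next_state
        if done:
            break
    return rewards


rng = random.Random(0)
rewards = run_episode(ChainEnv(), random_policy, rng)
print(len(rewards), discounted_return(rewards, gamma=0.9))
```

The random policy never learns, so its return only sets a floor. Every method in the course replaces the commented-out update with something that raises the expected value of `discounted_return`.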
The one fork — the spine of the course
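There are exactly two ways to get a good policy out of an MDP, and the most powerful methods use both at once: learn a value function and act greedily with respect to it (value-based), or adjust the policy itself in the direction of higher return (policy-based). Actor–Critic methods combine the two.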
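As a rough illustration of the fork, here is one update step from each side, written for small tabular problems. This is a sketch under assumptions, not the course's implementation: the function names, the step size `alpha`, and the softmax parameterization of the policy are all choices made here for illustration. The value side uses the standard Q-learning update; the policy side uses the REINFORCE update for a softmax policy.

```python
import math


def q_learning_update(q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Value route: move Q(s, a) toward the target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def softmax(prefs):
    """Turn action preferences into probabilities (shifted by the max for stability)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_update(theta, s, a, G, alpha=0.5):
    """Policy route: theta += alpha * G * grad log pi(a|s) for a tabular softmax policy."""
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        # Gradient of log softmax: 1 - pi(b|s) for the chosen action, -pi(b|s) otherwise.
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha * G * grad_log_pi


# Usage: two states, two actions, everything initialized to zero.
q = [[0.0, 0.0], [0.0, 0.0]]
q_learning_update(q, s=0, a=1, r=1.0, s_next=1)

theta = [[0.0, 0.0]]
reinforce_update(theta, s=0, a=1, G=1.0)
```

The value route never touches the policy directly: the policy is whatever is greedy with respect to `q`. The policy route never estimates a value: it only needs a sampled return `G`. An actor-critic method keeps both tables and lets the learned value stand in for `G`.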
The three questions Part II keeps answering
Once you can scale the fork with neural networks, three pressures shape every advanced method:
- Variance. The policy gradient is estimated from samples and is brutally noisy. Baselines, advantages, and GAE (lessons 07–08) fight variance.
- Trust. A single large policy update can destroy the policy. Importance sampling, TRPO's KL trust region, and PPO's clip (lessons 09–11) keep each step safe.
- Signal. Where does the reward come from? A verifier, a human-preference model (RLHF, lesson 13), an expert's behavior (inverse RL, lesson 14), or a fixed dataset (offline RL, lessons 16–18).
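The first two pressures each have a compact formula, sketched below in plain Python. The function names and default hyperparameters are assumptions chosen for illustration, and the lessons themselves may present these differently. `gae_advantages` follows the usual GAE recursion, and `ppo_clipped_term` is the per-sample clipped surrogate from PPO.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Variance: generalized advantage estimation over one trajectory.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final reward
    (use 0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better this step went than the critic predicted.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Trust: PPO's clipped surrogate for one sample.

    `ratio` is pi_new(a|s) / pi_old(a|s). Taking the minimum removes any
    incentive to push the ratio outside [1 - eps, 1 + eps].
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Usage: a two-step trajectory with an untrained (all-zero) critic.
adv = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
term = ppo_clipped_term(ratio=1.5, advantage=adv[0])
```

Setting `lam=0` reduces GAE to the one-step TD error (low variance, more bias), while `lam=1` recovers the full discounted return minus the baseline (high variance, less bias).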
What each part teaches
| Part | Lessons | Question it answers |
|---|---|---|
| I · Foundations | 01–05 | What is an MDP, and what are the two ways to solve it? (value, policy, their reunion in Actor–Critic, planning when the model is known, and exploration.) |
| II · Advanced | 06–18 | How do you scale it with deep nets, make the update trustworthy (REINFORCE → TRPO → PPO), reuse it for language models (PPO/DPO/GRPO/RLHF), and push past its assumptions (imitation, continuous control, offline RL)? |
| III · Applications | 19–31 | How does each domain — recommendation, robotics, finance, scheduling, NLP, vision, platforms — map back to one of the core methods? |
The minimal glossary (for the first few lessons)
- s, a, r, s′ — state, action, reward, next state. π(a|s) — the policy.
- Return G — discounted sum of future rewards, with discount γ ∈ [0,1).
- V(s), Q(s,a) — value of a state / of an action: the expected return from there.
- Advantage A = Q − V — how much better an action is than average. The workhorse of Part II.
- On-policy / off-policy — learning from data the current policy generated, vs from older or other data.
- Model-based / model-free — whether you know (or learn) the transition dynamics P and can plan.
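For readers who prefer symbols, the first four glossary entries can be written as follows. The indexing is an assumption made here (r_t is taken to be the reward received after acting in s_t; some texts call it r_{t+1}), and the last identity is written as a sum, which assumes a discrete action set.

```latex
\begin{align*}
G_t &= \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}, \qquad \gamma \in [0,1) \\
V^{\pi}(s) &= \mathbb{E}_{\pi}\left[\, G_t \mid s_t = s \,\right] \\
Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\left[\, G_t \mid s_t = s,\ a_t = a \,\right] \\
A^{\pi}(s,a) &= Q^{\pi}(s,a) - V^{\pi}(s),
\qquad V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s,a)
\end{align*}
```

The last identity is why the advantage reads as "better than average": V is the policy-weighted average of Q, so the advantages of the actions in any state average to zero under π.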
Everything else is defined as you meet it. Onward to the MDP.