Orientation — the map before the trees

Reinforcement learning can look like a parade of acronyms — Q-learning, DQN, A3C, TRPO, PPO, SAC, GRPO. This page is the map: what RL actually is, the one fork that organizes every method, and how the 31 lessons fit together. Read once, then start lesson 01.

What reinforcement learning actually is

Supervised learning gives you instructive feedback: for each input, the correct label. Reinforcement learning gives you only evaluative feedback: you act, and the world hands back a scalar reward — better or worse, never "here is what you should have done." That single change — evaluative instead of instructive — is the whole subject. It forces you to explore to discover what is good, to assign credit across time when reward is delayed, and to trade off acting well now against learning for later.

The setting is the Markov decision process (lesson 01). An agent in a state takes an action, the environment returns a reward and a next state, and the loop repeats:

repeat:
    observe state  s
    choose action  a ~ π(·|s)        # the policy decides
    receive reward r  and next state s'
    learn from (s, a, r, s')          # make π better
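
The loop above can be sketched in runnable Python. This is only an illustration under stated assumptions: `ToyEnv`, its `reset`/`step` methods, the two-state dynamics, and the reward values are hypothetical and invented here, the policy is uniformly random, and the learning step is left out (the transitions are simply collected).

```python
import random

class ToyEnv:
    """Hypothetical 2-state MDP: action 1 taken in state 0 pays +1 and moves to state 1."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if self.state == 0 and action == 1:
            self.state, reward = 1, 1.0
        else:
            self.state, reward = 0, 0.0
        return reward, self.state

def random_policy(state, rng):
    # pi(.|s): uniform over the two actions, ignoring the state
    return rng.choice([0, 1])

def run_episode(env, policy, steps, seed=0):
    rng = random.Random(seed)
    transitions = []
    s = env.reset()                              # observe state s
    for _ in range(steps):
        a = policy(s, rng)                       # choose action a ~ pi(.|s)
        r, s_next = env.step(a)                  # receive reward r and next state s'
        transitions.append((s, a, r, s_next))    # what a learner would consume
        s = s_next
    return transitions

transitions = run_episode(ToyEnv(), random_policy, steps=5)
```

Every method in the course replaces the `append` line with a different learning rule; the rest of the loop stays the same.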

The agent's goal is a policy π that maximizes the expected return — the discounted sum of future rewards. Everything in this course is a different way to find that policy.
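
To make "discounted sum of future rewards" concrete, here is a small sketch that computes the return of one finite reward sequence. The function name `discounted_return` and the example numbers are assumptions made for illustration; the expected return is the average of this quantity over many trajectories.

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., accumulated from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
g = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

Working backwards avoids computing powers of gamma explicitly, and it is the same recursion the value functions satisfy.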

The one fork — the spine of the course

There are exactly two ways to get a good policy out of an MDP: learn a value function and act greedily on it, or learn the policy directly. The most powerful methods use both at once.

The sentence to carry through all 31 lessons

Value-based methods learn a number (how good?) and let the policy fall out of it by greedy choice. Policy-based methods learn the policy itself. They keep reconverging: Actor–Critic, GAE, TRPO, PPO, SAC are all "a policy that learns, with a value function that critiques it to keep the update stable." When you meet a new method, first ask: is it learning a value, a policy, or both — and what does it add to keep the update safe?
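
The two branches differ in what is stored, which a tabular sketch makes visible. The state and action names, the numbers, and the helper names `greedy_action` and `softmax_policy` are all hypothetical. Softmax over preferences is one common way to parameterize a policy, assumed here for illustration.

```python
import math

# Value-based: store Q(s, a); the policy is whatever is greedy with respect to it.
Q = {"s0": {"left": 0.2, "right": 0.7}}

def greedy_action(q_table, state):
    return max(q_table[state], key=q_table[state].get)

# Policy-based: store action preferences and map them straight to probabilities.
logits = {"s0": {"left": 0.0, "right": 1.0}}

def softmax_policy(prefs, state):
    m = max(prefs[state].values())  # subtract the max for numerical stability
    exps = {a: math.exp(v - m) for a, v in prefs[state].items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

action = greedy_action(Q, "s0")        # the policy falls out of the values
probs = softmax_policy(logits, "s0")   # the policy is the learned object
```

An actor-critic method keeps both tables: the preferences are what gets updated, and the values decide how large each update should be.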

The three questions Part II keeps answering

Once both branches of the fork are scaled up with neural networks, three pressures shape every advanced method:

Variance. The policy gradient is estimated from samples and is brutally noisy. Baselines, advantages, and GAE (lessons 07–08) fight variance.
Trust. A single large policy update can destroy the policy. Importance sampling, TRPO's KL trust region, and PPO's clip (lessons 09–11) keep each step safe.
Signal. Where does the reward come from? A verifier, a human-preference model (RLHF, lesson 13), an expert's behavior (inverse RL, lesson 14), or a fixed dataset (offline RL, lessons 16–18).
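
The first two pressures each reduce to a one-line computation per sample, sketched below. The function names, the example numbers, and the choice `eps=0.2` are assumptions for illustration. The second function follows the clipped surrogate term as PPO is commonly written, and the lessons themselves remain the reference for the full objectives.

```python
def advantage(ret, baseline):
    # Variance: subtracting a baseline recenters the signal around "better or worse than expected"
    return ret - baseline

def ppo_clip_term(ratio, adv, eps=0.2):
    # Trust: ratio = pi_new(a|s) / pi_old(a|s). Taking the min with the clipped
    # version removes any extra gain from moving the ratio outside [1 - eps, 1 + eps].
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)

adv = advantage(ret=5.0, baseline=3.0)           # 2.0: this action did better than expected
capped = ppo_clip_term(ratio=1.5, adv=adv)       # about 2.4: the ratio is capped at 1.2
untouched = ppo_clip_term(ratio=0.5, adv=adv)    # 1.0: the unclipped term is already the smaller one
```

With a positive advantage, pushing the ratio past `1 + eps` earns nothing extra, so the optimizer has no incentive to take a step that large.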

What each part teaches

Part	Lessons	Question it answers
I · Foundations	01–05	What is an MDP, and what are the two ways to solve it? (value, policy, their reunion in Actor–Critic, planning when the model is known, and exploration.)
II · Advanced	06–18	How do you scale it with deep nets, make the update trustworthy (REINFORCE → TRPO → PPO), reuse it for language models (PPO/DPO/GRPO/RLHF), and push past its assumptions (imitation, continuous control, offline RL)?
III · Applications	19–31	How does each domain — recommendation, robotics, finance, scheduling, NLP, vision, platforms — map back to one of the core methods?

This course vs its sibling

This is the theory course. When you reach lessons 11–13 (PPO/DPO/GRPO/RLHF) you'll see callouts handing off to the RL Post-Training course — the systems side: how those same algorithms run on a GPU cluster (rollout engines, weight sync, kernels). Theory here; engineering there.

The minimal glossary (for the first few lessons)

s, a, r, s′ — state, action, reward, next state. π(a|s) — the policy.
Return G — discounted sum of future rewards, with discount γ ∈ [0,1).
V(s), Q(s,a) — value of a state / of an action: the expected return from there.
Advantage A = Q − V — how much better an action is than average. The workhorse of Part II.
On-policy / off-policy — learning from data the current policy generated, vs from older or other data.
Model-based / model-free — whether you know (or learn) the transition dynamics P and can plan.
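
The relation between V, Q, and the advantage can be checked by hand for a single state. The probabilities and values below are made-up numbers chosen only for illustration. The sketch relies on the standard identity that V(s) is the policy-weighted average of Q(s, a).

```python
pi = {"left": 0.25, "right": 0.75}   # pi(a|s) for one state (made-up numbers)
Q = {"left": 1.0, "right": 3.0}      # Q(s, a) for the same state (made-up numbers)

V = sum(pi[a] * Q[a] for a in pi)    # V(s) = sum over a of pi(a|s) * Q(s, a) = 2.5
A = {a: Q[a] - V for a in Q}         # A(s, a) = Q(s, a) - V(s)

# "Better than average" is literal: weighted by the policy, advantages sum to zero.
mean_advantage = sum(pi[a] * A[a] for a in pi)
```

Here `A` comes out as -1.5 for "left" and 0.5 for "right": the sign says which actions the policy should take more often.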

Everything else is defined as you meet it. Onward to the MDP.