00 · orientation · ~5 min read · before lesson 01

Orientation — the map before the trees

Reinforcement learning can look like a parade of acronyms — Q-learning, DQN, A3C, TRPO, PPO, SAC, GRPO. This page is the map: what RL actually is, the one fork that organizes every method, and how the 31 lessons fit together. Read once, then start lesson 01.

What reinforcement learning actually is

Supervised learning gives you instructive feedback: for each input, the correct label. Reinforcement learning gives you only evaluative feedback: you act, and the world hands back a scalar reward — better or worse, never "here is what you should have done." That single change — evaluative instead of instructive — is the whole subject. It forces you to explore to discover what is good, to assign credit across time when reward is delayed, and to trade off acting well now against learning for later.

The setting is the Markov decision process (lesson 01). An agent in a state takes an action, the environment returns a reward and a next state, and the loop repeats:

repeat:
    observe state  s
    choose action  a ~ π(·|s)        # the policy decides
    receive reward r  and next state s'
    learn from (s, a, r, s')          # make π better
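The loop above can be made concrete with a few lines of Python. This is a minimal sketch, not course code: the `CorridorEnv` environment, its reward of 1.0 at the right end, and the uniform random policy are all hypothetical stand-ins chosen for illustration. The "learn" step is left out on purpose, since each later lesson fills it in differently; the sketch only collects the (s, a, r, s') tuples a learner would consume.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..length-1, reward 1.0 at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    # a ~ π(·|s): this placeholder policy ignores s and picks uniformly
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=1000):
    transitions = []
    s = env.reset()                            # observe state s
    for _ in range(max_steps):
        a = policy(s, rng)                     # choose action a
        s_next, r, done = env.step(a)          # receive reward r and next state s'
        transitions.append((s, a, r, s_next))  # what a learner would train on
        s = s_next
        if done:
            break
    return transitions


rng = random.Random(0)
episode = run_episode(CorridorEnv(), random_policy, rng)
print(len(episode), episode[-1])
```

Passing the random number generator in explicitly keeps the run reproducible, which matters later when you compare learning curves across methods.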

The agent's goal is a policy π that maximizes the expected return — the discounted sum of future rewards. Everything in this course is a different way to find that policy.
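To make "discounted sum of future rewards" concrete, here is a small Python sketch. The symbol `gamma` for the discount factor and the function name `discounted_return` are assumptions for illustration (the text above does not name them). The function computes the return of one reward sequence; the expected return would be the average of this quantity over many episodes sampled under π.

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ..., accumulated back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# A reward of 1.0 that arrives two steps late is worth gamma^2 today.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```

Working backward uses the recursion G_t = r_t + gamma * G_{t+1}, the same structure that value-based methods exploit.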

The one fork — the spine of the course

There are exactly two ways to get a good policy out of an MDP, and the most powerful methods use both at once:

VALUE-BASED
    Learn how GOOD each state/action is.
    Act greedily w.r.t. the value.
    Q-learning, DQN: lessons 02, 06

POLICY-BASED
    Learn WHAT TO DO directly.
    Nudge π toward high-return actions.
    Policy gradient: lessons 03, 07

ACTOR–CRITIC (the reunion)
    The actor is the policy; the critic is the value.
    The value lowers the variance of the policy update.
    Lessons 03 · 08 · TRPO/PPO 10–13 · SAC 15

The sentence to carry through all 31 lessons
Value-based methods learn a number (how good?) and let the policy fall out of it by greedy choice. Policy-based methods learn the policy itself. They keep reconverging: Actor–Critic, GAE, TRPO, PPO, SAC are all "a policy that learns, with a value function that critiques it to keep the update stable." When you meet a new method, first ask: is it learning a value, a policy, or both — and what does it add to keep the update safe?
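The fork can be seen side by side on the smallest possible problem. The sketch below is illustrative only and makes several assumptions that are not from the course: a hypothetical two-armed bandit (`ARM_MEANS`), an epsilon-greedy value learner, and a softmax policy trained with a REINFORCE-style update. The "critic" here is just a running average of reward used as a baseline, a deliberately crude stand-in for a learned value function.

```python
import math
import random

ARM_MEANS = [0.2, 0.8]  # hypothetical: probability that each arm pays reward 1.0


def pull(arm, rng):
    return 1.0 if rng.random() < ARM_MEANS[arm] else 0.0


def value_based(steps=5000, alpha=0.05, epsilon=0.1, seed=0):
    """Learn a number per action; the policy falls out by (epsilon-)greedy choice."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(2)                   # explore
        else:
            a = max(range(2), key=lambda i: q[i])  # act greedily w.r.t. the value
        r = pull(a, rng)
        q[a] += alpha * (r - q[a])                 # move the estimate toward the reward
    return q


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def policy_based(steps=5000, lr=0.1, seed=0, use_critic=True):
    """Learn the policy itself; optionally let a baseline critique each reward."""
    rng = random.Random(seed)
    prefs = [0.0, 0.0]
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(prefs)
        a = 0 if rng.random() < probs[0] else 1
        r = pull(a, rng)
        advantage = r - baseline if use_critic else r
        for i in range(2):
            grad_log = (1.0 if i == a else 0.0) - probs[i]  # d log π(a) / d prefs[i]
            prefs[i] += lr * advantage * grad_log           # nudge π toward high return
        if use_critic:
            baseline += 0.05 * (r - baseline)               # running average of reward
    return softmax(prefs)


print("action values:", value_based())
print("policy probs: ", policy_based())
```

Setting `use_critic=False` gives the pure policy-based variant; comparing the two settings across seeds is one simple way to see the variance reduction that the actor-critic methods rely on.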

The three questions Part II keeps answering

Once you can scale the fork with neural networks, three pressures shape every advanced method:

What each part teaches

Part | Lessons | Question it answers
I · Foundations | 01–05 | What is an MDP, and what are the two ways to solve it? (value, policy, their reunion in Actor–Critic, planning when the model is known, and exploration.)
II · Advanced | 06–18 | How do you scale it with deep nets, make the update trustworthy (REINFORCE → TRPO → PPO), reuse it for language models (PPO/DPO/GRPO/RLHF), and push past its assumptions (imitation, continuous control, offline RL)?
III · Applications | 19–31 | How does each domain (recommendation, robotics, finance, scheduling, NLP, vision, platforms) map back to one of the core methods?

This course vs its sibling
This is the theory course. When you reach lessons 11–13 (PPO/DPO/GRPO/RLHF) you'll see callouts handing off to the RL Post-Training course — the systems side: how those same algorithms run on a GPU cluster (rollout engines, weight sync, kernels). Theory here; engineering there.

The minimal glossary (for the first few lessons)

Everything else is defined as you meet it. Onward to the MDP.