
Orientation — the map before the trees

Eighty-seven lessons, one chain. Read this once; it tells you what the chain is, why it is ordered the way it is, and where you are at any moment.

Reinforcement learning is one idea pursued at three altitudes. The algorithm altitude asks: given an agent that acts and a signal that scores it, how do you improve the policy? The systems altitude asks: when the agent is a large language model and the loop runs across a GPU cluster, how do you build the machine that trains it? The application altitude asks: given a real problem — a recommender, a robot, a power grid — how do you turn it into an RL problem at all, and what makes it hard? This series walks all three, bottom-up, so each altitude rests on the one below.

The single thread

Underneath every lesson is the Markov decision process: a set of states, a set of actions, a reward function, and a transition function. Everything else is a response to one difficulty: you only ever see samples, never the true expectation, so every method is a way to estimate a gradient or a value from noisy experience and act on it without diverging. Hold that and the whole series is variations on a theme.
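To make "samples, never the true expectation" concrete, here is a minimal sketch on a made-up two-state MDP. The chain, the success probability, the discount, and every function name are assumptions chosen for illustration, not something the series defines. The toy is small enough that the true value has a closed form, so the sample estimate can be compared against it.

```python
import random

GAMMA = 0.9       # discount factor (assumed for this toy)
P_SUCCESS = 0.8   # chance that action 1 reaches the goal (assumed)


def step(state, action, rng):
    """Transition and reward of the toy MDP: returns (next_state, reward, done)."""
    if action == 1 and rng.random() < P_SUCCESS:
        return 1, 1.0, True    # reached the goal state
    return 0, 0.0, False       # stayed in the start state


def sample_return(rng, max_steps=1000):
    """Roll out one episode under the fixed policy 'always take action 1'."""
    state, total, discount = 0, 0.0, 1.0
    for _ in range(max_steps):
        state, reward, done = step(state, 1, rng)
        total += discount * reward
        discount *= GAMMA
        if done:
            break
    return total


def monte_carlo_value(n_episodes, seed=0):
    """Estimate V(start) as the average of sampled discounted returns."""
    rng = random.Random(seed)
    return sum(sample_return(rng) for _ in range(n_episodes)) / n_episodes


# Closed form for this chain only: V = p + (1 - p) * gamma * V
TRUE_VALUE = P_SUCCESS / (1.0 - (1.0 - P_SUCCESS) * GAMMA)

if __name__ == "__main__":
    for n in (10, 100, 5000):
        print(n, round(monte_carlo_value(n), 4), "true:", round(TRUE_VALUE, 4))
```

In a real problem the closed form is unavailable, and only the noisy average exists. That gap is the difficulty every later method is built around.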

The first fork (Part I)
The foundations split on a single question: do you learn how good states and actions are, or what to do directly? The value branch (Q-learning → DQN) learns a value function and acts greedily with respect to it. The policy branch (policy gradient → Actor–Critic → TRPO/PPO) parameterizes the policy and pushes its parameters uphill on expected return. The LLM-era methods (PPO, DPO, GRPO, and the RLHF pipeline that uses them) all live on the policy branch, which is why Part I spends most of its time there.
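The fork can be sketched on the smallest possible problem, a two-armed bandit. The arm means, step sizes, and function names below are assumptions for illustration. Because there is only one state, the Q-learning target has no bootstrap term and the value update reduces to a running average. The policy side is plain REINFORCE with a softmax policy and no baseline.

```python
import math
import random

ARM_MEANS = [0.2, 1.0]   # hypothetical mean rewards; arm 1 is the better arm


def pull(arm, rng):
    """Noisy reward for the chosen arm."""
    return rng.gauss(ARM_MEANS[arm], 0.1)


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def value_branch(steps=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Learn Q(a), then act (epsilon-)greedily with respect to it."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(2)                     # explore
        else:
            arm = max(range(2), key=lambda a: q[a])    # exploit
        reward = pull(arm, rng)
        q[arm] += alpha * (reward - q[arm])            # move Q toward the sample
    return q


def policy_branch(steps=2000, lr=0.1, seed=0):
    """Parameterize the policy directly and follow the REINFORCE gradient."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        pi = softmax(theta)
        arm = 0 if rng.random() < pi[0] else 1
        reward = pull(arm, rng)
        for a in range(2):
            # d/d theta[a] of log pi(arm) for a softmax policy
            grad_log_pi = (1.0 if a == arm else 0.0) - pi[a]
            theta[a] += lr * reward * grad_log_pi
    return softmax(theta)


if __name__ == "__main__":
    print("Q values:", value_branch())
    print("policy probabilities:", policy_branch())
```

Both routes end up preferring arm 1, but they store different things: the value branch keeps an estimate of how good each action is, while the policy branch keeps only the action probabilities.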
The three forces (Part II)
Post-training systems are shaped by three forces in tension: the signal (where reward comes from — a verifier or a reward model), the estimator (which algorithm turns rollouts into a gradient, and how much variance it carries), and the cost (rollout generation dominates wall-clock, so topology, the KV cache, and scheduling decide what is affordable). Every recipe in Part II is a particular balance of the three.
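As one illustration of signal and estimator meeting, the sketch below scores a group of rollouts for a single prompt with a toy verifier, then converts the rewards into group-relative advantages in the style usually attributed to GRPO. The exact-match verifier, the sample answers, and the choice of population standard deviation are assumptions made here. Real implementations differ on those details, and the series' own formulation may too.

```python
from statistics import mean, pstdev


def verifier_reward(answer: str, expected: str) -> float:
    """Hypothetical verifier: 1.0 for an exact match after trimming, else 0.0."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of rollouts.

    Subtracting the group mean acts as a baseline (lower variance);
    dividing by the spread keeps the scale comparable across prompts.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)   # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four made-up rollouts for one prompt whose expected answer is "42".
rollouts = ["42", "41", "42 ", "7"]
rewards = [verifier_reward(a, "42") for a in rollouts]
advantages = group_relative_advantages(rewards)

if __name__ == "__main__":
    for text, r, adv in zip(rollouts, rewards, advantages):
        print(repr(text), r, round(adv, 3))
```

Note the cost side hiding in this snippet: every entry in `rollouts` stands for a full generation, which is why the group size is a budget decision as much as a statistical one. A group whose rewards are all identical yields zero advantages and therefore no gradient.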
The five-step loop (Parts III–IV)
Applied RL runs one loop on every domain: formulate the MDP (what are state, action, reward?), diagnose the one difficulty that binds it (sparse reward? partial observability? safety? non-stationarity?), engineer the mechanism that removes it, and guard it in production. Part III does this at the level of concepts; Part IV does it as engineering across twenty domains.

The four parts

How to read it

  1. Linearly. Part I's policy gradient is Part II's REINFORCE; Part II's GRPO is the algorithm Part III's NLP and Part IV's domains assume; the trust region from lesson 13 reappears as PPO's clip everywhere after. Skipping forward means importing terms you haven't earned.
  2. By altitude, if you must. Here for the algorithms? Read Part I and stop. Building infrastructure? Skim Part I's policy branch, then live in Part II. Designing for a specific domain? Find it in Part IV and backtrack to the Part I lessons it leans on.
  3. Decompose every method. Value or policy? On-policy or off-policy? What is the reward signal, what is the variance-reduction trick, and what binds the cost? Most methods in the literature amount to one choice along each of these axes.