Frontiers & outlook — from AGI to human-AI collaboration
Thirty lessons built one machine: an MDP, two ways to solve it, and a long campaign to scale, stabilize, and source the reward. This lesson points that machine at the open frontier — reasoning models, agents, world models, alignment — and asks which of the unsolved problems each of the earlier lessons actually bears on.
Where we are: RL is the engine of reasoning and agency
For most of this course RL was a way to train a game-player or a robot. As of the mid-2020s it is the thing that turned a next-token predictor into a reasoner. The recipe is short and you already know every piece of it: take a pretrained language model as the policy πθ, let it generate a chain of thought, score the final answer with a verifier or a reward model, and push probability mass toward the attempts that scored well. That is the minimum viable loop from the sibling course's lesson 01, run at scale.
The algorithms doing the pushing are exactly the ones from lessons 14–16: PPO's clipped surrogate, GRPO's group-baseline advantage, DPO's closed-form preference loss, all anchored to a reference policy by a KL term. The "actions" are tokens; the "episode" is a completion; the "advantage" is whether this completion beat its siblings. Nothing in the math changed — the policy just got large enough that the resulting behavior looks like thinking.
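Two of the pieces named above can be sketched in a few lines. This is a minimal illustration, assuming sequence-level log-probabilities and a group baseline normalized by the group's mean and standard deviation. The function names `group_advantages` and `clipped_surrogate` are hypothetical, and the clip range of 0.2 is a common default rather than a value this lesson specifies.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-baseline advantage: score each completion relative to its siblings."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for one completion (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(y|x) / pi_old(y|x)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

# Four completions for one prompt, scored 1 (verifier accepts) or 0 (rejects).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
```

With these rewards, the accepted completions get a positive advantage and the rejected ones a negative advantage of the same size, which is the "beat its siblings" signal in code form.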
Agents extend this one more step. A reasoning model that can call tools, read results, and act again is just an MDP whose actions include "emit text" and "call a function," whose state includes the tool outputs, and whose reward is task success. Multi-turn, tool-using agents are the sequential-decision setting of lesson 01 with a language model as the policy — which is why every variance, trust, and reward problem we studied comes back, harder.
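The agent-as-MDP reading can be made concrete with a toy loop. This is a sketch under loud assumptions: `scripted_policy` is a hand-written stand-in for a language model, `calculator` is a hypothetical tool, and the state is simply the transcript so far.

```python
def calculator(expr):
    """Hypothetical tool: adds two integers given as 'a+b'."""
    a, b = expr.split("+")
    return str(int(a) + int(b))

TOOLS = {"calculator": calculator}

def scripted_policy(state):
    """Stand-in for a language-model policy: maps the transcript to an action."""
    results = [text for kind, text in state if kind == "tool_result"]
    if not results:
        return ("call", "calculator", "2+3")  # action type 1: call a function
    return ("emit", results[-1])              # action type 2: emit text

def run_episode(policy, task, target, max_turns=4):
    """One episode: the state grows with tool outputs; reward is task success."""
    state = [("task", task)]
    for _ in range(max_turns):
        action = policy(state)
        if action[0] == "call":
            _, name, arg = action
            state.append(("tool_result", TOOLS[name](arg)))
        else:
            answer = action[1]
            return state, (1.0 if answer == target else 0.0)
    return state, 0.0  # ran out of turns without answering

trajectory, reward = run_episode(scripted_policy, "What is 2+3?", "5")
```

The reward arrives only at the end of the episode, so credit has to be assigned across every tool call that preceded it. That is one reason the variance problems from earlier lessons return in this setting.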
The synthesis: value × policy × model
The orientation promised one fork. Standing at the frontier, here is the whole map on one line, the way you should now read any new method: value × policy × model.
Every method in this course is a point in that three-axis space, plus an answer to where does the reward come from and how do we keep the update safe:
| Axis | Pure form | The reunion |
|---|---|---|
| Value | Q-learning, DQN (02, 06) | Actor–Critic (shared with the Policy row): a policy that learns, with a value that critiques it to cut variance (03, 08), made safe by a trust region (10) and scaled to LLMs (11–13). |
| Policy | REINFORCE, PG (03, 07) | Actor–Critic (same reunion as the Value row). |
| Model | DP, MCTS, world models (04) | Plan with a learned model; AlphaZero = policy + value + search. |
Hold that table next to the open problems below. Each problem is a place where one axis is still weak.
Interactive · the open-problems map
The open-problems map pairs each open problem with the lessons and methods that bear on it, explains why they bear on it, and notes what is still missing. The point of the exercise: no frontier problem is new. Each is a known tension from earlier in the course, not yet resolved at scale.
The open problems, each tied to a lesson
1 · Sample efficiency (04 · 15 · 16). Policy gradients are sample-hungry: on-policy methods (lesson 09) discard their data after every update, and a frontier RL run can burn millions of rollouts. The two known levers are both in this course: learn a model and plan inside it (model-based RL, lesson 07, taken to scale below as world models), and reuse fixed data (off-policy and offline RL, lessons 18–19). Neither is solved: learned models drift, and offline data has gaps.
2 · Reward specification — the hardest problem (13 · 14). Every algorithm in this course assumed a reward existed. The orientation called this the signal question, and it is where the most lessons converge because it is genuinely unsolved. When humans cannot score outputs directly, we learn a reward model from preferences (RLHF, lesson 16) or recover it from demonstrations (inverse RL, lesson 17) — but a learned reward is a proxy, and policies reward-hack the proxy. "Where does the reward come from, and is it the reward we meant?" is the question that does not go away. This is why reward specification lights up the most nodes in the map above.
3 · Safety & alignment (13 · 16–18). Two ideas from earlier lessons are central to the current alignment toolkit. The KL-to-reference anchor (lesson 16) keeps the trained policy from drifting away from a trusted base while it chases reward: the same trust-region instinct as TRPO/PPO, now used to bound how strange a model is allowed to become. And conservatism from offline RL (lessons 19–21: CQL's pessimism, BCQ's data constraint) is the principle "don't act confidently outside what you have evidence for", which is directly relevant to a model that must not take unsupported actions.
4 · Generalization (08 · 22 · 16). An advantage estimate that is right on the training distribution can be badly wrong off it (the bias–variance story of lesson 11), and a policy trained in simulation overfits to it (the sim-to-real gap, lesson 57). Offline RL's distributional-shift problem (lesson 19) is the same disease: the value function extrapolates into states it never saw. Robust generalization across distribution shift is open.
5 · Multi-agent & non-stationarity (23). Every method assumed a stationary environment. The moment another learning agent is present — autonomous driving among other drivers (lesson 58), markets full of other traders (lesson 59) — the environment is non-stationary because everyone is adapting at once. Exploration (lesson 08) and equilibrium learning get much harder; convergence guarantees mostly evaporate.
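Problems 2 and 3 above each rest on a small formula, sketched here under stated assumptions. The reward model is fit with a Bradley-Terry preference loss, a common choice that this lesson does not itself specify. The KL anchor is applied as a per-sample penalty on the log-ratio between the policy and the reference. The one-feature linear reward, all function names, and the value `beta=0.1` are hypothetical.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen output beats the rejected one."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def fit_reward_weight(pairs, lr=0.5, steps=200):
    """Fit a proxy reward r(x) = w * x on (chosen, rejected) feature pairs."""
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for x_chosen, x_rejected in pairs:
            diff = x_chosen - x_rejected
            sigmoid = 1.0 / (1.0 + math.exp(-w * diff))
            grad += -(1.0 - sigmoid) * diff  # d/dw of -log(sigmoid(w * diff))
        w -= lr * grad / len(pairs)
    return w

def kl_shaped_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    """Proxy reward minus a penalty for drifting from the reference policy."""
    return task_reward - beta * (logp_policy - logp_ref)

# Human preferences mostly favor the larger feature; one pair disagrees.
pairs = [(1.0, 0.0), (0.8, 0.3), (0.2, 0.6)]
w = fit_reward_weight(pairs)
shaped = kl_shaped_reward(task_reward=w * 1.0, logp_policy=-1.0, logp_ref=-2.0)
```

The fitted `w` only summarizes the preferences it saw. A policy that pushes the feature far beyond the range covered by the data will keep collecting proxy reward whether or not a human would agree, which is the reward-hacking risk. The `beta` penalty is the lever that limits how far the policy can move to do so.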
World models: lesson 07, taken to scale
Model-based RL in lesson 07 was modest: know P(s′|s,a) and R(s,a), then plan with dynamic programming or MCTS. The frontier version learns a high-capacity simulator of the world, a world model, and trains the policy mostly inside that learned dream, querying the real environment only to keep the model honest. This is the most direct route to sample efficiency: imagined rollouts cost compute but no real interaction. The open risk is the same one lesson 07 hinted at and lesson 19 made precise: the policy will exploit wherever the learned model is wrong.
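The learn-then-plan pattern can be shown at toy scale. This sketch assumes a hypothetical 4-state chain environment and a count-based tabular model in place of a high-capacity learned simulator. Planning is plain value iteration run inside the fitted model, and all names are illustrative.

```python
from collections import defaultdict

N_STATES, ACTIONS, GOAL = 4, (0, 1), 3

def real_step(s, a):
    """Hypothetical real environment: action 1 moves right, action 0 moves left."""
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0)

def fit_model(transitions):
    """Count-based model: next-state frequencies and mean reward per (s, a)."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a)] += r
    model = {}
    for sa, nexts in counts.items():
        n = sum(nexts.values())
        model[sa] = ({s2: c / n for s2, c in nexts.items()}, reward_sum[sa] / n)
    return model

def q_value(model, v, s, a, gamma):
    probs, r = model[(s, a)]
    return r + gamma * sum(p * v[s2] for s2, p in probs.items())

def plan(model, gamma=0.9, sweeps=50):
    """Value iteration inside the learned model; no real-environment calls."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(GOAL):  # GOAL is terminal, so its value stays 0
            v[s] = max((q_value(model, v, s, a, gamma)
                        for a in ACTIONS if (s, a) in model), default=0.0)
    return v

# Real interaction is used only to collect data for the model.
data = []
for s in range(GOAL):
    for a in ACTIONS:
        s2, r = real_step(s, a)
        data.append((s, a, r, s2))

model = fit_model(data)
values = plan(model)
policy = [max(ACTIONS, key=lambda a: q_value(model, values, s, a, 0.9))
          for s in range(GOAL)]
```

Here the data covers every state-action pair, so the model is exact and the plan is sound. Remove or corrupt a few transitions and the planner will optimize against the corrupted model just as confidently, which is the exploitation risk in miniature.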
Human-AI collaboration: RL as a feedback loop with us in it
RLHF (lesson 16) is usually described as an algorithm, but it is better seen as a collaboration protocol: the human supplies evaluative feedback (the only thing RL ever needed; see the orientation), the model proposes, the human or a learned proxy judges, and the policy updates. The loop is the agent–environment loop of lesson 01 with a human partly playing the environment. The frontier is making that loop richer: feedback on reasoning steps, not just final answers; online correction; the model asking for help when uncertain. Human-AI collaboration is RLHF generalized into a standing partnership rather than a one-time training phase.
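One of those richer loops, asking for help when uncertain, can be sketched as a simple gate. This assumes the model exposes a probability for each candidate answer and uses entropy as the uncertainty measure. The threshold `max_entropy` and the `ask_human` callback are hypothetical, and real systems may need better-calibrated uncertainty than raw output probabilities.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def act_or_ask(candidates, probs, ask_human, max_entropy=0.5):
    """Return the model's top candidate, or defer to the human when unsure."""
    if entropy(probs) > max_entropy:
        return ask_human(candidates), "human"
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best], "model"

# A confident distribution is answered by the model; a flat one is deferred.
confident = act_or_ask(["a", "b"], [0.99, 0.01], ask_human=lambda c: c[1])
uncertain = act_or_ask(["a", "b"], [0.5, 0.5], ask_human=lambda c: c[1])
```

Each deferral also produces a fresh piece of human feedback, so the same gate that protects the user can feed the next round of training.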
Reading the whole arc backward
From the frontier, the course reads as one argument. The fork (value vs policy, lessons 01–06) was scaled with deep nets (06–07), its variance tamed with advantages and GAE (08), its updates made safe with importance sampling and trust regions (09–10), then handed to language models as PPO/DPO/GRPO/RLHF (11–13). Where the reward itself was missing, imitation and inverse RL stepped in (14). Where the world was continuous, the simulator dangerous, or interaction impossible, continuous control and offline RL stepped in (15–18). The applications (19–30) were that machine pointed at real domains. And the frontier — reasoning, agents, world models, alignment — is the same machine, the same open tensions, at a scale where the tensions decide whether the system is trustworthy.