Frontiers & outlook — from AGI to human-AI collaboration

Thirty lessons built one machine: an MDP, two ways to solve it, and a long campaign to scale, stabilize, and source the reward. This lesson points that machine at the open frontier — reasoning models, agents, world models, alignment — and asks which of the unsolved problems each of the earlier lessons actually bears on.

Where we are: RL is the engine of reasoning and agency

For most of this course RL was a way to train a game-player or a robot. As of the mid-2020s it is the thing that turned a next-token predictor into a reasoner. The recipe is short and you already know every piece of it: take a pretrained language model as the policy π_θ, let it generate a chain of thought, score the final answer with a verifier or a reward model, and push probability mass toward the attempts that scored well. That is the minimum viable loop from the sibling course's lesson 01, run at scale.

The algorithms doing the pushing are exactly the ones from lessons 11–13: PPO's clipped surrogate, GRPO's group-baseline advantage, DPO's closed-form preference loss, all anchored to a reference policy by a KL term. The "actions" are tokens; the "episode" is a completion; the "advantage" is whether this completion beat its siblings. Nothing in the math changed; the policy just got large enough that the resulting behavior looks like thinking.
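
The phrase "whether this completion beat its siblings" can be made concrete with a minimal sketch of a group-baseline advantage. It assumes the mean-and-standard-deviation normalization commonly associated with GRPO; the function name `groupAdvantages`, the `eps` constant, and the toy rewards are illustrative, not taken from the course.

```javascript
// Group-relative advantage: score each completion against its siblings
// sampled for the same prompt. Hypothetical helper, not the course's code.
function groupAdvantages(rewards, eps = 1e-8) {
  const n = rewards.length;
  const mean = rewards.reduce((acc, r) => acc + r, 0) / n;
  const variance = rewards.reduce((acc, r) => acc + (r - mean) ** 2, 0) / n;
  const std = Math.sqrt(variance);
  // Positive: better than the group average. Negative: worse.
  return rewards.map((r) => (r - mean) / (std + eps));
}

// Four completions for one prompt, scored 1 (correct) or 0 (wrong) by a verifier.
const advantages = groupAdvantages([1, 0, 0, 1]);
console.log(advantages); // roughly +1 for the correct pair, -1 for the wrong pair
```

Because the baseline is the group's own mean, no separate value network is needed for this estimate, which is one reason the approach is attractive when the policy is a large language model.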

The hand-off, restated

This theory course ends where the RL Post-Training systems course begins. There you will find the same PPO / GRPO / RLHF / DPO machinery as it actually runs on a GPU cluster: rollout engines, weight sync, kernels. Reasoning models and agents are RL post-training; that is the dominant frontier today.

Agents extend this one more step. A reasoning model that can call tools, read results, and act again is just an MDP whose actions include "emit text" and "call a function," whose state includes the tool outputs, and whose reward is task success. Multi-turn, tool-using agents are the sequential-decision setting of lesson 01 with a language model as the policy — which is why every variance, trust, and reward problem we studied comes back, harder.
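
The MDP reading of a tool-using agent can be sketched as a small episode loop. Everything in it is a hypothetical stand-in chosen for illustration: the `add` tool, the scripted `policy` (which replaces the language model), and the toy task. The point is only the shape: the state accumulates tool outputs, the action is either "call a function" or "emit text," and the reward is task success.

```javascript
// Toy tool-using episode viewed as an MDP. The tool, the scripted policy,
// and the task are hypothetical stand-ins, not a real agent framework.
const TOOLS = {
  add: ({ a, b }) => a + b, // the only function the agent may call
};

// Stand-in for the language-model policy: maps state -> action.
function policy(state) {
  const lastTool = state.history.filter((h) => h.type === 'tool_result').pop();
  if (!lastTool) return { type: 'call', name: 'add', args: { a: 2, b: 3 } };
  return { type: 'text', content: String(lastTool.value) }; // final answer
}

function runEpisode(task, maxTurns = 4) {
  const state = { prompt: task.prompt, history: [] }; // state includes tool outputs
  for (let t = 0; t < maxTurns; t++) {
    const action = policy(state);
    state.history.push(action);
    if (action.type === 'call') {
      // Environment transition: run the tool and append its output to the state.
      const value = TOOLS[action.name](action.args);
      state.history.push({ type: 'tool_result', value });
    } else {
      // Emitting text ends the episode; reward is task success.
      const reward = action.content === task.answer ? 1 : 0;
      return { reward, turns: t + 1, state };
    }
  }
  return { reward: 0, turns: maxTurns, state }; // ran out of turns
}

const result = runEpisode({ prompt: 'What is 2 + 3?', answer: '5' });
console.log(result.reward, result.turns);
```

The reward arrives only at the end of a multi-turn episode, which is the sparse, delayed-credit setting where the variance problems from earlier lessons bite hardest.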

The synthesis: value × policy × model

The orientation promised one fork. Standing at the frontier, here is the whole map on one line, the way you should now read any new method:

solve an MDP ⇒ learn a value (how good?) × learn a policy (what to do?) × learn a model (what happens next?)

Every method in this course is a point in that three-axis space, plus an answer to two questions (where does the reward come from, and how do we keep the update safe?):

Axis	Pure form	The reunion
Value	Q-learning, DQN (02, 06)	Actor–Critic (shared by the Value and Policy rows): a policy that learns, with a value that critiques it to cut variance (03, 08), made safe by a trust region (10) and scaled to LLMs (11–13).
Policy	REINFORCE, PG (03, 07)	Actor–Critic (same cell as the Value row).
Model	DP, MCTS, world models (04)	Plan with a learned model; AlphaZero = policy + value + search.

Hold that table next to the open problems below. Each problem is a place where one axis is still weak.

Interactive · the open-problems map

Click an open problem on the left. The map highlights the lessons and methods that bear on it, and the panel explains why — and what is still missing. The point of the exercise: no frontier problem is new. Each is a known tension from earlier in the course, not yet resolved at scale.

Open problems × the lessons that bear on them

Pick a problem. Lit-up nodes are the lessons whose ideas attack it; the brightest node is where the problem is sharpest. Reward specification lights up the most — that is the tell.

Each open problem traces back to a tension we already named. Click one to see which lessons hold the partial answers — and where the answer runs out.

The JS that runs this widget:

// Each problem maps to the lesson-node ids that bear on it, plus the "core" node
// (the lesson where the problem is sharpest).
const PROBLEMS = {
  sample_eff: { core: 'L04', nodes: ['L04','L15','L16'] },
  reward:     { core: 'L13', nodes: ['L13','L14','L05','L29'] },  // the hardest: lights up most
  safety:     { core: 'L13', nodes: ['L13','L16','L17','L18'] },
  general:    { core: 'L08', nodes: ['L08','L22','L16'] },
  multiagent: { core: 'L23', nodes: ['L23','L05','L24'] },
};

// NODES (id -> node state) and draw() are assumed to be defined elsewhere in the widget.
function pick(key) {
  const p = PROBLEMS[key];
  if (!p) return;                          // ignore unknown problem keys
  for (const id in NODES) {
    NODES[id].on = p.nodes.includes(id);   // lit if the lesson bears on the problem
    NODES[id].core = (id === p.core);      // the brightest node
  }
  draw();
}

The open problems, each tied to a lesson

1 · Sample efficiency (04 · 15 · 16). Policy gradients are sample-hungry: on-policy methods (lesson 09) throw data away every step, and a frontier RL run can burn millions of rollouts. The two known levers are both in this course: learn a model and plan inside it (model-based RL, lesson 04, taken to scale below as world models), and reuse fixed data (off-policy and offline RL, lessons 15–16). Neither is solved: learned models drift, and offline data has gaps.

2 · Reward specification: the hardest problem (13 · 14). Every algorithm in this course assumed a reward existed. The orientation called this the signal question, and it is where the most lessons converge because it is genuinely unsolved. When humans cannot score outputs directly, we learn a reward model from preferences (RLHF, lesson 13) or recover it from demonstrations (inverse RL, lesson 14), but a learned reward is a proxy, and policies reward-hack the proxy. "Where does the reward come from, and is it the reward we meant?" is the question that does not go away. This is why reward specification lights up the most nodes in the map above.

3 · Safety & alignment (13 · 16–18). Two ideas from earlier lessons make up the current alignment toolkit. The KL-to-reference anchor (lesson 13) keeps the trained policy from drifting away from a trusted base while it chases reward; it is the same trust-region instinct as TRPO/PPO, now used to bound how strange a model is allowed to become. And conservatism from offline RL (lessons 16–18: CQL's pessimism, BCQ's data constraint) is the principle "don't act confidently outside what you have evidence for," which is directly relevant to a model that must not take unsupported actions.

4 · Generalization (08 · 22 · 16). An advantage estimate that is right on the training distribution can be badly wrong off it (the bias–variance story of lesson 08), and a policy trained in simulation overfits to it (the sim-to-real gap, lesson 22). Offline RL's distributional-shift problem (lesson 16) is the same disease: the value function extrapolates into states it never saw. Robust generalization across distribution shift remains open.

5 · Multi-agent & non-stationarity (23). Every method assumed a stationary environment. The moment another learning agent is present, as in autonomous driving among other drivers (lesson 23) or markets full of other traders (lesson 24), the environment is non-stationary because everyone is adapting at once. Exploration (lesson 05) and equilibrium learning get much harder; convergence guarantees mostly evaporate.
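
Two mechanisms named in the list above, the preference-trained reward model of problem 2 and the KL-to-reference anchor of problem 3, can be sketched in a few lines. This is a minimal illustration that assumes the Bradley–Terry form of the preference loss and a single-sample estimate of the KL term; the function names, the default `beta`, and the toy numbers are hypothetical, not taken from the course.

```javascript
// Sketch only: hypothetical helper names, toy numbers.

// Bradley-Terry preference loss for one comparison:
// -log sigmoid(rChosen - rRejected), written in a numerically stable form.
function preferenceLoss(rChosen, rRejected) {
  const d = rChosen - rRejected;
  return d >= 0 ? Math.log1p(Math.exp(-d)) : -d + Math.log1p(Math.exp(d));
}

// KL-anchored reward for one sampled completion: the task reward minus
// beta times a single-sample estimate of KL(policy || reference),
// i.e. the summed per-token log-probability gap.
function klAnchoredReward(reward, logpPolicy, logpRef, beta = 0.1) {
  let klEstimate = 0;
  for (let t = 0; t < logpPolicy.length; t++) {
    klEstimate += logpPolicy[t] - logpRef[t];
  }
  return reward - beta * klEstimate;
}

// The reward model is penalized more when it ranks the rejected answer higher.
console.log(preferenceLoss(2.0, 0.0), preferenceLoss(0.0, 2.0));
// A completion the policy favors more than the reference does pays a KL cost.
console.log(klAnchoredReward(1.0, [-0.5, -0.5], [-1.0, -1.0]));
```

Note what the sketch does not fix: `klAnchoredReward` limits how far the policy drifts, but if the reward passed in is itself a flawed proxy, the anchor only slows the exploitation.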

World models: lesson 04, taken to scale

Model-based RL in lesson 04 was modest: know P(s′|s,a) and R(s,a), then plan with dynamic programming or MCTS. The frontier version learns a high-capacity simulator of the world, a world model, and trains the policy mostly inside that learned dream, querying the real environment only to keep the model honest. This is the most direct route to sample efficiency: imagined rollouts cost no real-environment interaction. The open risk is the same one lesson 04 hinted at and lesson 16 made precise: the policy will exploit wherever the learned model is wrong.
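
A toy, Dyna-style sketch shows the "train inside the learned model" idea at the smallest possible scale. The five-state chain environment, the tabular model, and the hyperparameters are all assumptions made for this example. Because the toy environment is deterministic, the learned model here happens to be exact, so the sketch shows the sample-efficiency benefit but not the model-exploitation risk.

```javascript
// Toy sketch: learn a tabular model from a few real transitions, then do all
// value updates on imagined transitions. Environment, names, and
// hyperparameters are hypothetical, chosen only to keep the example small.
const N_STATES = 5;      // states 0..4; state 4 is terminal
const ACTIONS = [0, 1];  // 0 = left, 1 = right

// The "real" environment: costly to query in practice.
function realStep(s, a) {
  const next = a === 1 ? Math.min(s + 1, N_STATES - 1) : Math.max(s - 1, 0);
  const done = next === N_STATES - 1;
  return { next, reward: done ? 1 : 0, done };
}

// 1) Query the real environment once per (state, action) and store the result.
const model = new Map(); // learned model: "s,a" -> { next, reward, done }
let realQueries = 0;
for (let s = 0; s < N_STATES - 1; s++) {
  for (const a of ACTIONS) {
    model.set(`${s},${a}`, realStep(s, a));
    realQueries++;
  }
}

// 2) Run Q-learning updates entirely inside the learned model.
const Q = Array.from({ length: N_STATES }, () => [0, 0]);
const gamma = 0.9;
const alpha = 0.5;
let imaginedUpdates = 0;
for (let sweep = 0; sweep < 50; sweep++) {
  for (const [key, { next, reward, done }] of model) {
    const [s, a] = key.split(',').map(Number);
    const target = reward + (done ? 0 : gamma * Math.max(...Q[next]));
    Q[s][a] += alpha * (target - Q[s][a]);
    imaginedUpdates++;
  }
}

// 3) Read off the greedy policy for the non-terminal states.
const greedy = Q.slice(0, N_STATES - 1).map((q) => (q[1] > q[0] ? 1 : 0));
console.log({ realQueries, imaginedUpdates, greedy });
```

The ratio of imagined updates to real queries is the whole appeal. With a learned neural simulator instead of an exact table, the same loop would also propagate the model's errors into `Q`.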

The recurring villain

Notice how often "the model / value / reward is a proxy, and the policy exploits the gap" recurs: reward hacking (13), OOD extrapolation in offline RL (16), sim-to-real overfit (22), world-model exploitation. It is one phenomenon. A policy optimizes exactly what you wrote down — never what you meant. Most frontier RL safety work is some answer to this single sentence.
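
The single sentence above can be shown with a deliberately artificial example. Both reward functions below are made up for illustration: the proxy agrees with the true reward on the range where it was "checked" (x ≤ 5) and diverges beyond it. A greedy optimizer that only ever sees the proxy climbs straight through the region of agreement and keeps going.

```javascript
// Toy Goodhart demo. Both reward functions are invented for illustration.
const trueReward = (x) => (x <= 5 ? x : 10 - x); // what we meant
const proxyReward = (x) => x;                    // what we wrote down; agrees only for x <= 5

// Greedy hill-climbing on the proxy: the optimizer never sees trueReward.
function optimizeProxy(steps) {
  let x = 0;
  const trace = [];
  for (let i = 0; i < steps; i++) {
    if (proxyReward(x + 1) > proxyReward(x)) x += 1;
    trace.push({ x, proxy: proxyReward(x), actual: trueReward(x) });
  }
  return trace;
}

const trace = optimizeProxy(10);
console.log(trace[4]); // inside the region where proxy and true reward agree
console.log(trace[9]); // far outside it
```

After five steps the two rewards still match; after ten the proxy has doubled while the true reward has fallen back to zero. More optimization pressure made the outcome worse, not better.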

Human-AI collaboration: RL as a feedback loop with us in it

RLHF (lesson 13) is usually described as an algorithm, but it is better seen as a collaboration protocol: the human supplies evaluative feedback (the only thing RL ever needed; see the orientation), the model proposes, the human or a learned proxy judges, and the policy updates. The loop is the agent–environment loop of lesson 01 with a human partly playing the environment. The frontier is making that loop richer: feedback on reasoning steps, not just final answers; online correction; the model asking for help when uncertain. Human-AI collaboration is RLHF generalized into a standing partnership rather than a one-time training phase.

Reading the whole arc backward

From the frontier, the course reads as one argument. The fork (value vs policy, lessons 01–06) was scaled with deep nets (06–07), its variance tamed with advantages and GAE (08), its updates made safe with importance sampling and trust regions (09–10), then handed to language models as PPO/DPO/GRPO/RLHF (11–13). Where the reward itself was missing, imitation and inverse RL stepped in (14). Where the world was continuous, the simulator dangerous, or interaction impossible, continuous control and offline RL stepped in (15–18). The applications (19–30) were that machine pointed at real domains. And the frontier — reasoning, agents, world models, alignment — is the same machine, the same open tensions, at a scale where the tensions decide whether the system is trustworthy.

Takeaway

RL did not get replaced at the frontier; it became the engine of reasoning and agency, and its oldest tensions became the headline problems. Sample efficiency is still lesson 04's model question; alignment is still the KL anchor of lesson 13 and the conservatism of lessons 16–18; and reward specification ("where does the reward come from, and is it the one we meant?") is the single hardest problem because a policy optimizes exactly what you wrote down, never what you intended. You now hold the whole map: value × policy × model, made safe, given a source for its reward, and pointed at the open frontier.