rl_foundations / lessons / 04 · planning lesson 04 / 32

Model & planning — from dynamic programming to MCTS

Lessons 02 and 03 learn by sampling — bootstrapping from transitions, averaging over rollouts — because the model is unknown. But if you know the model P(s'|s,a) and R(s,a) (a board game's rules, a simulator, or a model you've learned), you don't have to wait for experience. You can plan.

What the sampling methods were really doing

Recall the common thread from the last two lessons. Q-learning updated Q ← Q + α(r + γ maxa' Q(s',a') − Q) from observed transitions; REINFORCE averaged returns over sampled trajectories. Both sample for one reason: they don't have the MDP's two functions in hand —

Sampling is the workaround for not knowing P and R. When you do know them — chess and Go have explicit rules, a physics simulator returns exact next states, a robotics model can be fit from data — a whole different toolkit opens up. You can compute the answer the Bellman equations define, instead of nudging toward it one noisy sample at a time. This is the model-based world, and its first tool is dynamic programming.

Dynamic programming: the ground-truth Bellman solver

Lesson 01 wrote the Bellman optimality equation. With a known model it is not just a description of V* — it is an algorithm. Sweep it over every state until it stops changing. That is value iteration:

Vk+1(s)  ←  maxa [ R(s,a) + γ ∑s' P(s'|s,a) · Vk(s') ]

Read it literally: the new value of a state is the best action's immediate reward plus the discounted, probability-weighted value of where that action lands you. Repeat the sweep; each pass propagates value one step further outward from the rewards. The iteration is a contraction — every sweep shrinks the gap to the true V* by a factor of γ — so it converges to the unique fixed point V*. Read the greedy action off V* and you have π*.

Its sibling, policy iteration, alternates two exact steps:

policy iteration:
    1. POLICY EVALUATION   solve V^π for the current π   (the Bellman EXPECTATION equation, lesson 01)
    2. POLICY IMPROVEMENT  π'(s) ← argmax_a [ R(s,a) + γ Σ_s' P(s'|s,a) V^π(s') ]
    repeat until π stops changing  ->  π = π*

Both are guaranteed to reach the optimum, in a finite number of steps for policy iteration. Note the key word in the update: that s' P(s'|s,a) is the exact expectation over next states. TD learning in lesson 02 replaced that sum with a single sampled s' — bootstrapping is precisely "do the DP backup, but with one Monte-Carlo sample instead of the full model-weighted sum."

The relationship to remember
Dynamic programming is the ground truth that temporal-difference learning approximates. The TD target r + γ V(s') is a one-sample stand-in for the DP backup's full expectation R(s,a) + γ ∑s' P(s'|s,a) V(s'). Model-free RL exists because we usually can't afford — or don't possess — the model that DP requires.

Why we can't always just run DP: the curse of dimensionality

Value iteration sweeps every state, every sweep. That is fine for a 4×4 gridworld. It is hopeless almost everywhere else, because real state spaces are astronomically large:

So DP gives the exact answer but cannot be run; sampling methods scale but throw away the model we might actually have. The synthesis is sample-based planning: keep the model, but instead of sweeping all states, use the model to look ahead only along the trajectories that matter — the ones a good policy would actually visit. Spend computation where it counts.

Monte-Carlo Tree Search: planning by selective lookahead

MCTS builds a search tree rooted at the current state. The root's children are the actions; their children are the resulting states, and so on. It does not expand the whole tree (that is the curse again). It grows the tree asymmetrically, pouring simulations down the most promising lines while still occasionally probing neglected ones. Each simulation is four steps:

  1. Selection. From the root, walk down the existing tree, at each node picking the child that maximizes a score that balances looking good so far against barely explored (the UCT rule below) — until you reach a node with an unexpanded action.
  2. Expansion. Add one new child node for an untried action — the tree grows by one leaf.
  3. Simulation / rollout. From that new leaf, play out to the end with a cheap default policy (often random, or a learned one) and record the outcome. Using the model to play forward is the whole point — no real-world interaction needed.
  4. Backup. Propagate that outcome back up the path: every node on it increments its visit count N and updates its mean value Q.

Run this thousands of times; the visit counts at the root concentrate on the best action, and you play it. The genius is in step 1 — how to choose which child to descend into.

UCT: the same explore/exploit math you'll meet next lesson

At a node visited N times, child i has estimated value Qi (exploit: how good it has looked) and visit count ni (how sure we are). UCT — Upper Confidence bounds applied to Trees — selects the child maximizing:

UCT(i)  =  Qi  +  c · √( ln N / ni )

The first term exploits — favor children that have scored well. The second term explores — it is large for children with small ni (rarely tried) and shrinks as ni grows; the ln N on top keeps every child eventually getting revisited. The constant c sets the trade-off.

Forward reference — this bonus is not new
That c · √(ln N / ni) exploration bonus is literally the UCB1 formula from the multi-armed bandit, applied recursively at every tree node — which is why it's called Upper Confidence bounds for Trees. Lesson 05 strips this dilemma down to a single state (no tree) to study it in its purest form. When you see UCB next lesson, recognize it: you already used it here.

Set c too small and UCT becomes nearly greedy: it pours every simulation down whatever line looked best first, never auditing alternatives, and gets stuck on a myopic, often wrong, move. Set c too large and it spreads simulations almost uniformly, refusing to commit — its budget is wasted re-checking obviously bad lines instead of deepening the good one. There is a sweet spot, and the widget below lets you feel both failure modes.

The model-based ↔ model-free spectrum

Model-free (02, 03)Model-based (this lesson)
Needs P, R?No — learns from samplesYes — given, or learned, then used
Sample efficiencyLow (must experience everything)High (reuse the model to imagine)
Cost per decisionCheap (look up Q or π)Expensive (plan/search at decision time)
Fails when…Interaction is scarce/dangerousThe model is wrong (errors compound)

It is a spectrum, not a binary. DP sits at the pure model-based end (full sweep). MCTS sits in the middle: it has the model but only simulates selectively. And when the true model is unavailable, you can learn one from data and plan inside it — model-based RL, which reappears for sample-efficient robotics in lesson 22.

AlphaGo / AlphaZero: the fork and the search, fused

This lesson's punchline is that the strongest game-playing systems are not "value-based" or "policy-based" or "search" — they are all three at once, and you have now met every ingredient:

POLICY net πθ prunes the search — which moves to try (L03) VALUE net Vφ replaces the rollout — scores a leaf instantly (L02) MCTS (UCT) looks ahead in the model — turns both into a move (L04) AlphaZero: self-play search produces better moves than the raw nets, which become training targets for the nets — search and learning bootstrap each other.

The policy network biases selection toward sensible moves (so UCT doesn't waste its bonus on nonsense). The value network supplies the leaf estimate, replacing the random rollout in step 3 with a learned V. MCTS then searches the known game model and outputs a move stronger than either net alone. The search results train the nets; the better nets sharpen the next search. It is the value branch (02), the policy branch (03), and planning (04) welded together by a known model.

Interactive · MCTS tree growth and the UCT constant c

A toy decision tree: the root has several actions, each leading to a subtree whose leaves hide a true value. One branch is genuinely best, but a decoy branch looks slightly better on the very first random rollout. Click Simulate to run MCTS steps — selection, expansion, rollout, backup — and watch the tree grow (node size ∝ visits, fill ∝ estimated Q). The KPI tells you whether the root is currently committing to the truly-best action.

MCTS — grow the tree, tune the UCT exploration constant c
Drag c to the far left (≈0): nearly greedy — it locks onto the decoy it saw first and never recovers (the bug is the lesson). Far right (large): it spreads simulations almost uniformly and is slow to commit. The middle finds the best action with the fewest simulations.
Simulations
0
Tree nodes
1
Root's choice
Optimal?
Show the core MCTS loop (≈24 lines)
// one MCTS simulation = select -> expand -> rollout -> backup
function simulate(root, c){
  let node = root, path = [root];
  // 1. SELECTION: descend by UCT until a node with untried children
  while (node.children.length && node.untried.length === 0){
    node = bestUCT(node, c);            // argmax_i  Q_i + c*sqrt(ln N / n_i)
    path.push(node);
  }
  // 2. EXPANSION: add one untried child
  if (node.untried.length){
    const a = node.untried.pop();
    node = node.addChild(a);
    path.push(node);
  }
  // 3. ROLLOUT: cheap playout from the new leaf, return its outcome
  const value = rollout(node);          // here: noisy sample of the leaf's true value
  // 4. BACKUP: update visits N and mean value Q along the path
  for (const nd of path){ nd.N += 1; nd.W += value; nd.Q = nd.W / nd.N; }
}
function bestUCT(node, c){
  const N = node.N;
  return node.children.reduce((best, ch) =>
    (ch.Q + c*Math.sqrt(Math.log(N)/ch.n)) >
    (best.Q + c*Math.sqrt(Math.log(N)/best.n)) ? ch : best);
}

The takeaway the widget hammers home: at small c, the second term of UCT is near zero, so selection is pure exploitation. One unlucky early rollout that flatters the decoy becomes a self-fulfilling prophecy — every later simulation re-confirms it and the root commits to the wrong move. The exploration bonus exists precisely to keep auditing the alternatives until the truth surfaces. That is the explore/exploit dilemma, and it deserves a lesson of its own.

Where this points next

MCTS leaned entirely on one mechanism — UCT's c · √(ln N / n) bonus — to decide, at every node, whether to exploit the best-looking child or explore an under-tried one. That single tension shows up in every RL method (it's why lesson 02 used ε-greedy too), but here it was tangled up inside a whole tree.

To understand it cleanly, strip everything else away: no transitions, no tree, no credit assignment over time — just one state and a row of slot-machine arms, each with an unknown payoff. That is the multi-armed bandit, the purest laboratory for exploration vs exploitation. There you'll meet UCB (the parent of UCT), ε-greedy, and Thompson sampling head to head. Lesson 05.

Takeaway
If you know the model P, R you can plan instead of sample. Dynamic programming (value/policy iteration) sweeps the Bellman equations to the exact V* — it is the ground truth TD approximates with single samples — but the curse of dimensionality forbids sweeping huge state spaces. Sample-based planning fixes this: MCTS grows a tree selectively, using UCT = Q + c·√(ln N / n) (the same UCB bonus as lesson 05) to balance exploit vs explore. AlphaZero fuses the policy net (03), the value net (02), and MCTS search (04) into one system where search and learning bootstrap each other.