Value-based — from Q-learning to DQN

Lesson 01 left us with the Bellman expectation equation — which needs a policy plugged in. One small change, swapping the average for a max, removes the policy entirely and lets us learn the optimal value Q* directly. That change is the whole of value-based RL.

The delta from lesson 01

Lesson 01 defined the action-value of a fixed policy π and gave its Bellman expectation equation — value of an action = immediate reward plus the discounted value of whatever the policy does next:

Q^π(s,a) = 𝔼_s′ [ R(s,a) + γ · 𝔼_a′∼π Q^π(s′,a′) ]

That inner expectation 𝔼_a′∼π is the catch: it bakes in a policy you must already have. To evaluate a policy, fine. But our goal (lesson 01) is the optimal policy π*, and we'd like to find it without first guessing one.

Here is the one idea. The optimal policy is greedy: in any state it picks the action with the largest value. So in the equation above, replace "what the policy does next" with "the best action next." The average over a′ becomes a max over a′, and the policy vanishes:

Q*(s,a) = 𝔼_s′ [ R(s,a) + γ · max_a′ Q*(s′,a′) ]

This is the Bellman optimality equation. It is self-referential — Q* appears on both sides — but it no longer mentions any policy. If we can solve it, the optimal policy is just π*(s) = argmax_a Q*(s,a). That last line is the fork from orientation in action: we never learn a policy object. We learn a number, and the policy falls out of it by greedy choice.

Expectation vs optimality, in one line

Expectation (lesson 01): "how good is acting under this policy?" — uses 𝔼_a′∼π. Optimality (this lesson): "how good is acting optimally?" — uses max_a′. The max is the entire move.

From an equation we can't solve to an update we can run

We don't know the transition dynamics or the reward function, so we can't take that expectation 𝔼_s′ analytically. But we don't have to. We act in the world and observe one real transition (s, a, r, s′) — one sample of what was inside the expectation. That sample suggests a target for Q(s,a):

target = r + γ · max_a′ Q(s′,a′)

Our current estimate Q(s,a) is probably wrong. The gap between the target and the estimate is the temporal-difference (TD) error:

δ = r + γ · max_a′ Q(s′,a′) − Q(s,a)

If δ > 0, the action was better than we thought — nudge the estimate up. If δ < 0, nudge it down. The size of the nudge is set by a learning rate α ∈ (0,1]. That is TD(0) learning, and when the target uses the max it is exactly Q-learning:

Q(s,a) ← Q(s,a) + α [ r + γ · max_a′ Q(s′,a′) − Q(s,a) ]

The phrase "use your own current estimate Q(s′,·) inside the target" is called bootstrapping: you learn a guess from a guess. It is what lets you update after a single step instead of waiting for the episode to end — and, as we'll see, it is also the first ingredient of the bug that ends this lesson.

Off-policy: behave one way, learn another

There's a subtlety hiding in that max. To learn we must visit states and try actions — we must explore. But the target assumes we act greedily next (the max). So the agent behaves one way and learns the value of behaving a different way. Concretely, behavior is ε-greedy:

with probability ε:   pick a random action          (explore)
otherwise:            pick argmax_a Q(s,a)           (exploit)

…while the target always uses max_a′ Q(s′,a′) — the greedy value, regardless of the random action exploration actually took. Learning a target policy (greedy) different from the behavior policy (ε-greedy) is what off-policy means. It's a feature: you can explore wildly and still converge to Q*, because the max in the target keeps aiming at the optimal policy no matter how you collected the data.

Why this connects to the fork

Q-learning never represents a policy. The greedy rule argmax_a Q(s,a) is the implicit policy read off the value at the moment you need to act. Hold this thought: lesson 03 is what happens when that argmax becomes impossible to compute, forcing us to store a policy explicitly.

Where the table breaks

So far Q is a lookup table: one cell per (s,a) pair. That's fine for a 5×5 gridworld with 4 actions — 100 cells. It is hopeless for Atari from pixels (a state is an image: more configurations than atoms in the universe) or for a robot's continuous sensors. You will never visit most states even once, and a table can't say anything about a state it hasn't seen.

The fix is function approximation: replace the table with a parameterized function Q(s,a;θ) — a neural network — that generalizes from visited states to similar unseen ones. Train it by gradient descent on the squared TD error, treating the target as a regression label:

L(θ) = 𝔼 [ ( r + γ · max_a′ Q(s′,a′;θ) − Q(s,a;θ) )² ]

This is DQN in spirit — the network that learned to play Atari from pixels. But written naively this loss is unstable, and the reason is deep enough that it has a name.

The deadly triad

Three ingredients, each individually fine, become explosive together:

Ingredient	What it is	Where we use it
Bootstrapping	target uses your own estimate Q(s′,·)	the γ max Q term
Off-policy	learn the greedy policy while behaving ε-greedy	the max vs ε-greedy behavior
Function approximation	a network that ties all states' values together	Q(s,a;θ)

The trouble: the regression label r + γ max_a′ Q(s′,a′;θ) depends on the very θ you are updating. You chase a moving target. Because the network generalizes, raising Q at s also raises it at the similar state s′ — which raises the target — which raises Q(s) again. The feedback loop can spiral Q to infinity. This is the deadly triad (Sutton & Barto): bootstrapping + off-policy + function approximation can diverge even on tiny problems.

DQN tames it with two devices, both of which simply break the feedback loop:

Experience replay. Store transitions (s,a,r,s′) in a buffer and train on random minibatches from it. Consecutive samples in RL are highly correlated (you walk through nearby states); shuffling the buffer decorrelates them so the network sees something closer to i.i.d. data, the regime SGD was built for.
Target network. Keep a frozen copy of the weights, θ⁻, and compute the target from that: r + γ max_a′ Q(s′,a′;θ⁻). Now the label stops moving while you fit it. Every N steps you copy θ → θ⁻. The target is stale, but stable — and stable beats fast here.

L(θ) = 𝔼_{(s,a,r,s′)∼buffer} [ ( r + γ · max_a′ Q(s′,a′;θ⁻) − Q(s,a;θ) )² ]

Read the change: the target uses the frozen θ⁻, the prediction uses the live θ. That single decoupling is the difference between a network that learns and one that diverges. The widget below lets you turn it off and watch the triad bite.

Interactive · gridworld Q-learning, with the triad bug on a switch

A 5×5 gridworld. The agent (blue) starts top-left; the goal (green ★) is bottom-right with reward +1; the red pit gives −1; every step costs a small −0.02 to encourage short paths. Each cell shows its greedy value as a shaded patch and the greedy action as an arrow. Press Train and watch the arrows organize into a path to the goal as Q converges.

The Target network toggle is the lesson. With it on, learning is stable. Turn it off and crank α up: to mimic function approximation the demo lets each update "leak" into neighboring cells (generalization), recreating the deadly triad on a table — the values blow up and the policy turns to noise. The bug is the lesson.

Gridworld Q-learning — values, greedy arrows, and the deadly triad

Train to watch Q converge and arrows point home. Toggle the target network OFF (and raise α) to make the bootstrapped, generalizing, off-policy update diverge.

α: 0.30 ε: 0.20 Target network (stable)

Updates

Max |Q|

0.00

Greedy success

—

Status

ready

Show the core Q-learning update (≈20 lines)

// one off-policy TD(0) / Q-learning step from state s
function qStep(s) {
  const a = (Math.random() < eps) ? randAction()        // explore
                                   : argmaxA(Q[s]);       // exploit (greedy)
  const [s2, r, done] = env(s, a);                        // observe (s,a,r,s')
  // target uses frozen weights when the target net is ON, else live Q
  const Qnext = targetOn ? Qfrozen[s2] : Q[s2];
  const target = r + (done ? 0 : gamma * Math.max(...Qnext));   // bootstrap + max
  const delta  = target - Q[s][a];                              // TD error
  Q[s][a] += alpha * delta;                                     // the update
  return done ? START : s2;
}
// every N steps, refresh the frozen copy (target network)
if (targetOn && updates % N === 0) Qfrozen = clone(Q);

What this branch can't do — and where lesson 03 starts

Every step of Q-learning depends on max_a′ Q(s′,a′) and argmax_a Q(s,a). Computing those means enumerating every action. For a robot whose action is a continuous torque vector, or a recommender choosing among millions of items, you cannot enumerate — the argmax is itself an intractable optimization. Value-based methods also can only ever produce a deterministic greedy policy; they can't represent an optimal policy that is genuinely stochastic.

So we take the other branch of the fork. Instead of learning a value and reading the policy off it, we parameterize the policy directly, π_θ(a|s), and push its parameters toward high return — no max, no argmax, continuous actions welcome. That is the policy gradient, and it leads straight back to the value function as a variance-reducing critic. Next lesson.

Takeaway

Swap the policy-averaging in Bellman expectation for a max and you get Bellman optimality, which you can learn directly by the Q-learning update Q ← Q + α[r + γ max_a′ Q′ − Q] — bootstrapping a guess from a guess while behaving ε-greedy (off-policy). Scale it with a neural net and the bootstrapping + off-policy + function-approximation deadly triad makes it diverge; experience replay (decorrelate) and a frozen target network (stop the moving target) tame it — that, plus the implicit greedy policy, is DQN. The catch that opens lesson 03: argmax_a dies on continuous or huge action spaces.