Value-based — from Q-learning to DQN
Lesson 01 left us with the Bellman expectation equation — which needs a policy plugged in. One small change, swapping the average for a max, removes the policy entirely and lets us learn the optimal value Q* directly. That change is the whole of value-based RL.
The delta from lesson 01
Lesson 01 defined the action-value of a fixed policy π and gave its Bellman expectation equation — value of an action = immediate reward plus the discounted value of whatever the policy does next:
That inner expectation 𝔼a′∼π is the catch: it bakes in a policy you must already have. To evaluate a policy, fine. But our goal (lesson 01) is the optimal policy π*, and we'd like to find it without first guessing one.
Here is the one idea. The optimal policy is greedy: in any state it picks the action with the largest value. So in the equation above, replace "what the policy does next" with "the best action next." The average over a′ becomes a max over a′, and the policy vanishes:
This is the Bellman optimality equation. It is self-referential — Q* appears on both sides — but it no longer mentions any policy. If we can solve it, the optimal policy is just π*(s) = argmaxa Q*(s,a). That last line is the fork from orientation in action: we never learn a policy object. We learn a number, and the policy falls out of it by greedy choice.
From an equation we can't solve to an update we can run
We don't know the transition dynamics or the reward function, so we can't take that expectation 𝔼s′ analytically. But we don't have to. We act in the world and observe one real transition (s, a, r, s′) — one sample of what was inside the expectation. That sample suggests a target for Q(s,a):
Our current estimate Q(s,a) is probably wrong. The gap between the target and the estimate is the temporal-difference (TD) error:
If δ > 0, the action was better than we thought — nudge the estimate up. If δ < 0, nudge it down. The size of the nudge is set by a learning rate α ∈ (0,1]. That is TD(0) learning, and when the target uses the max it is exactly Q-learning:
The phrase "use your own current estimate Q(s′,·) inside the target" is called bootstrapping: you learn a guess from a guess. It is what lets you update after a single step instead of waiting for the episode to end — and, as we'll see, it is also the first ingredient of the bug that ends this lesson.
Off-policy: behave one way, learn another
There's a subtlety hiding in that max. To learn we must visit states and try actions — we must explore. But the target assumes we act greedily next (the max). So the agent behaves one way and learns the value of behaving a different way. Concretely, behavior is ε-greedy:
with probability ε: pick a random action (explore) otherwise: pick argmax_a Q(s,a) (exploit)
…while the target always uses maxa′ Q(s′,a′) — the greedy value, regardless of the random action exploration actually took. Learning a target policy (greedy) different from the behavior policy (ε-greedy) is what off-policy means. It's a feature: you can explore wildly and still converge to Q*, because the max in the target keeps aiming at the optimal policy no matter how you collected the data.
Where the table breaks
So far Q is a lookup table: one cell per (s,a) pair. That's fine for a 5×5 gridworld with 4 actions — 100 cells. It is hopeless for Atari from pixels (a state is an image: more configurations than atoms in the universe) or for a robot's continuous sensors. You will never visit most states even once, and a table can't say anything about a state it hasn't seen.
The fix is function approximation: replace the table with a parameterized function Q(s,a;θ) — a neural network — that generalizes from visited states to similar unseen ones. Train it by gradient descent on the squared TD error, treating the target as a regression label:
This is DQN in spirit — the network that learned to play Atari from pixels. But written naively this loss is unstable, and the reason is deep enough that it has a name.
The deadly triad
Three ingredients, each individually fine, become explosive together:
| Ingredient | What it is | Where we use it |
|---|---|---|
| Bootstrapping | target uses your own estimate Q(s′,·) | the γ max Q term |
| Off-policy | learn the greedy policy while behaving ε-greedy | the max vs ε-greedy behavior |
| Function approximation | a network that ties all states' values together | Q(s,a;θ) |
The trouble: the regression label r + γ maxa′ Q(s′,a′;θ) depends on the very θ you are updating. You chase a moving target. Because the network generalizes, raising Q at s also raises it at the similar state s′ — which raises the target — which raises Q(s) again. The feedback loop can spiral Q to infinity. This is the deadly triad (Sutton & Barto): bootstrapping + off-policy + function approximation can diverge even on tiny problems.
DQN tames it with two devices, both of which simply break the feedback loop:
- Experience replay. Store transitions (s,a,r,s′) in a buffer and train on random minibatches from it. Consecutive samples in RL are highly correlated (you walk through nearby states); shuffling the buffer decorrelates them so the network sees something closer to i.i.d. data, the regime SGD was built for.
- Target network. Keep a frozen copy of the weights, θ−, and compute the target from that: r + γ maxa′ Q(s′,a′;θ−). Now the label stops moving while you fit it. Every N steps you copy θ → θ−. The target is stale, but stable — and stable beats fast here.
Read the change: the target uses the frozen θ−, the prediction uses the live θ. That single decoupling is the difference between a network that learns and one that diverges. The widget below lets you turn it off and watch the triad bite.
Interactive · gridworld Q-learning, with the triad bug on a switch
A 5×5 gridworld. The agent (blue) starts top-left; the goal (green ★) is bottom-right with reward +1; the red pit gives −1; every step costs a small −0.02 to encourage short paths. Each cell shows its greedy value as a shaded patch and the greedy action as an arrow. Press Train and watch the arrows organize into a path to the goal as Q converges.
The Target network toggle is the lesson. With it on, learning is stable. Turn it off and crank α up: to mimic function approximation the demo lets each update "leak" into neighboring cells (generalization), recreating the deadly triad on a table — the values blow up and the policy turns to noise. The bug is the lesson.
What this branch can't do — and where lesson 03 starts
Every step of Q-learning depends on maxa′ Q(s′,a′) and argmaxa Q(s,a). Computing those means enumerating every action. For a robot whose action is a continuous torque vector, or a recommender choosing among millions of items, you cannot enumerate — the argmax is itself an intractable optimization. Value-based methods also can only ever produce a deterministic greedy policy; they can't represent an optimal policy that is genuinely stochastic.
So we take the other branch of the fork. Instead of learning a value and reading the policy off it, we parameterize the policy directly, πθ(a|s), and push its parameters toward high return — no max, no argmax, continuous actions welcome. That is the policy gradient, and it leads straight back to the value function as a variance-reducing critic. Next lesson.