rl_foundations / lessons / 02 · value-based lesson 02 / 32

Value-based — from Q-learning to DQN

Lesson 01 left us with the Bellman expectation equation — which needs a policy plugged in. One small change, swapping the average for a max, removes the policy entirely and lets us learn the optimal value Q* directly. That change is the whole of value-based RL.

The delta from lesson 01

Lesson 01 defined the action-value of a fixed policy π and gave its Bellman expectation equation — value of an action = immediate reward plus the discounted value of whatever the policy does next:

Qπ(s,a) = 𝔼s′ [ R(s,a) + γ · 𝔼a′∼π Qπ(s′,a′) ]

That inner expectation 𝔼a′∼π is the catch: it bakes in a policy you must already have. To evaluate a policy, fine. But our goal (lesson 01) is the optimal policy π*, and we'd like to find it without first guessing one.

Here is the one idea. The optimal policy is greedy: in any state it picks the action with the largest value. So in the equation above, replace "what the policy does next" with "the best action next." The average over a′ becomes a max over a′, and the policy vanishes:

Q*(s,a) = 𝔼s′ [ R(s,a) + γ · maxa′ Q*(s′,a′) ]

This is the Bellman optimality equation. It is self-referential — Q* appears on both sides — but it no longer mentions any policy. If we can solve it, the optimal policy is just π*(s) = argmaxa Q*(s,a). That last line is the fork from orientation in action: we never learn a policy object. We learn a number, and the policy falls out of it by greedy choice.

Expectation vs optimality, in one line
Expectation (lesson 01): "how good is acting under this policy?" — uses 𝔼a′∼π. Optimality (this lesson): "how good is acting optimally?" — uses maxa′. The max is the entire move.

From an equation we can't solve to an update we can run

We don't know the transition dynamics or the reward function, so we can't take that expectation 𝔼s′ analytically. But we don't have to. We act in the world and observe one real transition (s, a, r, s′) — one sample of what was inside the expectation. That sample suggests a target for Q(s,a):

target  =   r + γ · maxa′ Q(s′,a′)

Our current estimate Q(s,a) is probably wrong. The gap between the target and the estimate is the temporal-difference (TD) error:

δ  =  r + γ · maxa′ Q(s′,a′)  −  Q(s,a)

If δ > 0, the action was better than we thought — nudge the estimate up. If δ < 0, nudge it down. The size of the nudge is set by a learning rate α ∈ (0,1]. That is TD(0) learning, and when the target uses the max it is exactly Q-learning:

Q(s,a)  ←  Q(s,a) + α [ r + γ · maxa′ Q(s′,a′) − Q(s,a) ]

The phrase "use your own current estimate Q(s′,·) inside the target" is called bootstrapping: you learn a guess from a guess. It is what lets you update after a single step instead of waiting for the episode to end — and, as we'll see, it is also the first ingredient of the bug that ends this lesson.

Off-policy: behave one way, learn another

There's a subtlety hiding in that max. To learn we must visit states and try actions — we must explore. But the target assumes we act greedily next (the max). So the agent behaves one way and learns the value of behaving a different way. Concretely, behavior is ε-greedy:

with probability ε:   pick a random action          (explore)
otherwise:            pick argmax_a Q(s,a)           (exploit)

…while the target always uses maxa′ Q(s′,a′) — the greedy value, regardless of the random action exploration actually took. Learning a target policy (greedy) different from the behavior policy (ε-greedy) is what off-policy means. It's a feature: you can explore wildly and still converge to Q*, because the max in the target keeps aiming at the optimal policy no matter how you collected the data.

Why this connects to the fork
Q-learning never represents a policy. The greedy rule argmaxa Q(s,a) is the implicit policy read off the value at the moment you need to act. Hold this thought: lesson 03 is what happens when that argmax becomes impossible to compute, forcing us to store a policy explicitly.

Where the table breaks

So far Q is a lookup table: one cell per (s,a) pair. That's fine for a 5×5 gridworld with 4 actions — 100 cells. It is hopeless for Atari from pixels (a state is an image: more configurations than atoms in the universe) or for a robot's continuous sensors. You will never visit most states even once, and a table can't say anything about a state it hasn't seen.

The fix is function approximation: replace the table with a parameterized function Q(s,a;θ) — a neural network — that generalizes from visited states to similar unseen ones. Train it by gradient descent on the squared TD error, treating the target as a regression label:

L(θ) = 𝔼 [ ( r + γ · maxa′ Q(s′,a′;θ) − Q(s,a;θ) )2 ]

This is DQN in spirit — the network that learned to play Atari from pixels. But written naively this loss is unstable, and the reason is deep enough that it has a name.

The deadly triad

Three ingredients, each individually fine, become explosive together:

IngredientWhat it isWhere we use it
Bootstrappingtarget uses your own estimate Q(s′,·)the γ max Q term
Off-policylearn the greedy policy while behaving ε-greedythe max vs ε-greedy behavior
Function approximationa network that ties all states' values togetherQ(s,a;θ)

The trouble: the regression label r + γ maxa′ Q(s′,a′;θ) depends on the very θ you are updating. You chase a moving target. Because the network generalizes, raising Q at s also raises it at the similar state s′ — which raises the target — which raises Q(s) again. The feedback loop can spiral Q to infinity. This is the deadly triad (Sutton & Barto): bootstrapping + off-policy + function approximation can diverge even on tiny problems.

DQN tames it with two devices, both of which simply break the feedback loop:

L(θ) = 𝔼(s,a,r,s′)∼buffer [ ( r + γ · maxa′ Q(s′,a′;θ) − Q(s,a;θ) )2 ]

Read the change: the target uses the frozen θ, the prediction uses the live θ. That single decoupling is the difference between a network that learns and one that diverges. The widget below lets you turn it off and watch the triad bite.

Interactive · gridworld Q-learning, with the triad bug on a switch

A 5×5 gridworld. The agent (blue) starts top-left; the goal (green ★) is bottom-right with reward +1; the red pit gives −1; every step costs a small −0.02 to encourage short paths. Each cell shows its greedy value as a shaded patch and the greedy action as an arrow. Press Train and watch the arrows organize into a path to the goal as Q converges.

The Target network toggle is the lesson. With it on, learning is stable. Turn it off and crank α up: to mimic function approximation the demo lets each update "leak" into neighboring cells (generalization), recreating the deadly triad on a table — the values blow up and the policy turns to noise. The bug is the lesson.

Gridworld Q-learning — values, greedy arrows, and the deadly triad
Train to watch Q converge and arrows point home. Toggle the target network OFF (and raise α) to make the bootstrapped, generalizing, off-policy update diverge.
Updates
0
Max |Q|
0.00
Greedy success
Status
ready
Show the core Q-learning update (≈20 lines)
// one off-policy TD(0) / Q-learning step from state s
function qStep(s) {
  const a = (Math.random() < eps) ? randAction()        // explore
                                   : argmaxA(Q[s]);       // exploit (greedy)
  const [s2, r, done] = env(s, a);                        // observe (s,a,r,s')
  // target uses frozen weights when the target net is ON, else live Q
  const Qnext = targetOn ? Qfrozen[s2] : Q[s2];
  const target = r + (done ? 0 : gamma * Math.max(...Qnext));   // bootstrap + max
  const delta  = target - Q[s][a];                              // TD error
  Q[s][a] += alpha * delta;                                     // the update
  return done ? START : s2;
}
// every N steps, refresh the frozen copy (target network)
if (targetOn && updates % N === 0) Qfrozen = clone(Q);

What this branch can't do — and where lesson 03 starts

Every step of Q-learning depends on maxa′ Q(s′,a′) and argmaxa Q(s,a). Computing those means enumerating every action. For a robot whose action is a continuous torque vector, or a recommender choosing among millions of items, you cannot enumerate — the argmax is itself an intractable optimization. Value-based methods also can only ever produce a deterministic greedy policy; they can't represent an optimal policy that is genuinely stochastic.

So we take the other branch of the fork. Instead of learning a value and reading the policy off it, we parameterize the policy directly, πθ(a|s), and push its parameters toward high return — no max, no argmax, continuous actions welcome. That is the policy gradient, and it leads straight back to the value function as a variance-reducing critic. Next lesson.

Takeaway
Swap the policy-averaging in Bellman expectation for a max and you get Bellman optimality, which you can learn directly by the Q-learning update Q ← Q + α[r + γ maxa′ Q′ − Q] — bootstrapping a guess from a guess while behaving ε-greedy (off-policy). Scale it with a neural net and the bootstrapping + off-policy + function-approximation deadly triad makes it diverge; experience replay (decorrelate) and a frozen target network (stop the moving target) tame it — that, plus the implicit greedy policy, is DQN. The catch that opens lesson 03: argmaxa dies on continuous or huge action spaces.