Deep RL — from DQN to A3C

Both prongs of the fork get a neural network. Then they get an army of them: parallel actors that make on-policy learning stable without a replay buffer.

Where we are

The foundations are done. We have the MDP, the two ways to solve it, planning when the model is known, and exploration. Lesson 02 already crossed into deep territory once — DQN = Q-learning + a neural net + experience replay + a target network — and lesson 03 crossed it the other way — Actor–Critic = a policy network π_θ nudged by an advantage from a value network V_φ.

So "deep RL" is not a third idea. It is the same fork — value-based on the left, policy-based on the right — with the tables replaced by θ-parameterized function approximators, and then scaled with parallelism. This lesson does two things: finishes the value branch by naming the three fixes that turned DQN from a demo into a workhorse, then follows the policy branch as it discovers a completely different way to get stable gradients — not a buffer, but a crowd of workers.

[Figure: the fork. VALUE-BASED · DQN (L02): Q_θ, replay buffer, target net; + Double · Dueling · Prioritized (three patches → Rainbow). POLICY-BASED · Actor–Critic (L03): π_θ, critic V_φ, advantage A; + Parallel actors → A3C / A2C (decorrelation replaces replay).]

The DQN family, one line each

Plain DQN works, but three flaws show up the moment you push it. Each has a famous one-patch fix; together they (and a few others) make up "Rainbow," the strong baseline.

1 · Double DQN — fixing max-overestimation

DQN's target is y = r + γ · max_a' Q_θ⁻(s', a'). The trouble is the max. The network's Q estimates are noisy, and taking the max of noisy numbers systematically picks the ones that happen to be overestimated — so the target is biased high, and the bias compounds through bootstrapping. Double DQN breaks the self-reinforcing loop by splitting the two roles of that max: use the online net to choose the action, the target net to evaluate it.

y = r + γ · Q_θ⁻( s', argmax_a' Q_θ(s', a') )

Selection and evaluation no longer share the same noise, so a spuriously high value can be picked but won't also score itself. The overestimation largely cancels. (A close cousin of this trick, a second estimator that tames the optimistic max, returns for continuous control as TD3 in lesson 15, where the target takes the minimum of two critics.)
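The two targets differ in a single line, which is easiest to see side by side. This is a minimal sketch, not a full agent: the function names and the plain arrays standing in for the two networks' outputs at s' are assumptions made for illustration.

```javascript
// Index of the largest entry (first one wins on ties).
function argmax(xs) {
  let best = 0;
  for (let i = 1; i < xs.length; i++) if (xs[i] > xs[best]) best = i;
  return best;
}

// Plain DQN: the target net both selects and evaluates the next action.
function dqnTarget(r, gamma, qTargetNext, done = false) {
  if (done) return r;                        // no bootstrap past a terminal state
  return r + gamma * Math.max(...qTargetNext);
}

// Double DQN: the online net selects, the target net evaluates.
function doubleDqnTarget(r, gamma, qOnlineNext, qTargetNext, done = false) {
  if (done) return r;
  const aStar = argmax(qOnlineNext);         // selection: online net
  return r + gamma * qTargetNext[aStar];     // evaluation: target net
}

// Toy numbers: the target net happens to overrate action 2.
const qOnlineNext = [1.0, 2.5, 2.0];
const qTargetNext = [1.0, 2.0, 3.0];
console.log(dqnTarget(1, 0.5, qTargetNext));                     // 1 + 0.5 * 3.0 = 2.5
console.log(doubleDqnTarget(1, 0.5, qOnlineNext, qTargetNext));  // 1 + 0.5 * 2.0 = 2
```

In the toy numbers the plain target inherits the target net's inflated 3.0, while the Double target only uses the value of the action the online net would actually pick.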

2 · Dueling DQN — separating V and A

In many states the choice of action barely matters — the floor is on fire no matter what you do. Forcing one head to learn a separate Q(s,a) for every action wastes capacity. Dueling splits the network into a state-value stream V_θ(s) and an advantage stream A_θ(s,a), then recombines them — using the same identity Q = V + A from lesson 03, just baked into the architecture:

Q_θ(s,a) = V_θ(s) + ( A_θ(s,a) − mean_a' A_θ(s,a') )

The mean-subtraction is just identifiability — without it V and A could drift by an offset that cancels. The payoff: V(s) is learned from every transition through that state, regardless of action, so good states get evaluated fast.
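A small sketch of the recombination step makes the identifiability point concrete. The name `duelingQ` and the scalar/array inputs (standing in for the two streams' outputs for one state) are assumptions for illustration.

```javascript
// Combine a state value and per-action advantages into Q-values:
// Q(s,a) = V(s) + (A(s,a) - mean over actions of A(s,.)).
function duelingQ(v, adv) {
  const meanAdv = adv.reduce((sum, x) => sum + x, 0) / adv.length;
  return adv.map(x => v + (x - meanAdv));
}

console.log(duelingQ(2, [1, 0, -1]));    // [3, 2, 1]
// Shifting every advantage by the same offset changes nothing:
console.log(duelingQ(2, [11, 10, 9]));   // [3, 2, 1]
```

Because the mean is subtracted, a constant offset in the advantage stream cannot leak into Q, and the average of the returned Q-values is exactly V(s).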

3 · Prioritized replay — sampling surprising transitions

Uniform replay (lesson 02) treats every stored transition as equally worth re-learning. But a transition the network already predicts perfectly teaches nothing. Prioritized experience replay samples a transition with probability proportional to its surprise — its TD-error magnitude |δ|, the same δ = r + γ Q' − Q we have been using:

P(i) ∝ |δ_i|^α ( α = 0 → uniform; α > 0 → focus on the surprising )

Because this changes the sampling distribution, you must correct the bias with importance-sampling weights — a small foreshadow of lesson 09, where the ratio ρ becomes a central character. Net effect: the agent spends its compute re-learning the transitions that still hurt.
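The sampling rule and its correction can be sketched in a few lines. Several things here are assumptions not stated in the text: the function names, the small `eps` that keeps zero-error transitions sampleable, and the exponent `beta` with max-normalization, which follows the usual form of the importance weights, w_i = (N · P(i))^(−β). A production buffer would also use a sum-tree rather than this linear scan.

```javascript
// P(i) proportional to |delta_i|^alpha. alpha = 0 gives uniform sampling.
function priorities(tdErrors, alpha, eps = 1e-6) {
  const p = tdErrors.map(d => Math.pow(Math.abs(d) + eps, alpha));
  const total = p.reduce((s, x) => s + x, 0);
  return p.map(x => x / total);
}

// Draw one index from a discrete distribution; u is uniform in [0, 1).
function sampleIndex(probs, u = Math.random()) {
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u < acc) return i;
  }
  return probs.length - 1;   // guard against floating-point round-off
}

// Importance-sampling weights (N * P(i))^(-beta), scaled so the largest is 1.
// Rarely sampled (low-priority) transitions get the largest weight.
function isWeights(probs, beta) {
  const n = probs.length;
  const w = probs.map(p => Math.pow(n * p, -beta));
  const wMax = Math.max(...w);
  return w.map(x => x / wMax);
}

const probs = priorities([0.1, 0.1, 2.0, 0.5], 1);
const i = sampleIndex(probs);          // index 2 is drawn most often
const weights = isWeights(probs, 1);   // index 2 gets the smallest weight
console.log(probs, i, weights);
```

Each sampled transition's loss would then be multiplied by its weight, so over-sampled surprising transitions do not dominate the expected update.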

The pattern under all three

Each is one surgical patch on a single failure of plain DQN: Double fixes the biased max, Dueling fixes wasted capacity, Prioritized fixes wasted samples. Stack them (plus multi-step, distributional, noisy nets) and you get Rainbow. The value branch is now industrial-strength — but every one of these still leans on a replay buffer to break correlation.

The policy branch can't use a replay buffer

Why does DQN need replay at all? Because consecutive transitions in one episode are highly correlated — frame t looks almost exactly like frame t+1. Feeding a network a long correlated stream is like training on the same near-duplicate minibatch over and over: gradients point the same way for a while, the net overfits to that stretch, then the stream shifts and it lurches. Replay fixes this by storing transitions and sampling them uniformly at random, shuffling the correlation away.

But replay is fundamentally off-policy: it replays actions taken by an older version of the policy. Q-learning doesn't mind — its bootstrap target uses max_a' Q, which doesn't depend on which policy collected the data. Policy gradient is different. The policy-gradient estimate from lesson 03,

∇_θ J(θ) = 𝔼_{π_θ} [ A(s,a) · ∇_θ log π_θ(a|s) ],

is an expectation under the current policy π_θ. The moment θ updates, every transition in the buffer is stale — it was sampled from the wrong distribution, so reusing it biases the gradient. (Reusing it correctly requires importance reweighting; that is exactly lesson 09's job, and it has its own costs.) So Actor–Critic seems stuck: it needs decorrelated data, but it can't store and shuffle, because stored data goes off-policy instantly.

A3C — decorrelate in space, not in time

The escape, due to Mnih et al. (2016), is almost obvious in hindsight. If you can't decorrelate by shuffling across time (a buffer), decorrelate across space: run many actors in parallel, each with its own copy of the policy and its own independent environment, exploring different states at the same moment. At any instant the batch of transitions arriving from N workers is naturally diverse — because the workers are in genuinely different situations — so the gradient averaged over them is decorrelated without ever storing anything. The data is fresh, hence on-policy, hence the advantage from lesson 03 applies directly.

This is A3C — Asynchronous Advantage Actor–Critic. Unpack the name backward:

Actor–Critic: straight from lesson 03. An actor π_θ is updated by A · ∇ log π; a critic V_φ supplies the baseline.
Advantage: each worker computes an n-step advantage A ≈ ( Σ_{k=0}^{n−1} γ^k r_{t+k} ) + γⁿ V_φ(s_{t+n}) − V_φ(s_t) from its own short rollout.
Asynchronous: each worker computes a gradient on its own rollout and pushes it to a shared set of parameters without waiting for the others. The shared parameters are the only thing the workers have in common; their environments and random seeds are independent.

         shared θ, φ  (the global net)
        ┌──────────────────────────────┐
        │   ← grad   ← grad   ← grad     │
        └──────────────────────────────┘
            ▲           ▲           ▲
        worker 1    worker 2    worker N      ← each: own env, own seed
        env₁ ≠      env₂ ≠      env_N            roll out n steps,
        rollout     rollout     rollout          compute A·∇logπ,
                                                  push gradient, re-sync θ
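The n-step advantage each worker computes can be sketched as a backward pass over its rollout. The function names and argument layout are assumptions for illustration; the critic's values are passed in as plain numbers rather than computed by a network.

```javascript
// n-step advantage for the first state of a rollout.
// rewards = [r_t, ..., r_{t+n-1}], vStart = V(s_t), vEnd = V(s_{t+n})
// (pass vEnd = 0 if the rollout ended in a terminal state).
function nStepAdvantage(rewards, vStart, vEnd, gamma) {
  let ret = vEnd;
  for (let k = rewards.length - 1; k >= 0; k--) ret = rewards[k] + gamma * ret;
  return ret - vStart;   // discounted rewards + gamma^n * V(s_{t+n}) - V(s_t)
}

// Same backward pass, but keeping one advantage per step of the rollout:
// step i bootstraps from however many steps remain after it.
function rolloutAdvantages(rewards, values, vEnd, gamma) {
  const adv = new Array(rewards.length);
  let ret = vEnd;
  for (let i = rewards.length - 1; i >= 0; i--) {
    ret = rewards[i] + gamma * ret;
    adv[i] = ret - values[i];
  }
  return adv;
}

console.log(nStepAdvantage([1, 1, 1], 2, 4, 0.5));               // 0.25
console.log(rolloutAdvantages([1, 1, 1], [2, 2, 2], 4, 0.5));    // [0.25, 0.5, 1]
```

Working through the first call by hand: 1 + 0.5 + 0.25 + 0.125 · 4 = 2.25, minus V(s_t) = 2, gives 0.25.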

A2C is the synchronous sibling: a controller waits for all workers, averages their gradients into one batch, then steps once. It turns out the asynchrony in A3C was never the point — the parallelism was. A2C gets the same decorrelation, is simpler, and on a GPU is usually faster. Either way the lesson is the same: N independent actors are the on-policy replacement for the replay buffer.
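The difference between the two update schemes fits in two small functions. This is a sketch under stated assumptions: gradients are plain states × actions arrays like `theta` in the widget code, the names `a2cStep` and `a3cApply` are hypothetical, and the asynchronous case is only simulated as "apply one gradient when it arrives" rather than run on real threads.

```javascript
// Element-wise mean of N gradients, each shaped like theta (states x actions).
function averageGradients(workerGrads) {
  const n = workerGrads.length;
  return workerGrads[0].map((row, s) =>
    row.map((_, i) => workerGrads.reduce((sum, g) => sum + g[s][i], 0) / n));
}

// A2C-style: wait for every worker, average, take one gradient-ascent step.
function a2cStep(theta, workerGrads, eta) {
  const g = averageGradients(workerGrads);
  return theta.map((row, s) => row.map((x, i) => x + eta * g[s][i]));
}

// A3C-style: apply a single worker's gradient as soon as it arrives.
// In a real run that gradient may have been computed against slightly
// older parameters than the ones it is applied to.
function a3cApply(theta, grad, eta) {
  return theta.map((row, s) => row.map((x, i) => x + eta * grad[s][i]));
}

const theta0 = [[0, 0]];                       // one state, two actions
const grads = [[[1, -1]], [[3, -3]]];          // gradients from two workers
console.log(a2cStep(theta0, grads, 0.5));      // [[1, -1]]
console.log(a3cApply(theta0, grads[0], 0.5));  // [[0.5, -0.5]]
```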

The catch that drives the next five lessons

A3C/A2C are on-policy: the instant θ updates, the just-collected batch is spent and must be thrown away. There is no buffer to mine. That makes on-policy policy gradient sample-hungry — it burns fresh environment interaction every step. Two pressures follow: (1) squeeze more signal out of each batch → better advantage estimation (lessons 07–08), and (2) safely reuse old data → importance sampling and trust regions (lessons 09–11). The whole "rigorous PG → GAE → IS → TRPO → PPO" arc exists to pay down this bill.

Interactive · workers vs. correlation

The widget below runs N parallel actors on a tiny looping environment. Each step, every worker advances in its own episode and emits a transition; we batch them and take one Actor–Critic gradient step on a shared policy. The KPI to watch is batch correlation: how similar the transitions in one update are to each other.

Slide #workers up and correlation falls — diverse workers, diverse states, a clean gradient, and the learning curve climbs smoothly. Drop to 1 worker and you are back to a single correlated stream: correlation pins near 1, the gradient jerks, and the reward curve thrashes instead of rising. Then flip identical seeds on: now all N workers march in lockstep through the same states — so even with many workers the batch is fully correlated and learning degrades just as badly. More workers only help if they are independent.

N parallel actors filling a shared gradient batch

Each frame: every worker takes a step in its own env, we form one batch, take one A2C-style update on the shared policy. Watch batch correlation drop as workers are added — and watch it pin back to ~1 (and learning fall apart) with 1 worker or with identical seeds.

[Widget controls: a #workers slider (set to 8) and an "identical seeds" toggle (kills decorrelation). Readouts: updates, batch correlation, avg reward (EMA), gradient jitter.]

The core JS (≈25 lines)

// N workers, each a position in its own looping env. Reward peaks at the goal cell.
// theta, V, GAMMA, ETA, S, A and the helpers softmax/sample/step/reward/zeros
// are assumed to be defined elsewhere in the widget.
function collectBatch(workers) {
  return workers.map(w => {
    const s  = w.s;                      // state the action is taken in
    const pi = softmax(theta[s]);        // shared policy θ, indexed by state
    const a  = sample(pi);
    const sNext = step(s, a);            // env transition (per-worker)
    const r  = reward(sNext);
    const adv = r + GAMMA * V[sNext] - V[s];   // 1-step advantage (critic V)
    w.s = sNext;                         // advance the worker
    return { s, a, adv, pi };            // record the pre-step state, not sNext
  });
}
// Decorrelation = how spread out the batch's states are. One worker => one
// correlated stream => corr≈1 => nothing to average over, so the gradient stays noisy.
function update(batch) {
  const grad = zeros();                  // S × A accumulator
  for (const t of batch)                 // average A·∇logπ over the batch
    for (let i = 0; i < A; i++)
      grad[t.s][i] += t.adv * ((i === t.a ? 1 : 0) - t.pi[i]) / batch.length;
  for (let s = 0; s < S; s++)            // gradient ascent on the shared θ
    for (let i = 0; i < A; i++) theta[s][i] += ETA * grad[s][i];
}
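The update above relies on the fact that, for a softmax policy over per-state logits, the gradient of log π(a) with respect to the logits is onehot(a) − π. The sketch below checks that identity numerically. It is not the widget's own helper code: `softmax` here is a stand-in written for this check, and `gradLogPi` and `numericGradLogPi` are hypothetical names.

```javascript
function softmax(logits) {
  const m = Math.max(...logits);                 // subtract the max for stability
  const e = logits.map(x => Math.exp(x - m));
  const z = e.reduce((s, x) => s + x, 0);
  return e.map(x => x / z);
}

// Analytic gradient of log pi(a) with respect to the logits: onehot(a) - pi.
function gradLogPi(logits, a) {
  return softmax(logits).map((p, i) => (i === a ? 1 : 0) - p);
}

// Central finite difference of log softmax(logits)[a], for comparison.
function numericGradLogPi(logits, a, h = 1e-5) {
  return logits.map((_, i) => {
    const up = logits.slice();
    const down = logits.slice();
    up[i] += h;
    down[i] -= h;
    return (Math.log(softmax(up)[a]) - Math.log(softmax(down)[a])) / (2 * h);
  });
}

const logits = [0.2, -1.0, 1.5];
console.log(gradLogPi(logits, 2));         // analytic
console.log(numericGradLogPi(logits, 2));  // should agree to several decimals
```

The analytic gradient always sums to zero across actions: raising one action's log-probability must lower the others'.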

Reading the widget

Many independent workers (8–16): correlation settles low, gradient jitter is small, the reward EMA rises steadily. This is the A3C/A2C regime — parallelism standing in for the replay buffer.
One worker: correlation pins near 1.0. Every update sees nearly the same transition the last one did, so the gradient pushes hard in one direction, overshoots, and the reward curve oscillates instead of climbing. This is the failure replay was invented to prevent — and the failure parallel actors prevent differently.
Identical seeds, many workers: the population count is high but the diversity is zero — every worker is in the same state every step. Correlation stays near 1.0 and learning degrades just like the single-worker case. The takeaway: it is independence, not headcount, that buys decorrelation.

The bug is the lesson

Sixteen workers that all share a seed are no better than one worker. "Scaling" RL is not "add more machines" — it is "add more independent experience." A cluster of perfectly synchronized actors collecting identical trajectories is an expensive way to run a single correlated stream.
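The shared-seed failure is easy to reproduce outside the widget. The sketch below is an illustration under assumptions, not the widget's code: it uses mulberry32, a small seeded PRNG (JavaScript's `Math.random` cannot be seeded), and a hypothetical random walk on a ring of cells in place of the real environment.

```javascript
// mulberry32: a compact seeded PRNG returning numbers in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Each worker random-walks on a ring of numStates cells, driven by its own PRNG.
// Returns how many distinct cells the workers occupy after `steps` steps.
function distinctStates(seeds, numStates = 12, steps = 50) {
  const rngs = seeds.map(s => mulberry32(s));
  const states = seeds.map(() => 0);               // every worker starts in cell 0
  for (let t = 0; t < steps; t++) {
    for (let w = 0; w < states.length; w++) {
      const move = rngs[w]() < 0.5 ? -1 : 1;
      states[w] = (states[w] + move + numStates) % numStates;
    }
  }
  return new Set(states).size;
}

// Identical seeds: eight workers, one trajectory, so exactly 1 distinct state.
console.log(distinctStates([7, 7, 7, 7, 7, 7, 7, 7]));
// Different seeds: the workers usually spread over several cells.
console.log(distinctStates([1, 2, 3, 4, 5, 6, 7, 8]));
```

With identical seeds the batch contains eight copies of the same transition, so averaging over workers averages nothing.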

How this sits on the fork

We have now scaled both prongs. The value prong (DQN + Double/Dueling/Prioritized) leans on a replay buffer and stays off-policy. The policy prong (Actor–Critic → A3C/A2C) refuses the buffer, decorrelates with parallel actors, and stays on-policy — paying for it in sample hunger. From here the course follows the policy prong almost exclusively, because that is the lineage that leads to TRPO, PPO, and the LLM era. The first debt to pay: we have used ∇J = 𝔼[A · ∇log π] three times now without proving it. Next lesson, we earn it.

Takeaway

Deep RL = the same value/policy fork with neural nets, then scaled. The value branch hardens DQN with three one-line patches — Double (kill the optimistic max), Dueling (split Q = V + A), Prioritized (replay by surprise |δ|) — but stays off-policy on a replay buffer. The policy branch can't use that buffer (PG is on-policy), so A3C/A2C decorrelate with N independent parallel actors instead — which works only when the actors are genuinely independent, and which makes on-policy PG sample-hungry, setting up the rigor and reuse of lessons 07–11.