Frontier — discrete to continuous control

A robot's action is a torque vector in ℝ^d. Computing argmax_a by sweeping over infinitely many actions is impossible, so we make the actor output the action and let the critic's gradient teach it. DDPG → TD3 → SAC.

What broke

Two of our pillars quietly assumed a finite, enumerable action set.

Value-based (lesson 02). Q-learning acts greedily: a* = argmax_a Q(s,a). To compute that argmax you sweep every action and pick the best.
Policy-based (lesson 03). The softmax policy π(a|s) = softmax(θ_s,a) needs one logit per action to normalize over.

Now point a robot arm at a cup. The action is a vector of joint torques a ∈ ℝ^d — continuous, uncountable. There is no list to sweep for the argmax, and no finite set of logits to softmax. You cannot enumerate ℝ^d.
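To make the wall concrete, here is a minimal sketch contrasting the discrete greedy sweep with the obvious workaround of gridding each torque axis. The function names and the grid resolution are illustrative assumptions, not code from the earlier lessons.

```javascript
// Discrete case: the greedy action is found by sweeping a finite list of Q-values.
function greedyDiscrete(qValues) {
  let best = 0;
  for (let a = 1; a < qValues.length; a++) {
    if (qValues[a] > qValues[best]) best = a;
  }
  return best;
}

// Naive continuous workaround: k grid points on each of d torque axes.
// The number of candidate actions to sweep is k^d.
function gridSize(k, d) {
  let n = 1;
  for (let i = 0; i < d; i++) n *= k;
  return n;
}

console.log(greedyDiscrete([0.1, 0.7, 0.3])); // 1
console.log(gridSize(11, 1));                 // 11
console.log(gridSize(11, 7));                 // 19487171
```

Even a coarse 11-point grid on a 7-joint arm means millions of critic evaluations per decision, and the grid still misses every action between its points.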

The continuous-control problem

Both branches of the fork (lesson 00) hit a wall at the same place: choosing the best continuous action. We need a way to pick a without ever enumerating the action space — and a way to make that choice differentiable so we can learn it by gradient.

The fix: let the actor output the action

Instead of searching for argmax_a Q(s,a), train a network μ_φ(s) — a deterministic actor — that emits the action directly:

a = μ_φ(s) ∈ ℝ^d

If μ is doing its job, μ_φ(s) ≈ argmax_a Q(s,a) — but it produces the answer in one forward pass, no search. How do we train μ to track the argmax? We have a critic Q_θ(s,a) (lesson 02). We want the action that makes Q large, so we push φ in the direction that increases Q(s, μ_φ(s)). By the chain rule, the gradient flows through the critic into the actor:

∇_φ J(φ) = 𝔼_s [ ∇_a Q_θ(s,a) |_{a=μ_φ(s)} · ∇_φ μ_φ(s) ]

This is the Deterministic Policy Gradient (DPG). Read it: "ask the critic which way to nudge the action to raise Q (∇_aQ), then move the actor's weights to produce that nudge (∇_φμ)." The critic is differentiable in a, so it tells the actor exactly which direction is "uphill" — replacing the discrete argmax with a gradient step.
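The chain rule above can be checked on a toy problem. The sketch below assumes a hand-written critic with a known peak at a = 2s and a one-parameter linear actor μ_w(s) = w·s; both are hypothetical stand-ins chosen so the answer (w = 2) is known in advance.

```javascript
// Hypothetical toy critic with its peak at a = 2s: Q(s, a) = -(a - 2s)^2.
const dQda = (s, a) => -2 * (a - 2 * s);   // critic's slope in the action
// Linear deterministic actor mu_w(s) = w * s, so d(mu)/dw = s.
const mu = (w, s) => w * s;

function dpgStep(w, states, lr) {
  let grad = 0;
  for (const s of states) {
    // chain rule: dQ/da evaluated at a = mu_w(s), times d(mu)/dw
    grad += dQda(s, mu(w, s)) * s;
  }
  return w + lr * grad / states.length;    // gradient ascent on Q(s, mu_w(s))
}

let w = 0;
const states = [-1, -0.5, 0.5, 1];
for (let i = 0; i < 200; i++) w = dpgStep(w, states, 0.1);
console.log(w.toFixed(4)); // 2.0000
```

The actor never searches over actions; it only follows the critic's slope, and it still lands on the argmax of this toy Q.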

Why this is still the fork (lesson 00)

This is pure actor–critic. The critic Q_θ learns "how good?" (value branch, lesson 02). The actor μ_φ learns "what to do?" (policy branch, lesson 03). The new twist vs the stochastic policy gradient (∇log π · A, lesson 03/07): here the gradient passes through the critic, so there is no high-variance score-function estimator — the actor learns from the critic's slope directly.

DDPG = DPG + the deep-RL toolkit

Deep Deterministic Policy Gradient (DDPG) takes DPG and bolts on the same stabilizers that let DQN work (lesson 02/06). It is off-policy actor–critic:

Piece	What it does	From
Replay buffer	Store (s,a,r,s') transitions; train on random minibatches to break temporal correlation.	lesson 02 (DQN)
Target networks	Slow copies μ_φ', Q_θ' (Polyak: θ' ← τθ + (1−τ)θ') give a stable bootstrap target.	lesson 02
Exploration noise	μ is deterministic, so to explore we act with a = μ_φ(s) + 𝒩(0,σ).	new — needed because μ has no randomness
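The first two rows of the table can be sketched in a few lines. This is a minimal illustration, assuming parameters are stored as flat arrays of numbers; the class and function names are made up for this example.

```javascript
// Fixed-capacity replay buffer; overwrites the oldest transition when full.
class ReplayBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.data = [];
    this.next = 0;                          // slot the next push writes to
  }
  push(transition) {                        // transition = { s, a, r, sNext }
    if (this.data.length < this.capacity) this.data.push(transition);
    else this.data[this.next] = transition;
    this.next = (this.next + 1) % this.capacity;
  }
  sample(batchSize, rand = Math.random) {   // uniform, with replacement
    const batch = [];
    for (let i = 0; i < batchSize; i++) {
      batch.push(this.data[Math.floor(rand() * this.data.length)]);
    }
    return batch;
  }
}

// Polyak averaging: theta' <- tau * theta + (1 - tau) * theta', elementwise, in place.
function softUpdate(target, online, tau) {
  for (let i = 0; i < target.length; i++) {
    target[i] = tau * online[i] + (1 - tau) * target[i];
  }
}
```

With a small τ (an assumption here; the text does not fix a value) the target parameters trail the online ones slowly, which is what keeps the bootstrap target from chasing its own updates.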

The critic is trained by the same Bellman/TD target as Q-learning, except the next action comes from the target actor rather than an argmax:

y = r + γ · Q_θ'( s', μ_φ'(s') )
loss = ( Q_θ(s,a) − y )²
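As a sketch of that target and loss, assume the actor and critics are plain functions on scalar states and actions (hypothetical signatures, chosen only to keep the example short):

```javascript
// Bellman target for one transition, using the *target* actor and critic.
function tdTarget(r, sNext, gamma, targetActor, targetCritic) {
  return r + gamma * targetCritic(sNext, targetActor(sNext));
}

// Mean squared TD error over a minibatch of { s, a, r, sNext } transitions.
function criticLoss(batch, gamma, critic, targetActor, targetCritic) {
  let sum = 0;
  for (const t of batch) {
    const y = tdTarget(t.r, t.sNext, gamma, targetActor, targetCritic);
    sum += (critic(t.s, t.a) - y) ** 2;
  }
  return sum / batch.length;
}

// Made-up target networks for a quick check.
const targetActor = (s) => -0.5 * s;
const targetCritic = (s, a) => -(s * s) - a * a;
console.log(tdTarget(-1, 2, 0.9, targetActor, targetCritic)); // -5.5
```

Note that no argmax appears anywhere: the target actor supplies the next action in a single call.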

The exploration knob is load-bearing

Set σ = 0 and the agent acts exactly as its current (wrong) actor says. With no noise the replay buffer only ever holds the actor's own choices, so the agent never tries the alternative actions it needs to discover better ones, and learning stalls at whatever the initial actor happened to favor. Zero exploration noise is a bug-is-the-lesson setting in the widget below.
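A minimal sketch of the behaviour policy makes the σ = 0 failure visible. It assumes a scalar action clipped to [−1, 1], as in the widget, and uses the Box-Muller transform for Gaussian noise; the helper names are illustrative.

```javascript
// Standard normal sample via the Box-Muller transform.
function randn(rand = Math.random) {
  const u = 1 - rand();                     // in (0, 1], avoids log(0)
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}
const clip = (x, lo, hi) => Math.min(hi, Math.max(lo, x));

// Behaviour action: deterministic actor output plus Gaussian noise of std sigma.
function explore(actor, s, sigma, rand = Math.random) {
  return clip(actor(s) + sigma * randn(rand), -1, 1);
}

const actor = (s) => -0.5 * s;              // made-up actor for the demo
console.log(explore(actor, 0.4, 0));        // -0.2 every time: nothing new is ever tried
console.log(explore(actor, 0.4, 0.25));     // varies from call to call
```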

TD3 — three fixes for the overestimation we already met

DDPG inherits a disease from Q-learning: Q-value overestimation. We met it in lesson 02/06: the max in the bootstrap target systematically picks up positive noise, so the estimate creeps above the truth. In DDPG the actor plays the role of that max, so it learns to exploit those phantom high values. TD3 (Twin Delayed DDPG) is three targeted patches:

Twin critics → take the min. Train two independent critics Q_θ₁, Q_θ₂ and form the target with the smaller one:
y = r + γ · min_i=1,2 Q_{θ'_i}( s', ã )
A single critic's noise biases up (the actor chases its peaks); the min of two noisy estimates biases down, which is the safe direction. This is the lesson-06 Double-DQN idea in a clipped form (TD3 calls it clipped double Q-learning), adapted to continuous actions.
Delayed policy updates. Update the actor (and targets) every d critic steps (e.g. d=2; here d is the delay, not the action dimension). A moving actor chasing a noisy critic is a feedback loop; let the critic settle first.
Target-policy smoothing. Add clipped noise to the target action, ã = μ_φ'(s') + clip(𝒩(0,σ̃), −c, c), so the critic can't overfit a razor-thin spike in Q(s',·). It regularizes the value landscape — similar actions should have similar value.
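The three patches fit in one small sketch. It assumes scalar actions and function-valued target networks; the final clip of ã to the valid action range and the caller-supplied noise sample are added assumptions, made so the function stays deterministic and testable.

```javascript
const clip = (x, lo, hi) => Math.min(hi, Math.max(lo, x));

// TD3 target for one transition. `noise` is a standard-normal sample from the caller.
function td3Target({ r, sNext }, gamma, targetActor, targetQ1, targetQ2,
                   { sigmaTilde, c, aLow, aHigh }, noise) {
  const eps = clip(sigmaTilde * noise, -c, c);                 // clipped smoothing noise
  const aTilde = clip(targetActor(sNext) + eps, aLow, aHigh);  // smoothed target action
  const q = Math.min(targetQ1(sNext, aTilde), targetQ2(sNext, aTilde)); // twin min
  return r + gamma * q;
}

// Delayed updates: critics learn every step, the actor and targets every `delay` steps.
function shouldUpdateActor(step, delay) {
  return step % delay === 0;
}

// Made-up target networks for a quick check.
const cfg = { sigmaTilde: 0.2, c: 0.5, aLow: -1, aHigh: 1 };
const y = td3Target({ r: 1, sNext: 0 }, 0.5, () => 0.5,
                    (s, a) => 1 + a, (s, a) => 2 - a, cfg, 10);
console.log(y); // 1.5: noise clipped to 0.5, action 1.0, min(2, 1) = 1
```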

The throughline

TD3 = DDPG + (twin-min, delay, smoothing). Every one of the three fixes is a variance/optimism control we have seen before, ported into the continuous-action world. The single most important one — twin critics — is what the widget toggles.
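The direction of the two biases can be checked numerically. In this simplified stand-in (not the widget's code), every candidate action has true value 0, each critic sees it with independent unit Gaussian noise, and the "actor" is replaced by picking the action critic 1 likes best.

```javascript
function randn() {
  const u = 1 - Math.random(), v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Average target value under a single critic vs. the min of two critics.
function biasExperiment(nActions, trials) {
  let single = 0, twin = 0;
  for (let t = 0; t < trials; t++) {
    const q1 = Array.from({ length: nActions }, randn);
    const q2 = Array.from({ length: nActions }, randn);
    // Stand-in for the actor: the action with the highest critic-1 estimate.
    let best = 0;
    for (let a = 1; a < nActions; a++) if (q1[a] > q1[best]) best = a;
    single += q1[best];                     // single-critic target value
    twin += Math.min(q1[best], q2[best]);   // twin-critic (min) target value
  }
  return { single: single / trials, twin: twin / trials };
}

const bias = biasExperiment(5, 20000);
console.log(bias.single > 0, bias.twin < 0); // true true (the true value is 0)
```

The single-critic average sits well above zero, while the twin-min average sits slightly below it: a small pessimistic error in exchange for removing a large optimistic one.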

SAC — make exploration part of the objective

DDPG/TD3 explore by bolting noise on after the fact, and tuning that noise is fiddly. Soft Actor-Critic (SAC) instead changes the goal: maximize reward plus the policy's own entropy ℋ(π(·|s)). The entropy term pays the agent to stay uncertain — to keep exploring — unless a clearly-better action justifies committing.

J(π) = 𝔼 [ Σ_t γ^t ( r_t + α · ℋ(π(·|s_t)) ) ]
ℋ(π(·|s)) = −𝔼_{a∼π(·|s)}[ log π(a|s) ]
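To get a feel for the entropy bonus, the sketch below evaluates the objective on a fixed trajectory. It assumes a plain 1-D Gaussian policy, whose entropy has a closed form; SAC's squashed Gaussian differs, so treat the numbers as illustrative only.

```javascript
// Differential entropy of a 1-D Gaussian with standard deviation sigma.
const gaussianEntropy = (sigma) =>
  0.5 * Math.log(2 * Math.PI * Math.E * sigma * sigma);

// Discounted maximum-entropy return: sum_t gamma^t * (r_t + alpha * H_t).
function softReturn(rewards, entropies, gamma, alpha) {
  let total = 0, discount = 1;
  for (let t = 0; t < rewards.length; t++) {
    total += discount * (rewards[t] + alpha * entropies[t]);
    discount *= gamma;
  }
  return total;
}

console.log(gaussianEntropy(1.0) > gaussianEntropy(0.1));   // true: wider policy, more entropy
console.log(softReturn([1, 1], [0, 0], 0.5, 0.2));          // 1.5 (no entropy bonus)
```

With α > 0, two policies that collect the same reward are no longer tied: the more random one scores higher.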

This is maximum-entropy RL. Two consequences:

The policy is stochastic again. SAC's actor outputs a distribution (a squashed Gaussian: a = tanh(μ_φ(s) + σ_φ(s)·ϵ)), so exploration is built in, not added — no separate noise schedule. It still uses twin critics + the min, inheriting TD3's overestimation fix.
The reparameterization trick. To backprop the actor through a sample, write the sample as a deterministic function of state and a fixed noise source ϵ∼𝒩(0,I). Now the randomness is an input, not the thing we differentiate, so the critic's gradient flows back through the sampled action into φ. This is the stochastic-policy analogue of the deterministic actor gradient above.
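Both consequences show up in a one-dimensional sketch. Here mu and sigma stand in for the actor outputs μ_φ(s) and σ_φ(s), and the function name is hypothetical. The log-probability uses the standard change-of-variables correction for tanh; practical implementations usually add a small constant inside the logarithm for numerical safety, which is omitted here.

```javascript
// Reparameterized sample from a 1-D squashed Gaussian: a = tanh(mu + sigma * eps).
function sampleSquashed(mu, sigma, eps) {
  const u = mu + sigma * eps;               // pre-squash Gaussian sample
  const a = Math.tanh(u);
  // log-density of u under N(mu, sigma^2), written in terms of eps
  const logN = -0.5 * eps * eps - Math.log(sigma) - 0.5 * Math.log(2 * Math.PI);
  // change of variables for a = tanh(u): subtract log|da/du| = log(1 - a^2)
  const logProb = logN - Math.log(1 - a * a);
  // Pathwise derivatives of the action w.r.t. the actor outputs (eps held fixed).
  const dadMu = 1 - a * a;
  const dadSigma = (1 - a * a) * eps;
  return { a, logProb, dadMu, dadSigma };
}

// The noise is drawn outside and passed in, so the action is a smooth function of (mu, sigma).
const sample = sampleSquashed(0.3, 0.5, -0.7);
console.log(sample.a > -1 && sample.a < 1); // true: tanh keeps the action bounded
```

Multiplying dadMu and dadSigma by the critic's ∇_aQ gives the actor gradient, exactly as in the deterministic case but with ϵ along for the ride.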

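Finally, the temperature α (how much we value entropy) is hard to set by hand and should change as the agent learns. Automatic temperature tuning adjusts α by gradient so that the policy's entropy tracks a target ℋ̄ (often −d for a d-dimensional action): α rises when the policy is less random than the target and falls when it is more random. The usual effect is to explore widely early and sharpen late, with no manual schedule.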
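One common way to write the temperature update is a gradient step on J(α) = 𝔼[ −α · (log π(a|s) + ℋ̄) ]. The sketch below assumes that loss and parameterizes α through its logarithm so it stays positive; the function name and learning rate are illustrative.

```javascript
// One gradient-descent step on J(alpha) = mean( -alpha * (logProb + targetEntropy) ),
// taken with respect to logAlpha, where alpha = exp(logAlpha).
function updateLogAlpha(logAlpha, logProbs, targetEntropy, lr) {
  const alpha = Math.exp(logAlpha);
  let mean = 0;
  for (const lp of logProbs) mean += lp + targetEntropy;
  mean /= logProbs.length;
  const grad = -alpha * mean;               // dJ/d(logAlpha)
  return logAlpha - lr * grad;
}

// Policy entropy is about -2 (mean log-prob 2), below the target of -1: alpha should rise.
console.log(updateLogAlpha(0, [2, 2], -1, 0.1)); // 0.1
// Policy entropy is about 0, above the target of -1: alpha should fall.
console.log(updateLogAlpha(0, [0, 0], -1, 0.1)); // -0.1
```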

The lineage, in one line

DPG (gradient through the critic) → DDPG (+ replay, target nets, action noise) → TD3 (+ twin-min, delayed updates, target smoothing) → SAC (entropy bonus + reparameterized stochastic actor + auto-temperature). All four are off-policy actor–critic for a ∈ ℝ^d.

Interactive · a 1-D reacher with a deterministic actor

A point lives on a line; the target sits at the green tick. The action is a continuous nudge a ∈ [−1, 1] chosen by the actor, executed with exploration noise σ. The critic learns Q(s,a); the actor follows ∇_aQ. Two knobs are bug-is-the-lesson settings:

Exploration σ = 0 → the actor never sees alternative actions; it freezes wherever it started.
Single critic → watch the estimated Q (red) drift above the true return (grey): overestimation. With twin critics (the min) the estimate stays calibrated and the reacher converges.

Continuous 1-D reacher: actor + critic, twin-vs-single

Top: the reacher (blue) chasing the target (green). Bottom: estimated Q (red) vs the true discounted return the actor is currently achieving (grey). Train, then flip the toggles and watch the red line behave.

Controls: explore σ (default 0.25) and a twin critics (min) toggle. Readouts: train steps, |pos − target|, estimated Q, true return, and Q bias (est − true).

The core JS of the widget:

// Critic Q(s,a) ≈ linear-in-features; actor a = clip(w·s, -1, 1).
// Reward = -(next distance)^2. min of two critics fights overestimation.
function targetQ(sN) {
  const aN = actor(sN);                       // target action μ(s')
  const q1 = critic1(sN, aN), q2 = critic2(sN, aN);
  return twin ? Math.min(q1, q2) : q1;        // <-- the TD3 fix
}
function trainStep() {
  const a = clip(actor(s) + sigma * randn(), -1, 1);   // explore
  const sN = step(s, a);                               // env step
  const r = -(dist(sN) ** 2);                          // JS rejects -(x)**2 as a syntax error
  const y = r + gamma * targetQ(sN);                   // Bellman target
  critics.forEach(c => c.fit(s, a, y));                // TD update
  // deterministic policy gradient: nudge actor along dQ/da
  const g = critic1.dQda(s, actor(s));
  actor.w += lrA * g * s;
  s = sN;
}

Where this goes next

Continuous control is the engine of robotics (lessons 21–23): a manipulator's joint torques and a car's steering/throttle are exactly the a ∈ ℝ^d we just learned to handle, and SAC/TD3 are the workhorses there. It also underlies continuous allocation problems like portfolio weights (lesson 25).

But every method here — DDPG, TD3, SAC — still assumes a live environment to collect fresh transitions into the replay buffer. For a real robot, a patient, or a trading account, online interaction is dangerous or expensive. Next we ask: can we learn from a fixed dataset, with no new interaction at all? That is offline (batch) RL — and it turns out the off-policy bootstrap we just relied on becomes the central hazard.

Takeaway

When actions are continuous you cannot argmax — so a deterministic actor μ_φ(s) emits the action and learns by riding the critic's gradient ∇_aQ (DPG). DDPG wraps this in DQN's replay + target nets + injected exploration noise; TD3 kills the inherited Q-overestimation with twin critics (the min), delayed updates, and target smoothing; SAC makes exploration intrinsic by rewarding entropy, samples through the reparameterization trick, and auto-tunes its temperature.