Lesson 15 / 32 · continuous control

Frontier — discrete to continuous control

A robot's action is a torque vector in ℝ^d. An argmax_a over infinitely many actions is impossible, so we make the actor output the action and let the critic's gradient teach it. DDPG → TD3 → SAC.

What broke

Two of our pillars quietly assumed a finite, enumerable action set: the greedy argmax over Q-values (value branch) and the softmax over action logits (policy branch).

Now point a robot arm at a cup. The action is a vector of joint torques a ∈ ℝ^d: continuous and uncountable. There is no list to sweep for the argmax, and no finite set of logits to softmax. You cannot enumerate ℝ^d.

The continuous-control problem
Both branches of the fork (lesson 00) hit a wall at the same place: choosing the best continuous action. We need a way to pick the action a without ever enumerating the action space, and a way to make that choice differentiable so we can learn it by gradient.
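
Discretizing the action space does not rescue the argmax either. The sketch below counts how many joint actions a grid-based sweep would have to score; the helper name `gridSize` and the choice of 11 levels per joint are illustrative assumptions, not part of the lesson.

```javascript
// Number of joint actions when each of d action dimensions is cut into `bins` levels.
// Hypothetical helper, for illustration only.
function gridSize(bins, d) {
  let n = 1;
  for (let i = 0; i < d; i++) n *= bins;  // bins^d, by repeated multiplication
  return n;
}

// A coarse 11-level grid per joint, for arms with more and more joints.
for (const d of [1, 2, 4, 7]) {
  console.log(`d=${d}: ${gridSize(11, d)} actions to score per argmax`);
}
```

Even this coarse grid grows as bins^d: seven joints already give 19,487,171 candidate actions per decision, and a finer grid only makes it worse.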

The fix: let the actor output the action

Instead of searching for argmax_a Q(s,a), train a network μ_φ(s), a deterministic actor, that emits the action directly:

a = μ_φ(s)  ∈  ℝ^d

If μ is doing its job, μ_φ(s) ≈ argmax_a Q(s,a), but it produces the answer in one forward pass with no search. How do we train μ to track the argmax? We have a critic Q_θ(s,a) (lesson 02). We want the action that makes Q large, so we push φ in the direction that increases Q_θ(s, μ_φ(s)). By the chain rule, the gradient flows through the critic into the actor:

∇_φ J(φ) = 𝔼_s [ ∇_a Q_θ(s,a) |_{a=μ_φ(s)} · ∇_φ μ_φ(s) ]

This is the Deterministic Policy Gradient (DPG). Read it: "ask the critic which way to nudge the action to raise Q (∇_a Q), then move the actor's weights to produce that nudge (∇_φ μ)." The critic is differentiable in a, so it tells the actor exactly which direction is "uphill", replacing the discrete argmax with a gradient step.
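
To make the chain rule concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption chosen for illustration: the critic is a fixed, exact quadratic Q(s,a) = −(a − 2s)² (so the best action is a = 2s), the actor is linear, μ_φ(s) = φ·s, and a small fixed batch of states stands in for 𝔼_s.

```javascript
// Toy critic, assumed known and exact: Q(s, a) = -(a - 2s)^2, maximized at a = 2s.
const Q = (s, a) => -((a - 2 * s) ** 2);
const dQda = (s, a) => -2 * (a - 2 * s);   // ∇_a Q(s, a)

// Linear deterministic actor μ_φ(s) = φ · s, so ∇_φ μ_φ(s) = s.
let phi = 0;
const mu = (s) => phi * s;

const states = [-1, -0.5, 0.5, 1];  // fixed batch standing in for 𝔼_s
const lr = 0.1;

for (let step = 0; step < 200; step++) {
  // DPG: average over states of ∇_a Q(s,a)|_{a=μ(s)} · ∇_φ μ(s)
  let grad = 0;
  for (const s of states) grad += dQda(s, mu(s)) * s;
  grad /= states.length;
  phi += lr * grad;                 // gradient ascent on Q(s, μ_φ(s))
}

console.log("learned phi:", phi, "Q at s=1:", Q(1, mu(1)));
```

The actor never searches over actions; it only follows the critic's slope, and φ settles at the value whose actions sit on the critic's peak (φ = 2 in this toy).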

Why this is still the fork (lesson 00)
This is pure actor–critic. The critic Q_θ learns "how good?" (value branch, lesson 02). The actor μ_φ learns "what to do?" (policy branch, lesson 03). The new twist vs the stochastic policy gradient (∇log π · A, lesson 03/07): here the gradient passes through the critic, so there is no high-variance score-function estimator; the actor learns from the critic's slope directly.

DDPG = DPG + the deep-RL toolkit

Deep Deterministic Policy Gradient (DDPG) takes DPG and bolts on the same stabilizers that let DQN work (lesson 02/06). It is off-policy actor–critic:

| Piece | What it does | From |
| --- | --- | --- |
| Replay buffer | Store (s, a, r, s') transitions; train on random minibatches to break temporal correlation. | lesson 02 (DQN) |
| Target networks | Slow copies μ_φ', Q_θ' (Polyak: θ' ← τθ + (1−τ)θ') give a stable bootstrap target. | lesson 02 |
| Exploration noise | μ is deterministic, so to explore we act with a = μ_φ(s) + 𝒩(0,σ). | new: needed because μ has no randomness |
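
The first two pieces are small enough to sketch directly. The class name `ReplayBuffer`, the function `polyakUpdate`, and the flat-array representation of network weights are assumptions made for this illustration; real implementations store tensors and usually sample without the simplifications used here.

```javascript
// Fixed-capacity ring buffer of transitions {s, a, r, sNext}.
class ReplayBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.data = [];
    this.next = 0;                       // slot the next push will write to
  }
  push(transition) {
    if (this.data.length < this.capacity) this.data.push(transition);
    else this.data[this.next] = transition;   // overwrite the oldest entry
    this.next = (this.next + 1) % this.capacity;
  }
  // Uniform random minibatch, drawn with replacement for simplicity.
  sample(batchSize, rand = Math.random) {
    const batch = [];
    for (let i = 0; i < batchSize; i++) {
      batch.push(this.data[Math.floor(rand() * this.data.length)]);
    }
    return batch;
  }
}

// Polyak averaging: θ' ← τ θ + (1 − τ) θ', element-wise and in place.
function polyakUpdate(targetParams, onlineParams, tau) {
  for (let i = 0; i < targetParams.length; i++) {
    targetParams[i] = tau * onlineParams[i] + (1 - tau) * targetParams[i];
  }
}

// Small demo: a target that trails the online weights.
const demoTarget = [0, 0];
polyakUpdate(demoTarget, [1, 2], 0.005);
console.log(demoTarget);  // moved only 0.5% of the way toward [1, 2]
```

A small τ is what makes the copy "slow": the bootstrap target drifts gradually instead of jumping with every critic update.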

The critic is trained by the same Bellman/TD target as Q-learning, except the next action comes from the target actor rather than an argmax:

y = r + γ · Q_θ'( s', μ_φ'(s') )     loss = ( Q_θ(s,a) − y )^2

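
As a sketch of how the target and loss are evaluated on a minibatch: the three closures below are hypothetical stand-ins for the target actor, target critic, and online critic (real ones would be neural networks), and the function names are illustrative.

```javascript
// Hypothetical stand-ins; any functions with these shapes would do.
const targetActor = (s) => 0.5 * s;                   // μ_φ'(s')
const targetCritic = (s, a) => -(s * s) - a * a;      // Q_θ'(s', a')
const critic = (s, a) => -(s * s) - 0.5 * a * a;      // Q_θ(s, a), the one being trained

// y = r + γ · Q_θ'(s', μ_φ'(s')): the next action comes from the target actor, not an argmax.
function ddpgTarget(r, sNext, gamma) {
  return r + gamma * targetCritic(sNext, targetActor(sNext));
}

// Mean squared TD error over a minibatch of {s, a, r, sNext} transitions.
function criticLoss(batch, gamma) {
  let sum = 0;
  for (const { s, a, r, sNext } of batch) {
    const y = ddpgTarget(r, sNext, gamma);   // treated as a constant when differentiating
    sum += (critic(s, a) - y) ** 2;
  }
  return sum / batch.length;
}

console.log(criticLoss([{ s: 0, a: 0, r: 1, sNext: 2 }], 0.5));
```
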
The exploration knob is load-bearing
Set σ = 0 and the agent acts exactly as its current (wrong) actor says. With no noise it never tries any action other than the actor's own choice, so the replay buffer holds no evidence that a different action would do better, and learning stalls at whatever the initial actor happened to favor. Zero exploration noise is a bug-is-the-lesson setting in the widget below.

TD3 — three fixes for the overestimation we already met

DDPG inherits a disease from Q-learning: Q-value overestimation. We met it in lesson 02/06: the max in the bootstrap target systematically picks up positive noise, so the estimate creeps above the truth. DDPG has no explicit max, but its actor climbs the critic and so plays the same role: it learns to exploit those phantom high values. TD3 (Twin Delayed DDPG) is three targeted patches:

  1. Twin critics → take the min. Train two independent critics Q_{θ1}, Q_{θ2} and form the target with the smaller one:
    y = r + γ · min_{i=1,2} Q_{θ'i}( s', ã )
    A single critic's noise biases up (the actor chases its peaks); the min of two noisy estimates biases down, which is the safe direction. This is the lesson-06 Double-DQN idea adapted to continuous actions (the TD3 paper calls it clipped double Q-learning).
  2. Delayed policy updates. Update the actor (and targets) once every d critic steps (e.g. d=2; this d is the delay, not the action dimension). A moving actor chasing a noisy critic is a feedback loop; let the critic settle first.
  3. Target-policy smoothing. Add clipped noise to the target action, ã = μ_φ'(s') + clip(𝒩(0,σ̃), −c, c), so the critic can't overfit a razor-thin spike in Q(s',·). It regularizes the value landscape: similar actions should have similar value.

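
The three patches can be sketched in a few lines. The stand-in target networks, the default clip bound c = 0.5, and the extra clip of ã to the action range [−1, 1] are assumptions made for this illustration; the smoothing noise is passed in as an argument so the example stays deterministic, where a real agent would draw it from 𝒩(0,σ̃).

```javascript
const clip = (x, lo, hi) => Math.min(hi, Math.max(lo, x));

// Hypothetical stand-ins for the target actor and the two target critics.
const targetActor = (s) => 0.5 * s;
const targetCritic1 = (s, a) => -((a - s) ** 2);
const targetCritic2 = (s, a) => -((a - s) ** 2) + 0.3;   // deliberately optimistic twin

function td3Target(r, sNext, gamma, noise, c = 0.5, aLo = -1, aHi = 1) {
  // Patch 3: clipped noise on the target action (then kept inside the action range).
  const aTilde = clip(targetActor(sNext) + clip(noise, -c, c), aLo, aHi);
  // Patch 1: bootstrap from the smaller of the two target critics.
  const q = Math.min(targetCritic1(sNext, aTilde), targetCritic2(sNext, aTilde));
  return r + gamma * q;
}

// Patch 2: the actor and the target networks move once every d critic steps.
const d = 2;
const schedule = [];
for (let step = 1; step <= 6; step++) {
  schedule.push(step % d === 0 ? "critic + actor + targets" : "critic only");
}

console.log(td3Target(0, 0.4, 1, -10));  // the noise of -10 is clipped to -0.5 before use
console.log(schedule);
```

Because of the min, the optimistic second critic never reaches the target here; the pessimistic one sets the value.
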
The throughline
TD3 = DDPG + (twin-min, delay, smoothing). Every one of the three fixes is a variance/optimism control we have seen before, ported into the continuous-action world. The single most important one — twin critics — is what the widget toggles.
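
The optimism that the twin-min counters can be checked numerically. In this sketch every action's true value is 0 and each estimate is the truth plus unit Gaussian noise; the seeded generator (`mulberry32`), the ten actions, and the trial count are illustrative assumptions.

```javascript
// Small seeded PRNG (mulberry32) so the run is reproducible; any uniform generator would do.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
const rand = mulberry32(42);

// Standard normal sample via Box-Muller.
function randn() {
  const u1 = 1 - rand();   // in (0, 1], avoids log(0)
  const u2 = rand();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

const trials = 20000, numActions = 10;
let sumSingle = 0, sumMax = 0, sumMinOfTwo = 0;
for (let t = 0; t < trials; t++) {
  // True value of every action is 0; the estimates are pure noise around it.
  const estimates = Array.from({ length: numActions }, () => randn());
  sumSingle += estimates[0];                  // one estimate: unbiased
  sumMax += Math.max(...estimates);           // greedy pick over noisy estimates: biased up
  sumMinOfTwo += Math.min(randn(), randn());  // min of two estimates of one action: biased down
}
const meanSingle = sumSingle / trials;
const meanMax = sumMax / trials;
const meanMinOfTwo = sumMinOfTwo / trials;
console.log({ meanSingle, meanMax, meanMinOfTwo });
```

A single estimate averages out near the truth, the max lands well above it, and the min of two lands below it, which is the "safe direction" the twin critics trade for.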

SAC — make exploration part of the objective

DDPG/TD3 explore by bolting noise on after the fact, and tuning that noise is fiddly. Soft Actor-Critic (SAC) instead changes the goal: maximize reward plus the policy's own entropy ℋ(π(·|s)). The entropy term pays the agent to stay uncertain — to keep exploring — unless a clearly-better action justifies committing.

J(π) = 𝔼 [ Σ_t γ^t ( r_t + α · ℋ(π(·|s_t)) ) ]     ℋ(π(·|s)) = −𝔼_{a∼π} [ log π(a|s) ]
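
For a feel of what the entropy term measures, here is a sketch for a 1-D Gaussian policy, an assumed policy class chosen because its entropy has a closed form. The function names are illustrative; the integration routine is only a numerical check that the closed form agrees with ℋ = −𝔼[log π].

```javascript
// Differential entropy (in nats) of a 1-D Gaussian policy N(mean, sigma^2).
function gaussianEntropy(sigma) {
  return 0.5 * Math.log(2 * Math.PI * Math.E * sigma * sigma);
}

// log π(a|s) for the same Gaussian.
function gaussianLogProb(a, mean, sigma) {
  const z = (a - mean) / sigma;
  return -0.5 * z * z - Math.log(sigma) - 0.5 * Math.log(2 * Math.PI);
}

// Numerical check of ℋ = −𝔼[log π]: midpoint-rule integral of −π(a) · log π(a) over ±10σ.
function entropyByIntegration(mean, sigma, n = 20000) {
  const lo = mean - 10 * sigma;
  const width = (20 * sigma) / n;
  let h = 0;
  for (let i = 0; i < n; i++) {
    const a = lo + (i + 0.5) * width;
    const logp = gaussianLogProb(a, mean, sigma);
    h -= Math.exp(logp) * logp * width;
  }
  return h;
}

// Entropy-augmented per-step term r_t + α · ℋ(π(·|s_t)).
function softReward(r, alpha, sigma) {
  return r + alpha * gaussianEntropy(sigma);
}

for (const sigma of [0.1, 0.5, 1.0]) {
  console.log(sigma, gaussianEntropy(sigma), entropyByIntegration(0, sigma));
}
```

A wider policy earns a larger bonus, and a very narrow one earns a negative one (differential entropy can go below zero, as it does for σ = 0.1 here), so committing to an action has to be paid for in reward.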

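This is maximum-entropy RL. Two consequences:

  1. The actor becomes stochastic. π_φ(·|s) is a distribution (typically a squashed Gaussian), and actions are sampled through the reparameterization trick so the critic's gradient can still flow into φ.
  2. The critic's target becomes "soft": it bootstraps from the value of a sampled next action minus α · log π(a'|s'), so entropy is counted as part of the return.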

Finally, the temperature α (how much we value entropy) is hard to set by hand and should change as the agent learns. Automatic temperature tuning adjusts α by gradient so that the policy's entropy tracks a target ℋ̄ (often −d for a d-dimensional action): explore widely early, sharpen late, no manual schedule.
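
One common way to implement this update, sketched under stated assumptions: α is stored as log α so it stays positive, the loss is −log α · (mean log π + ℋ̄), and the function name `updateLogAlpha` and the learning rate are illustrative. This is a variant seen in practice, not necessarily the exact rule the lesson has in mind.

```javascript
// One gradient-descent step on log α.
// logProbs: log π(a|s) for a batch of actions sampled from the current policy.
function updateLogAlpha(logAlpha, logProbs, targetEntropy, lr) {
  const meanLogProb = logProbs.reduce((acc, lp) => acc + lp, 0) / logProbs.length;
  // Loss L(log α) = -log α · (mean log π + H̄), so dL/d(log α) = -(mean log π + H̄).
  // Since the entropy estimate is -mean log π, this equals (entropy estimate - H̄).
  const grad = -(meanLogProb + targetEntropy);
  return logAlpha - lr * grad;
}

const actionDim = 2;
const targetEntropy = -actionDim;   // the "often −d" heuristic from the text

// Policy sharper than the target (entropy estimate -3 < -2): α goes up.
const sharper = updateLogAlpha(0, [3, 3], targetEntropy, 0.1);
// Policy more random than the target (entropy estimate 0 > -2): α goes down.
const looser = updateLogAlpha(0, [0, 0], targetEntropy, 0.1);

console.log(Math.exp(sharper), Math.exp(looser));
```

The sign logic is the whole mechanism: α rises when the policy is sharper than the target, which raises the entropy bonus, and falls when the policy is more random than it needs to be.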

The lineage, in one line
DPG (gradient through the critic) → DDPG (+ replay, target nets, action noise) → TD3 (+ twin-min, delayed updates, target smoothing) → SAC (entropy bonus + reparameterized stochastic actor + auto-temperature). All four are off-policy actor–critic for a ∈ ℝ^d.

Interactive · a 1-D reacher with a deterministic actor

A point lives on a line; the target sits at the green tick. The action is a continuous nudge a ∈ [−1, 1] chosen by the actor, executed with exploration noise σ. The critic learns Q(s,a); the actor follows ∇_a Q. Two knobs are bug-is-the-lesson settings: zero exploration noise (σ = 0) and the twin-critic toggle switched off.

[Interactive widget: Continuous 1-D reacher: actor + critic, twin-vs-single]
Top: the reacher (blue) chasing the target (green). Bottom: estimated Q (red) vs the true discounted return the actor is currently achieving (grey). Train, then flip the toggles and watch the red line behave.
Readouts: train steps, |pos − target|, estimated Q, true return, Q bias (est − true).
The core JS (≈30 lines):
// Critic Q(s,a) ≈ linear-in-features; actor a = clip(w·s, -1, 1).
// Reward = -(next distance)^2. The min of two critics fights overestimation.
// s, twin, sigma, gamma, lrA, critics and the helpers (clip, randn, step, dist)
// are not shown in this excerpt.
function targetQ(sN) {
  const aN = actor(sN);                       // target action μ(s')
  const q1 = critic1(sN, aN), q2 = critic2(sN, aN);
  return twin ? Math.min(q1, q2) : q1;        // <-- the TD3 fix
}
function trainStep() {
  const a = clip(actor(s) + sigma * randn(), -1, 1);   // explore
  const sN = step(s, a);                               // env transition
  const r = -(dist(sN) ** 2);                          // JS rejects -(x)**2, so parenthesize the power
  const y = r + gamma * targetQ(sN);                   // Bellman target
  critics.forEach(c => c.fit(s, a, y));                // TD update
  // deterministic policy gradient: nudge actor along dQ/da
  const g = critic1.dQda(s, actor(s));
  actor.w += lrA * g * s;                              // chain rule: da/dw = s while the clip is inactive
  s = sN;
}
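
The excerpt calls `clip` and `randn` without showing them. One plausible pair of definitions is sketched below; the widget's own versions are not shown in the lesson, so treat these as assumptions (in particular the optional `rand` parameter, added here so the noise source can be swapped out).

```javascript
// Clamp x into the interval [lo, hi].
function clip(x, lo, hi) {
  return Math.min(hi, Math.max(lo, x));
}

// Standard normal sample via the Box-Muller transform.
// `rand` is any function returning a uniform number in [0, 1).
function randn(rand = Math.random) {
  const u1 = 1 - rand();   // shift to (0, 1] so Math.log never sees 0
  const u2 = rand();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Noisy action as in trainStep: deterministic output plus scaled noise, kept in range.
const sigma = 0.2;
const noisy = clip(0.95 + sigma * randn(), -1, 1);
console.log(noisy);
```

With `sigma = 0` the noise term vanishes and the executed action is exactly the actor's output, which is the stalled-exploration setting described earlier.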

Where this goes next

Continuous control is the engine of robotics (lessons 21–23): a manipulator's joint torques and a car's steering/throttle are exactly the a ∈ ℝ^d we just learned to handle, and SAC/TD3 are the workhorses there. It also underlies continuous allocation problems like portfolio weights (lesson 25).

But every method here — DDPG, TD3, SAC — still assumes a live environment to collect fresh transitions into the replay buffer. For a real robot, a patient, or a trading account, online interaction is dangerous or expensive. Next we ask: can we learn from a fixed dataset, with no new interaction at all? That is offline (batch) RL — and it turns out the off-policy bootstrap we just relied on becomes the central hazard.

Takeaway
When actions are continuous you cannot argmax, so a deterministic actor μ_φ(s) emits the action and learns by riding the critic's gradient ∇_a Q (DPG). DDPG wraps this in DQN's replay + target nets + injected exploration noise; TD3 kills the inherited Q-overestimation with twin critics (the min), delayed updates, and target smoothing; SAC makes exploration intrinsic by rewarding entropy, samples through the reparameterization trick, and auto-tunes its temperature.