Frontier — discrete to continuous control
A robot's action is a torque vector in ℝ^d. An argmax_a over infinitely many actions is impossible, so we make the actor output the action and let the critic's gradient teach it. DDPG → TD3 → SAC.
What broke
Two of our pillars quietly assumed a finite, enumerable action set.
- Value-based (lesson 02). Q-learning acts greedily: a* = argmax_a Q(s,a). To compute that argmax you sweep every action and pick the best.
- Policy-based (lesson 03). The softmax policy π(a|s) = softmax(θ_{s,a}) needs one logit per action to normalize over.
Now point a robot arm at a cup. The action is a vector of joint torques a ∈ ℝ^d: continuous and uncountable. There is no list to sweep for the argmax, and no finite set of logits to softmax. You cannot enumerate ℝ^d.
The fix: let the actor output the action
Instead of searching for argmax_a Q(s,a), train a network μ_φ(s), a deterministic actor, that emits the action directly:
a = μ_φ(s)
If μ is doing its job, μ_φ(s) ≈ argmax_a Q(s,a), but it produces the answer in one forward pass with no search. How do we train μ to track the argmax? We have a critic Q_θ(s,a) (lesson 02). We want the action that makes Q large, so we push φ in the direction that increases Q_θ(s, μ_φ(s)). By the chain rule, the gradient flows through the critic into the actor:
∇_φ 𝔼_s[ Q_θ(s, μ_φ(s)) ] = 𝔼_s[ ∇_a Q_θ(s,a)|_{a=μ_φ(s)} · ∇_φ μ_φ(s) ]
This is the Deterministic Policy Gradient (DPG). Read it: "ask the critic which way to nudge the action to raise Q (∇_a Q), then move the actor's weights to produce that nudge (∇_φ μ)." The critic is differentiable in a, so it tells the actor exactly which direction is "uphill", replacing the discrete argmax with a gradient step.
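The chain-rule update can be tried on a toy problem. The sketch below is only an illustration under stated assumptions: the critic is a fixed, hand-written function Q(s,a) = −(a − 2s)² rather than a learned network, and the actor is the one-parameter linear map μ_w(s) = w·s. Both are hypothetical choices made so the gradients can be written by hand. The best action is a = 2s, so the actor weight should move toward 2.

```python
import numpy as np

def critic_grad_a(s, a):
    """dQ/da for the toy critic Q(s, a) = -(a - 2s)^2 (assumed for illustration)."""
    return -2.0 * (a - 2.0 * s)

def dpg_step(w, states, lr=0.1):
    """One deterministic-policy-gradient ascent step for the actor mu_w(s) = w * s."""
    actions = w * states                    # a = mu_w(s)
    dq_da = critic_grad_a(states, actions)  # critic: which way is uphill in a
    dmu_dw = states                         # d mu_w(s) / d w
    grad_w = np.mean(dq_da * dmu_dw)        # chain rule, averaged over the batch
    return w + lr * grad_w                  # gradient ascent on Q(s, mu_w(s))

rng = np.random.default_rng(0)
w = 0.0
for _ in range(200):
    states = rng.uniform(-1.0, 1.0, size=64)
    w = dpg_step(w, states)

print(f"learned actor weight: {w:.4f} (optimum is 2)")
```

No action is ever enumerated: the actor only follows the slope the critic reports at the action it currently outputs.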
DDPG = DPG + the deep-RL toolkit
Deep Deterministic Policy Gradient (DDPG) takes DPG and bolts on the same stabilizers that let DQN work (lessons 02 and 06). It is an off-policy actor–critic method:
| Piece | What it does | From |
|---|---|---|
| Replay buffer | Store (s,a,r,s') transitions; train on random minibatches to break temporal correlation. | lesson 02 (DQN) |
| Target networks | Slow copies μ_φ', Q_θ' (Polyak averaging: θ' ← τθ + (1−τ)θ' with small τ, and likewise for φ') give a stable bootstrap target. | lesson 02 |
| Exploration noise | μ is deterministic, so to explore we act with a = μ_φ(s) + ε, where ε ∼ 𝒩(0,σ). | new; needed because μ has no randomness |
The critic is trained by the same Bellman/TD target as Q-learning, except the next action comes from the target actor rather than an argmax:
y = r + γ · Q_θ'(s', μ_φ'(s'))
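The pieces in the table, together with the critic target, can be sketched as three small functions. This is a minimal sketch, not a full DDPG implementation: the function names, the action bounds [−1, 1], and the `done` mask that switches off bootstrapping at terminal states are assumptions added for illustration. `q_next` stands for the already-computed value of the target critic at the target actor's action.

```python
import numpy as np

def explore_action(mu_s, sigma, rng, low=-1.0, high=1.0):
    """Behaviour action: actor output plus Gaussian noise, clipped to assumed bounds."""
    noise = rng.normal(0.0, sigma, size=np.shape(mu_s))
    return np.clip(mu_s + noise, low, high)

def ddpg_target(r, q_next, done, gamma=0.99):
    """TD target y = r + gamma * Q_target(s', mu_target(s')); no bootstrap when done == 1."""
    return r + gamma * (1.0 - done) * q_next

def polyak_update(target, online, tau=0.005):
    """Soft update theta' <- tau * theta + (1 - tau) * theta' for each parameter array."""
    return [tau * w + (1.0 - tau) * w_t for w_t, w in zip(target, online)]

rng = np.random.default_rng(0)
a = explore_action(np.array([0.9, -0.2]), sigma=0.1, rng=rng)
y = ddpg_target(r=np.array([1.0, 0.0]), q_next=np.array([2.0, 5.0]),
                done=np.array([0.0, 1.0]), gamma=0.9)
new_target = polyak_update([np.zeros(2)], [np.ones(2)], tau=0.1)
print(a, y, new_target[0])
```

With a small τ the target parameters trail the online ones slowly, which is what keeps the bootstrap target from chasing itself.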
TD3 — three fixes for the overestimation we already met
DDPG inherits a disease from Q-learning: Q-value overestimation. We met it in lessons 02 and 06: the max in the bootstrap target systematically picks up positive noise, so the estimate creeps above the truth. In DDPG the actor plays the role of that max, so it learns to exploit those phantom high values. TD3 (Twin Delayed DDPG) is three targeted patches:
- Twin critics → take the min. Train two independent critics Q_{θ_1}, Q_{θ_2} and form the target with the smaller one:
  y = r + γ · min_{i=1,2} Q_{θ'_i}(s', ã)
  Here ã is the smoothed target action defined in the third fix below. A single critic's noise biases up (the actor chases its peaks); the min of two noisy estimates biases down, which is the safe direction. This is the lesson-06 Double-DQN idea adapted to continuous actions, in the form known as clipped double Q-learning.
- Delayed policy updates. Update the actor (and targets) once every d critic steps (e.g. d = 2; this delay d is unrelated to the action dimension d). A moving actor chasing a noisy critic is a feedback loop; let the critic settle first.
- Target-policy smoothing. Add clipped noise to the target action, ã = μ_φ'(s') + clip(ε, −c, c) with ε ∼ 𝒩(0,σ̃), so the critic can't overfit a razor-thin spike in Q(s',·). This regularizes the value landscape: similar actions should have similar value.
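The three fixes combine into one target computation plus an update schedule. The sketch below is illustrative only: `actor_targ`, `q1_targ`, and `q2_targ` are hypothetical stand-in functions rather than trained networks, the `done` mask and the action bounds are assumptions, and the defaults σ̃ = 0.2, c = 0.5 are commonly used values, not something this lesson prescribes.

```python
import numpy as np

def td3_target(r, done, s_next, actor_targ, q1_targ, q2_targ, rng,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """Clipped double-Q target with target-policy smoothing."""
    a_next = actor_targ(s_next)
    noise = np.clip(rng.normal(0.0, sigma, size=a_next.shape), -c, c)
    a_tilde = np.clip(a_next + noise, a_low, a_high)  # smoothed target action
    q_min = np.minimum(q1_targ(s_next, a_tilde), q2_targ(s_next, a_tilde))
    return r + gamma * (1.0 - done) * q_min

# Toy stand-ins (assumptions): the second critic is uniformly more optimistic.
def actor_targ(s):
    return 0.5 * s

def q1_targ(s, a):
    return -(a - s) ** 2

def q2_targ(s, a):
    return -(a - s) ** 2 + 0.3

rng = np.random.default_rng(0)
r, done = np.array([1.0, 1.0]), np.array([0.0, 1.0])
s_next = np.array([0.4, -0.8])
y = td3_target(r, done, s_next, actor_targ, q1_targ, q2_targ, rng)
y_no_noise = td3_target(r, done, s_next, actor_targ, q1_targ, q2_targ, rng, sigma=0.0)

# Delayed policy updates: the actor (and targets) move once per `policy_delay` critic steps.
policy_delay = 2
actor_updates = sum(1 for step in range(1, 11) if step % policy_delay == 0)
print(y, y_no_noise, actor_updates)
```

Because the min always selects the pessimistic critic here, the optimistic offset of the second critic never reaches the target.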
SAC — make exploration part of the objective
DDPG/TD3 explore by bolting noise on after the fact, and tuning that noise is fiddly. Soft Actor-Critic (SAC) instead changes the goal: maximize reward plus the policy's own entropy ℋ(π(·|s)). The entropy term pays the agent to stay uncertain, that is, to keep exploring, unless a clearly better action justifies committing.
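Written out, "reward plus entropy" is usually stated in the following form. This is the standard discounted maximum-entropy objective as it is commonly written, given here as a reference sketch; the weight α on the entropy term is the temperature, and the discounting convention is an assumption about how this course sets up returns.

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \pi}
  \Big[ \gamma^{t} \big( r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \big) \Big],
\qquad
\mathcal{H}(\pi(\cdot \mid s)) = - \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ \log \pi(a \mid s) \big]
```

Setting α = 0 recovers the ordinary expected-return objective.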
This is maximum-entropy RL. Two consequences:
- The policy is stochastic again. SAC's actor outputs a distribution (a squashed Gaussian: a = tanh(μ_φ(s) + σ_φ(s)·ϵ)), so exploration is built in rather than added, with no separate noise schedule. It still uses twin critics and the min, inheriting TD3's overestimation fix.
- The reparameterization trick. To backprop the actor through a sample, write the sample as a deterministic function of state and a fixed noise source ϵ ∼ 𝒩(0,I). Now the randomness is an input, not the thing we differentiate, so ∇_φ flows straight through the sampled action into the critic. This is the continuous-action analogue of the actor gradient above, but for a stochastic policy.
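Both points can be seen in a few lines of NumPy. The sketch below is a minimal illustration, assuming the actor's outputs `mu` and `log_std` are given as plain arrays (no network) and adding a small constant inside the log for numerical safety. It draws a = tanh(μ + σ·ϵ), computes the log-density with the tanh change-of-variables correction, and checks by finite differences that the sample is differentiable in μ once ϵ is held fixed.

```python
import numpy as np

def sample_squashed_gaussian(mu, log_std, eps):
    """Reparameterized sample a = tanh(mu + std * eps) and its log-density."""
    std = np.exp(log_std)
    u = mu + std * eps                      # pre-squash Gaussian sample
    a = np.tanh(u)
    # log N(u; mu, std^2), summed over action dimensions
    log_prob_u = np.sum(-0.5 * eps ** 2 - log_std - 0.5 * np.log(2.0 * np.pi), axis=-1)
    # change of variables for tanh: subtract log|da/du| = log(1 - tanh(u)^2)
    log_prob = log_prob_u - np.sum(np.log(1.0 - a ** 2 + 1e-6), axis=-1)
    return a, log_prob

rng = np.random.default_rng(0)
mu, log_std = np.array([0.3, -1.2]), np.array([-0.5, 0.0])
eps = rng.standard_normal(2)                # the randomness is an input
a, logp = sample_squashed_gaussian(mu, log_std, eps)

# With eps fixed, the sample is a smooth function of mu: da/dmu = 1 - tanh(u)^2.
analytic = 1.0 - a ** 2
h = 1e-6
a_plus, _ = sample_squashed_gaussian(mu + h, log_std, eps)
numeric = (a_plus - a) / h
print(a, logp, analytic, numeric)
```

The log-density is what the entropy term needs, since the entropy is the expected value of −log π(a|s).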
Finally, the temperature α (how much we value entropy) is hard to set by hand and should change as the agent learns. Automatic temperature tuning adjusts α by gradient descent so that the policy's entropy tracks a target ℋ̄ (often −d for a d-dimensional action): explore widely early, sharpen late, no manual schedule.
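One common way to write the temperature update is a gradient step on J(α) = 𝔼[−α · (log π(a|s) + ℋ̄)]. The sketch below assumes that loss and, as many implementations do, optimizes log α so that α stays positive; the function name, learning rate, and sample log-probabilities are hypothetical.

```python
import numpy as np

def alpha_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)]."""
    alpha = np.exp(log_alpha)
    # dJ / d log_alpha = -alpha * mean(log pi + target_entropy)
    grad = -alpha * np.mean(log_probs + target_entropy)
    return log_alpha - lr * grad

action_dim = 2
target_entropy = -float(action_dim)         # the "-d" heuristic

# Policy too deterministic: entropy estimate -mean(log pi) = -3 is below the target -2.
log_alpha_up = alpha_step(0.0, np.array([3.2, 2.8]), target_entropy)
# Policy too random: entropy estimate 1 is above the target -2.
log_alpha_down = alpha_step(0.0, np.array([-1.1, -0.9]), target_entropy)
print(log_alpha_up, log_alpha_down)
```

When entropy falls below the target, α rises and the entropy bonus gets more weight; when entropy is above the target, α shrinks and reward dominates.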
Interactive · a 1-D reacher with a deterministic actor
A point lives on a line; the target sits at the green tick. The action is a continuous nudge a ∈ [−1, 1] chosen by the actor, executed with exploration noise σ. The critic learns Q(s,a); the actor follows ∇_a Q. Two knobs are bug-is-the-lesson settings:
- Exploration σ = 0 → the actor never sees alternative actions; it freezes wherever it started.
- Single critic → watch the estimated Q (red) drift above the true return (grey): overestimation. With twin critics (the min) the estimate stays calibrated and the reacher converges.
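The single-versus-twin effect can be reproduced offline with pure noise. This is a toy simulation, not the reacher itself: it assumes the true value of every action is 0, replaces the continuous action with a grid of 10 candidate actions, and gives each critic independent unit-variance Gaussian error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions, noise_std = 10_000, 10, 1.0

# True Q is 0 for every action; each critic sees it through independent noise.
q1 = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
q2 = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

# Single critic: the greedy action is chosen and evaluated by the same noisy estimate.
single = q1.max(axis=1).mean()

# Twin critics: choose the greedy action with q1, evaluate it with min(q1, q2).
greedy = q1.argmax(axis=1)
rows = np.arange(n_trials)
twin = np.minimum(q1[rows, greedy], q2[rows, greedy]).mean()

print(f"single-critic estimate: {single:+.3f}, twin-min estimate: {twin:+.3f}, truth: +0.000")
```

The single-critic average sits well above the true value of 0, while the twin-min average lands at or slightly below it, which is the conservative direction.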
Where this goes next
Continuous control is the engine of robotics (lessons 21–23): a manipulator's joint torques and a car's steering/throttle are exactly the a ∈ ℝ^d we just learned to handle, and SAC/TD3 are the workhorses there. It also underlies continuous allocation problems like portfolio weights (lesson 25).
But every method here — DDPG, TD3, SAC — still assumes a live environment to collect fresh transitions into the replay buffer. For a real robot, a patient, or a trading account, online interaction is dangerous or expensive. Next we ask: can we learn from a fixed dataset, with no new interaction at all? That is offline (batch) RL — and it turns out the off-policy bootstrap we just relied on becomes the central hazard.