Policy-based — from policy gradient to Actor–Critic
Lesson 02 learned a value and read the policy off it with an argmax. That argmax is a dead end for continuous and huge action spaces, and it can only ever produce a deterministic policy. So stop going through a value: parameterize the policy πθ and push on it directly.
What broke in the value-based branch
In lesson 02 the policy was never a thing we trained; it was a consequence. We learned Q(s,a) and acted greedily: a = argmax_a Q(s,a). That works beautifully when actions are a short discrete list you can scan. It falls apart in three ways:
- Continuous actions. A robot joint torque is a real number; there is no list to scan. argmax_a Q(s,a) becomes its own optimization problem at every step.
- Huge action spaces. Tens of thousands of items to recommend, or a vocabulary of tokens — the argmax over Q is intractable to compute and to learn accurately everywhere.
- Stochastic optima. Greedy-on-Q is deterministic. But the optimal policy is sometimes genuinely random — rock-paper-scissors, or any partially-observed game where being predictable gets you punished. A value function with a deterministic argmax cannot represent "play each move 1/3 of the time."
Each problem has the same root: we are computing the policy through a value. Cut out the middleman. Let the policy be a parameterized function πθ(a|s) — a neural net mapping a state to a distribution over actions — and optimize θ so that the policy itself gets better. For discrete actions that distribution is a softmax over logits; for continuous actions it is, say, a Gaussian whose mean and spread the net outputs. Either way there is no argmax to take.
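To make the two parameterizations above concrete, here is a minimal sketch in plain Python. It assumes the logits, mean and log-standard-deviation are handed in directly; in a real agent they would be the outputs of the policy network for the current state. The function names are illustrative, not taken from any library.

```python
import math
import random

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_discrete(logits, rng):
    """Sample an action index from softmax(logits); return (action, log_prob)."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

def sample_gaussian(mean, log_std, rng):
    """Sample a real-valued action from N(mean, exp(log_std)^2); return (action, log_prob)."""
    std = math.exp(log_std)
    action = rng.gauss(mean, std)
    log_prob = (-0.5 * ((action - mean) / std) ** 2
                - log_std - 0.5 * math.log(2 * math.pi))
    return action, log_prob

rng = random.Random(0)

# Rock-paper-scissors: equal logits give the uniform mixed strategy,
# which a deterministic argmax cannot express.
rps_probs = softmax([0.0, 0.0, 0.0])
action, logp = sample_discrete([0.0, 0.0, 0.0], rng)

# A continuous "torque": mean and log_std are placeholder values here.
torque, torque_logp = sample_gaussian(mean=0.5, log_std=-1.0, rng=rng)
```

Both samplers return the log-probability of the sampled action, because that is the quantity the policy gradient differentiates.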
The objective: just maximize expected return
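We carry over every symbol from lessons 01–02: state s, action a, reward r, discount γ, return G_t = Σ_{k≥0} γ^k r_{t+k}, value V^π(s) = 𝔼[G_t | s_t = s]. The goal is unchanged: find the policy with the highest expected return. What changes is that the objective is now a smooth function of θ that we can climb with gradient ascent:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G_0 \right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t \ge 0} \gamma^t r_t \right], \qquad \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)$$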
Here τ = (s_0, a_0, r_0, s_1, …) is a trajectory the policy rolls out. The catch is identical to the one in the sibling course's first lesson: the distribution we average over depends on the thing we are differentiating, so we cannot naively push the gradient inside the expectation.
The log-derivative trick — the policy-gradient theorem
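One identity unlocks everything. For any distribution pθ and function f, since ∇θ log pθ = ∇θ pθ / pθ:
$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right] = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \, \nabla_\theta \log p_\theta(x) \right]$$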
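Apply it to a full trajectory. The log-probability of a trajectory is a sum of initial-state, transition, and action terms. The dynamics P(s'|s,a) do not depend on θ, so when we differentiate that log-probability the transition terms drop out and only the action log-probabilities survive. The result is the policy-gradient theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t \ge 0} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$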
Read it as an instruction: for each action you took, push its log-probability up in proportion to the return that followed it. Good trajectories make their own actions more likely; bad ones make theirs less likely. No model of the environment, no argmax — just sampling and a log π gradient. This is the entire reason policy-based RL exists.
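The log-derivative identity is easy to check numerically on a small case. The sketch below assumes a three-outcome categorical distribution p = softmax(θ) and an arbitrary f, both chosen purely for illustration. It compares a finite-difference gradient of 𝔼[f(x)] against the score-function form 𝔼[f(x) ∇θ log pθ(x)], computed by exact enumeration.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def expected_f(theta, f):
    """E_{x ~ softmax(theta)}[f(x)], computed exactly."""
    return sum(p * fx for p, fx in zip(softmax(theta), f))

def grad_finite_difference(theta, f, eps=1e-6):
    """Central finite-difference gradient of expected_f with respect to theta."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += eps
        down = list(theta)
        down[j] -= eps
        grad.append((expected_f(up, f) - expected_f(down, f)) / (2 * eps))
    return grad

def grad_score_function(theta, f):
    """E[f(x) * grad log p(x)], enumerated over every outcome x."""
    p = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for x in range(n):
        for j in range(n):
            # For a softmax, d log p(x) / d theta_j = 1[x == j] - p_j.
            score = (1.0 if x == j else 0.0) - p[j]
            grad[j] += p[x] * f[x] * score
    return grad

theta = [0.2, -0.5, 1.0]   # illustrative logits
f = [1.0, 4.0, -2.0]       # illustrative function values
fd = grad_finite_difference(theta, f)
sf = grad_score_function(theta, f)
max_gap = max(abs(a - b) for a, b in zip(fd, sf))
print("largest disagreement:", max_gap)
```

The two gradients should agree up to finite-difference error. In REINFORCE the enumeration over x is replaced by samples, which is what makes the method usable when outcomes cannot be listed.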
REINFORCE: the gradient you can actually run
The expectation is over all trajectories, far too many to enumerate. Replace it with a Monte-Carlo average over the trajectories you actually sampled. That estimator is REINFORCE (Williams, 1992):
repeat:
    roll out a trajectory   τ = (s_0,a_0,r_0, s_1,a_1,r_1, ...) under π_θ
    compute returns         G_t = Σ_{k≥0} γ^k r_{t+k}
    estimate gradient       ĝ = Σ_t G_t · ∇_θ log π_θ(a_t|s_t)
    ascend                  θ ← θ + α · ĝ
It is unbiased: average enough rollouts and ĝ converges to the true ∇J. It needs no transition model and no value function. It is also, in its raw form, almost unusably noisy.
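The loop above can be sketched as runnable Python on a toy problem. Everything about the environment here is an assumption made for illustration: a four-state corridor with the goal at the right end, a reward of +1 on reaching it, a tabular softmax policy with one pair of logits per state, and hand-picked values for α, γ and the episode count.

```python
import math
import random

N_STATES, GOAL, MAX_STEPS = 4, 3, 50   # corridor 0-1-2-3, goal at state 3
LEFT, RIGHT = 0, 1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step(state, action):
    """Hypothetical toy dynamics: move left or right; +1 on reaching the goal."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def rollout(theta, rng):
    """Sample one trajectory [(s, a, r), ...] under the current policy."""
    state, trajectory = 0, []
    for _ in range(MAX_STEPS):
        probs = softmax(theta[state])
        action = rng.choices([LEFT, RIGHT], weights=probs, k=1)[0]
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards in one pass."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def reinforce(episodes=3000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]   # uniform initial policy
    for _ in range(episodes):
        trajectory = rollout(theta, rng)
        returns = discounted_returns([r for _, _, r in trajectory], gamma)
        # g_hat = sum_t G_t * grad log pi(a_t | s_t), accumulated over the episode.
        grad = [[0.0, 0.0] for _ in range(N_STATES)]
        for (s, a, _), g in zip(trajectory, returns):
            probs = softmax(theta[s])
            for b in (LEFT, RIGHT):
                grad[s][b] += g * ((1.0 if b == a else 0.0) - probs[b])
        for s in range(N_STATES):
            for b in (LEFT, RIGHT):
                theta[s][b] += alpha * grad[s][b]   # gradient ascent
    return theta

theta = reinforce()
p_right = [softmax(theta[s])[RIGHT] for s in range(GOAL)]
print("P(right) in states 0..2:", [round(p, 3) for p in p_right])
```

Nothing in the update refers to the transition rule inside `step`; the learner only sees sampled states, actions and rewards, which is the model-free property the text describes.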
Why REINFORCE is high-variance
Look at what scales the gradient: the raw return G_t. Suppose every action in a task earns a return between +90 and +110. REINFORCE pushes every sampled action's log-probability up, by a large amount, because every return is large and positive. The only signal distinguishing a good action from a slightly-less-good one is the small gap between +90 and +110, drowned in a wave of "everything goes up." With finite samples the estimate jerks around wildly; learning is slow and jittery.
The fix that costs nothing: subtract a baseline
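Here is the key algebraic fact. Subtract from each return any quantity b(s_t) that depends on the state but not on the action:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t \ge 0} \big( G_t - b(s_t) \big) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$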
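This leaves the gradient exactly unchanged in expectation, so it does not bias the estimate. The reason is a one-liner: a baseline that does not depend on the action factors out of the action-expectation, and the expected score function is zero (with an integral in place of the sum for continuous actions):
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[ b(s) \, \nabla_\theta \log \pi_\theta(a \mid s) \right] = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = b(s) \, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s) \, \nabla_\theta 1 = 0$$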
So we get to subtract any state-baseline for free, and we should choose the one that shrinks variance most. Now actions that beat the baseline get pushed up; actions that fall short get pushed down. Instead of "everything rises a lot," we recenter to "better-than-typical rises, worse-than-typical falls." Same expected gradient, dramatically smaller swings.
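A tiny exact calculation shows both halves of that claim. The sketch assumes a single state with two actions, a uniform softmax policy, and returns of +90 and +110 (the range used earlier). It enumerates the one-sample gradient estimate with respect to the second action's logit, with and without a baseline of 100, which is the expected return under this policy.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_stats(logits, returns, baseline):
    """Exact mean and variance of the one-sample estimator
    (G - b) * d log pi(a) / d logit_1, enumerated over both actions."""
    probs = softmax(logits)
    samples = []
    for a, (p, g) in enumerate(zip(probs, returns)):
        score = (1.0 if a == 1 else 0.0) - probs[1]
        samples.append((p, (g - baseline) * score))
    mean = sum(p * x for p, x in samples)
    var = sum(p * (x - mean) ** 2 for p, x in samples)
    return mean, var

returns = [90.0, 110.0]   # illustrative returns for action 0 and action 1
mean_raw, var_raw = gradient_stats([0.0, 0.0], returns, baseline=0.0)
mean_base, var_base = gradient_stats([0.0, 0.0], returns, baseline=100.0)
print("no baseline: mean", mean_raw, "variance", var_raw)
print("baseline 100: mean", mean_base, "variance", var_base)
```

Without the baseline the two possible samples are -45 and +55: mean 5, variance 2500. With the baseline both samples equal 5: the same mean, and zero variance in this deliberately simple case.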
The best baseline is the value function — and the branches reconverge
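Which b(s)? The near-optimal choice is the expected return from that state, which is exactly the state value V^π(s), the very object the value-based branch in lesson 02 spent all its effort learning. Plug it in and the multiplier G_t − V(s_t) becomes a sampled estimate of the advantage:
$$\hat{A}_t = G_t - V^\pi(s_t), \qquad \mathbb{E}\left[ \hat{A}_t \mid s_t, a_t \right] = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)$$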
The advantage answers exactly the right question: was this action better or worse than what the state typically yields? Positive advantage → make it more likely; negative → less likely; average action → no update.
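But where does V(s) come from? We learn it, with the TD methods from lesson 02, using a second network Vφ with its own parameters φ. Now there are two learners running together:
- Actor. The policy πθ, updated by the policy gradient with the advantage as its multiplier.
- Critic. The value estimate Vφ, updated by TD learning, which supplies the baseline the actor subtracts.
Together they form the Actor–Critic architecture.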
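One way the two learners can share a single transition is sketched below, in tabular form. Note the swap: instead of the full sampled return G_t, this variant uses the one-step TD error δ = r + γV(s') − V(s) as the advantage estimate, a common choice that lets both updates run after every step. The function name, the separate step sizes `alpha` and `beta`, and the list-based tables are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, v, s, a, r, s_next, done,
                      alpha=0.1, beta=0.2, gamma=0.9):
    """Update actor logits `theta` and critic values `v` in place
    from one transition (s, a, r, s_next); return the TD error."""
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]              # TD error, used as the advantage estimate
    probs = softmax(theta[s])          # action probabilities before the update
    v[s] += beta * delta               # critic: TD(0) step toward the target
    for b in range(len(theta[s])):
        # actor: delta * grad log pi(a | s) for a softmax policy
        theta[s][b] += alpha * delta * ((1.0 if b == a else 0.0) - probs[b])
    return delta

# Two states, two actions, everything initialised to zero.
theta = [[0.0, 0.0], [0.0, 0.0]]
v = [0.0, 0.0]
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
print("delta:", delta, "V:", v, "theta[0]:", theta[0])
```

In this single update the reward was better than the critic expected (δ = 1), so the critic raises V(0) and the actor shifts probability toward the action that was taken.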
Interactive · the baseline collapses the gradient variance
Five actions, hidden true rewards, policy πθ = softmax(θ) starting uniform — the same toy as lesson 01, now extended to show the baseline. Each step samples one action and applies one REINFORCE update with multiplier (r − b). With the baseline OFF (b=0) every reward is positive, so every action gets shoved up and the per-step gradient estimate thrashes — watch the gradient-variance KPI stay high and the curve stay jagged. Toggle the baseline ON (b = running average reward, a learned V stand-in) and the variance collapses while the policy converges smoothly to the best action.
The lesson in one observation: the baseline does not change where the policy converges — both settings find action C — it changes how violently the gradient estimate shakes on the way there. Lower variance means you can take bigger, more confident steps without the noise throwing you off course. That is why every serious policy-gradient method carries a critic.
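For readers without the interactive widget, the experiment can be approximated offline. The sketch below is an assumption-laden stand-in, not the widget's code: the five mean rewards, the reward noise, the step size and the step count are all invented for illustration, and the mean squared norm of the per-step gradient estimate is used as a rough proxy for the gradient-variance readout.

```python
import math
import random

# Hypothetical hidden mean rewards; action C (index 2) is the best.
TRUE_MEANS = [1.0, 1.5, 2.0, 1.2, 0.8]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_bandit(use_baseline, steps=2000, alpha=0.1, seed=0):
    """One REINFORCE update per step with multiplier (r - b).
    Returns the final policy and the mean squared gradient norm."""
    rng = random.Random(seed)
    theta = [0.0] * len(TRUE_MEANS)    # uniform initial policy
    baseline, sq_norm_sum = 0.0, 0.0
    for t in range(1, steps + 1):
        probs = softmax(theta)
        a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        r = TRUE_MEANS[a] + rng.gauss(0.0, 0.1)   # noisy, always-positive reward
        b = baseline if use_baseline else 0.0
        grad = [(r - b) * ((1.0 if i == a else 0.0) - probs[i])
                for i in range(len(probs))]
        sq_norm_sum += sum(g * g for g in grad)
        theta = [th + alpha * g for th, g in zip(theta, grad)]
        baseline += (r - baseline) / t            # running average reward
    return softmax(theta), sq_norm_sum / steps

probs_off, spread_off = run_bandit(use_baseline=False)
probs_on, spread_on = run_bandit(use_baseline=True)
print("baseline OFF: most probable action", probs_off.index(max(probs_off)),
      "| mean squared gradient norm", round(spread_off, 4))
print("baseline ON:  most probable action", probs_on.index(max(probs_on)),
      "| mean squared gradient norm", round(spread_on, 4))
```

The printed most-probable action lets you compare each run against the claim that both settings end up on action C, and the two norms show how much harder the estimate shakes without the baseline.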
Where this points next
Step back and notice what lessons 02 and 03 have in common: both learn from sampled experience. Q-learning bootstraps from transitions it observed; REINFORCE averages over trajectories it rolled out. They sample because the environment's model — the transition kernel P(s'|s,a) and reward R(s,a) — is unknown. Sampling is how you cope with not knowing the MDP.
But what if you do know it — a board game with explicit rules, a simulator you can query, a learned model of the dynamics? Then you don't have to wait for experience to trickle in. You can plan: sweep the Bellman equations directly (dynamic programming) or search forward through possible futures (Monte-Carlo Tree Search). That is lesson 04, and it turns out AlphaGo is exactly the policy (03) plus the value (02) plus that search, fused.