
Policy-based — from policy gradient to Actor–Critic

Lesson 02 learned a value and read the policy off it with an argmax. That argmax is a dead end for continuous and huge action spaces, and it can only ever produce a deterministic policy. So stop going through a value: parameterize the policy πθ and push on it directly.

What broke in the value-based branch

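In lesson 02 the policy was never a thing we trained; it was a consequence. We learned Q(s,a) and acted greedily: a = argmax_a Q(s,a). That works beautifully when actions are a short discrete list you can scan. It falls apart in three ways: with continuous actions there is no list to scan, so the argmax becomes an optimization problem of its own; with a huge action set the scan is too expensive to run at every step; and the result is always deterministic, so a task whose best policy is stochastic is out of reach.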

Each problem has the same root: we are computing the policy through a value. Cut out the middleman. Let the policy be a parameterized function πθ(a|s) — a neural net mapping a state to a distribution over actions — and optimize θ so that the policy itself gets better. For discrete actions that distribution is a softmax over logits; for continuous actions it is, say, a Gaussian whose mean and spread the net outputs. Either way there is no argmax to take.
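As a minimal sketch of the two policy heads just described, the JavaScript below implements a softmax head for discrete actions and a Gaussian head for a one-dimensional continuous action. The function names (`softmax`, `sampleDiscrete`, `sampleGaussian`, `gaussianLogProb`) and the injectable `rand` argument are illustrative assumptions, not part of the lesson's code. A real policy would get the logits, mean, and spread from a neural net rather than from literals.

```javascript
// Sketch of two policy heads. All names are illustrative assumptions.

// Numerically stable softmax: subtract the max logit before exponentiating.
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - m));
  const total = exps.reduce((s, e) => s + e, 0);
  return exps.map((e) => e / total);
}

// Discrete head: sample an action index from a categorical distribution.
// `rand` is any function returning a uniform number in [0, 1).
function sampleDiscrete(probs, rand = Math.random) {
  const u = rand();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (u < cumulative) return i;
  }
  return probs.length - 1; // guard against floating-point round-off
}

// Continuous head: sample a ~ N(mean, std^2) with the Box-Muller transform.
function sampleGaussian(mean, std, rand = Math.random) {
  const u1 = 1 - rand(); // lies in (0, 1], so Math.log(u1) is finite
  const u2 = rand();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mean + std * z;
}

// Log-density of a under N(mean, std^2): the log pi whose gradient
// the policy-gradient update needs for a continuous action.
function gaussianLogProb(a, mean, std) {
  const z = (a - mean) / std;
  return -0.5 * z * z - Math.log(std) - 0.5 * Math.log(2 * Math.PI);
}

// Usage with made-up network outputs (logits, mean, std are assumptions).
const probs = softmax([2.0, 1.0, 0.1]);
const discreteAction = sampleDiscrete(probs);
const continuousAction = sampleGaussian(0.3, 0.5);
console.log(probs, discreteAction, continuousAction);
```

In both heads the action is sampled rather than selected, which is what lets the policy be stochastic and removes the argmax.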

The objective: just maximize expected return

We carry over every symbol from lessons 01–02: state s, action a, reward r, discount γ, return G_t = ∑_{k≥0} γ^k r_{t+k}, value V^π(s) = 𝔼[G_t | s_t = s]. The goal is unchanged: find the policy with the highest expected return. Now, though, that expected return is a smooth function of θ that we can climb with gradient ascent:

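$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\, G(\tau) \,\big] = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\, \sum_t \gamma^t r_t \,\Big], \qquad \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)$$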
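To make the return and the objective concrete, here is a small sketch that computes every return-to-go G_t of one episode and then estimates J(θ) as the average of G_0 over a batch of sampled episodes. The helper names `returnsToGo` and `estimateJ` and the reward lists are made up for illustration.

```javascript
// Discounted return-to-go G_t = sum_{k>=0} gamma^k * r_{t+k} for every t,
// computed right-to-left with the recursion G_t = r_t + gamma * G_{t+1}.
function returnsToGo(rewards, gamma) {
  const G = new Array(rewards.length).fill(0);
  let running = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    running = rewards[t] + gamma * running;
    G[t] = running;
  }
  return G;
}

// Monte-Carlo estimate of J(theta): the mean of G_0 over sampled episodes.
// Each entry of episodeRewards is the (non-empty) reward list of one episode.
function estimateJ(episodeRewards, gamma) {
  const g0 = episodeRewards.map((rs) => returnsToGo(rs, gamma)[0]);
  return g0.reduce((s, g) => s + g, 0) / g0.length;
}

console.log(returnsToGo([1, 0, 2], 0.5));            // [1.5, 1, 2]
console.log(estimateJ([[1, 0, 2], [0, 0, 4]], 0.5)); // 1.25
```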

Here τ = (s0,a0,r0,s1,…) is a trajectory the policy rolls out. The catch is identical to the one in the sibling course's first lesson: the distribution we average over depends on the thing we are differentiating, so we cannot simply push the gradient inside the expectation.

The log-derivative trick — the policy-gradient theorem

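One identity unlocks everything. For any distribution pθ and function f, since ∇θ log pθ = ∇θ pθ / pθ:

$$\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}[\, f(a) \,] = \sum_a f(a) \, \nabla_\theta \pi_\theta(a) = \sum_a f(a) \, \pi_\theta(a) \, \nabla_\theta \log \pi_\theta(a) = \mathbb{E}_{a \sim \pi_\theta}[\, f(a) \, \nabla_\theta \log \pi_\theta(a) \,]$$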
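The identity can be checked numerically. The sketch below assumes a softmax policy π = softmax(θ) over three actions, for which ∇θ log π(a) = e_a − π. It evaluates the score-function side exactly by summing over actions, and compares it with a finite-difference gradient of the expectation. The function names and the values of θ and f are arbitrary choices for illustration.

```javascript
function softmax(theta) {
  const m = Math.max(...theta);
  const exps = theta.map((z) => Math.exp(z - m));
  const total = exps.reduce((s, e) => s + e, 0);
  return exps.map((e) => e / total);
}

// E_{a ~ pi}[f(a)] under pi = softmax(theta).
function expectedF(theta, f) {
  return softmax(theta).reduce((s, p, a) => s + p * f[a], 0);
}

// Right-hand side of the identity, summed exactly (no sampling):
// sum_a pi(a) * f(a) * grad log pi(a), with grad_i log pi(a) = [i == a] - pi_i.
function scoreFunctionGradient(theta, f) {
  const pi = softmax(theta);
  const grad = new Array(theta.length).fill(0);
  for (let a = 0; a < theta.length; a++) {
    for (let i = 0; i < theta.length; i++) {
      grad[i] += pi[a] * f[a] * ((i === a ? 1 : 0) - pi[i]);
    }
  }
  return grad;
}

// Left-hand side of the identity by central finite differences.
function finiteDifferenceGradient(theta, f, eps = 1e-5) {
  return theta.map((_, i) => {
    const up = theta.slice();
    const down = theta.slice();
    up[i] += eps;
    down[i] -= eps;
    return (expectedF(up, f) - expectedF(down, f)) / (2 * eps);
  });
}

const theta = [0.3, -1.0, 0.5];
const f = [1, 4, -2];
console.log(scoreFunctionGradient(theta, f));
console.log(finiteDifferenceGradient(theta, f)); // agrees to several decimals
```

In practice the sum over actions is replaced by samples a ~ πθ, which is exactly what makes the right-hand side usable when the left-hand side is not.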

Apply it to a full trajectory. The dynamics P(s'|s,a) do not depend on θ, so when we differentiate log πθ(τ) the transition terms drop out and only the action log-probabilities survive. The result is the policy-gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\, \sum_t G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \,\Big]$$

Read it as an instruction: for each action you took, push its log-probability up in proportion to the return that followed it. Good trajectories make their own actions more likely; bad ones make theirs less likely. No model of the environment, no argmax — just sampling and a log π gradient. This is the entire reason policy-based RL exists.

REINFORCE: the gradient you can actually run

The expectation is over all trajectories, far too many to enumerate. Replace it with a Monte-Carlo average over the trajectories you actually sampled. That estimator is REINFORCE (Williams, 1992):

repeat:
    roll out a trajectory  τ = (s0,a0,r0, s1,a1,r1, ...)   under π_θ
    compute returns        G_t = Σ_{k≥0} γ^k r_{t+k}
    estimate gradient      ĝ = Σ_t G_t · ∇ log π_θ(a_t|s_t)
    ascend                 θ ← θ + α · ĝ

It is unbiased: average enough rollouts and ĝ converges to the true ∇J. It needs no transition model and no value function. It is also, in its raw form, almost unusably noisy.
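The loop above can be made concrete on a toy problem. The sketch below is one possible runnable version under assumptions that are not in the lesson: a three-step episodic task where the state is just the time step, action 1 pays reward 1 and action 0 pays 0, a tabular softmax policy with one row of logits per state, and a small seeded generator (`mulberry32`) so that runs repeat.

```javascript
const HORIZON = 3;   // steps per episode (assumed toy setting)
const GAMMA = 0.9;   // discount
const ALPHA = 0.1;   // step size

// Small seeded PRNG so the run is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - m));
  const total = exps.reduce((s, e) => s + e, 0);
  return exps.map((e) => e / total);
}

// Roll out one trajectory under pi_theta. Toy environment (assumption):
// state = time step, action 1 pays reward 1, action 0 pays reward 0.
function rollout(theta, rand) {
  const steps = [];
  for (let s = 0; s < HORIZON; s++) {
    const pi = softmax(theta[s]);
    const a = rand() < pi[0] ? 0 : 1;
    steps.push({ s, a, r: a === 1 ? 1 : 0 });
  }
  return steps;
}

// One REINFORCE update: theta <- theta + alpha * sum_t G_t * grad log pi(a_t|s_t).
// Each state appears once per episode and has its own logits, so updating
// in place equals computing the whole gradient first and then stepping.
function reinforceUpdate(theta, steps) {
  let G = 0;
  for (let t = steps.length - 1; t >= 0; t--) {
    const { s, a, r } = steps[t];
    G = r + GAMMA * G; // return-to-go G_t
    const pi = softmax(theta[s]);
    for (let i = 0; i < pi.length; i++) {
      theta[s][i] += ALPHA * G * ((i === a ? 1 : 0) - pi[i]);
    }
  }
}

function train(episodes, seed) {
  const rand = mulberry32(seed);
  const theta = Array.from({ length: HORIZON }, () => [0, 0]);
  for (let e = 0; e < episodes; e++) reinforceUpdate(theta, rollout(theta, rand));
  return theta;
}

const learned = train(2000, 1);
// Probability of the rewarding action in each state after training.
console.log(learned.map((row) => softmax(row)[1]));
```

Because every return here is non-negative, each sampled action is pushed up; the rewarding action wins only because it is pushed up more. That is the noise the next section is about.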

Why REINFORCE is high-variance

Look at what scales the gradient: the raw return Gt. Suppose every action in a task earns a return between +90 and +110. REINFORCE pushes every sampled action's log-probability up, by a large amount, because every return is large and positive. The only signal distinguishing a good action from a slightly-less-good one is the tiny gap between +110 and +90 — drowned in a wave of "everything goes up." With finite samples the estimate jerks around wildly; learning is slow and jittery.

The villain of the next several lessons
The policy gradient is unbiased but high-variance. Variance — not bias — is what makes naive policy gradient impractical, and the whole lineage REINFORCE → Actor–Critic → GAE → TRPO → PPO is, at heart, a sequence of better and better variance-reduction tricks. Lesson 07 dissects this estimator rigorously.

The fix that costs nothing: subtract a baseline

Here is the key algebraic fact. Subtract from each return any quantity b(st) that depends on the state but not on the action:

$$\nabla_\theta J(\theta) = \mathbb{E}\Big[\, \sum_t \big( G_t - b(s_t) \big) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \,\Big]$$

This leaves the gradient exactly unchanged in expectation — it does not bias the estimate. The reason is a one-liner: a baseline that does not depend on the action factors out of the action-expectation, and the expected score function is zero,

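$$\mathbb{E}_{a \sim \pi_\theta}\big[\, b(s) \, \nabla_\theta \log \pi_\theta(a \mid s) \,\big] = b(s) \sum_a \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s) = b(s) \, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s) \, \nabla_\theta 1 = 0 .$$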

So we get to subtract any state-baseline for free, and we should choose the one that shrinks variance most. Now actions that beat the baseline get pushed up; actions that fall short get pushed down. Instead of "everything rises a lot," we recenter to "better-than-typical rises, worse-than-typical falls." Same expected gradient, dramatically smaller swings.
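Both claims, same mean and smaller variance, can be verified exactly on a tiny example. The sketch below assumes a one-step softmax bandit with three actions, a uniform policy, and fixed returns of 90, 100, and 110 (the range used earlier). It enumerates every action instead of sampling, so the mean and the total variance (trace of the covariance) of the single-sample estimate are exact. The function names are illustrative.

```javascript
// Single-sample REINFORCE estimate for a softmax policy when action a is drawn:
// g(a) = (R[a] - b) * (e_a - pi), since grad log pi(a) = e_a - pi.
function gradSample(pi, R, b, a) {
  return pi.map((p, i) => (R[a] - b) * ((i === a ? 1 : 0) - p));
}

// Exact mean and total variance of g(a) over a ~ pi, by enumeration.
function gradStats(pi, R, b) {
  const n = pi.length;
  const mean = new Array(n).fill(0);
  for (let a = 0; a < n; a++) {
    const g = gradSample(pi, R, b, a);
    for (let i = 0; i < n; i++) mean[i] += pi[a] * g[i];
  }
  let variance = 0;
  for (let a = 0; a < n; a++) {
    const g = gradSample(pi, R, b, a);
    for (let i = 0; i < n; i++) variance += pi[a] * (g[i] - mean[i]) ** 2;
  }
  return { mean, variance };
}

const pi = [1 / 3, 1 / 3, 1 / 3];
const R = [90, 100, 110];
const expectedReturn = pi.reduce((s, p, a) => s + p * R[a], 0); // about 100

const raw = gradStats(pi, R, 0);                   // no baseline
const centered = gradStats(pi, R, expectedReturn); // baseline = expected return

console.log(raw.mean, centered.mean);         // same vector, about [-3.33, 0, 3.33]
console.log(raw.variance, centered.variance); // about 6688.9 versus 22.2
```

The expected gradient is identical in both cases, while the variance drops by a factor of roughly 300 once the returns are centered.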

The best baseline is the value function — and the branches reconverge

Which b(s)? The near-optimal choice is the expected return from that state — which is exactly the state value Vπ(s), the very object the value-based branch in lesson 02 spent all its effort learning. Plug it in and the multiplier Gt − V(st) becomes the advantage:

$$A(s,a) = Q(s,a) - V(s) \approx G_t - V(s_t), \qquad \nabla_\theta J(\theta) = \mathbb{E}\Big[\, \sum_t A(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \,\Big]$$

The advantage answers exactly the right question: was this action better or worse than what the state typically yields? Positive advantage → make it more likely; negative → less likely; average action → no update.
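As a small illustration of the estimate A ≈ G_t − V(s_t), the sketch below turns one episode's rewards and a critic's value predictions into per-step advantages. The `values` array stands in for the outputs of a learned V; the numbers are invented for the example.

```javascript
// Advantage estimates A_t = G_t - V(s_t) for one episode.
// rewards[t] is r_t; values[t] is the critic's prediction V(s_t) (assumed given).
function advantages(rewards, values, gamma) {
  const A = new Array(rewards.length);
  let G = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    G = rewards[t] + gamma * G; // return-to-go G_t
    A[t] = G - values[t];       // better (> 0) or worse (< 0) than expected
  }
  return A;
}

console.log(advantages([1, 0, 2], [1.0, 1.0, 1.0], 0.5)); // [0.5, 0, 1]
```

Step 1 has advantage 0: the return matched the critic's prediction, so that action's probability is left alone.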

But where does V(s) come from? We learn it — with the TD methods from lesson 02 — using a second network Vφ with its own parameters φ. Now there are two learners running together:

[Figure: the ACTOR πθ (the policy; chooses actions, updated by ∇log π · A, lesson 03) and the CRITIC Vφ (the value; scores states, learned by TD, lesson 02), linked by states / actions and the advantage A = G − V. Caption: The critic's value is the actor's baseline. The fork from orientation closes here.]
The reconvergence
This is the moment the two branches from the orientation rejoin. The actor πθ is the policy-based branch (this lesson). The critic Vφ is the value-based branch (lesson 02). Actor–Critic = a policy that learns, with a value function that critiques it to keep the update low-variance. Almost every modern algorithm (A3C, TRPO, PPO, SAC, and the PPO used for LLM fine-tuning) is a flavor of this single structure; GRPO keeps the same advantage-weighted update but replaces the learned critic with a group-average reward baseline.
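A minimal tabular sketch of the two learners running together follows. It rests on assumptions that are not in the lesson: the toy task (three time-step states, action 1 pays 1, action 0 pays 0), tables in place of the networks πθ and Vφ, the step sizes, and the seeded generator. The actor uses the advantage G_t − V(s_t); the critic is trained with a TD(0) update.

```javascript
const HORIZON = 3;        // steps per episode (assumed toy setting)
const GAMMA = 0.9;        // discount
const ALPHA_ACTOR = 0.1;  // actor step size
const ALPHA_CRITIC = 0.1; // critic step size

// Small seeded PRNG so the run is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - m));
  const total = exps.reduce((s, e) => s + e, 0);
  return exps.map((e) => e / total);
}

// One episode of actor-critic on the toy task.
function runEpisode(theta, V, rand) {
  // Actor acts: state = time step, reward equals the action (0 or 1).
  const steps = [];
  for (let s = 0; s < HORIZON; s++) {
    const a = rand() < softmax(theta[s])[0] ? 0 : 1;
    steps.push({ s, a, r: a });
  }
  // Monte-Carlo returns-to-go G_t.
  const G = new Array(steps.length);
  let running = 0;
  for (let t = steps.length - 1; t >= 0; t--) {
    running = steps[t].r + GAMMA * running;
    G[t] = running;
  }
  for (let t = 0; t < steps.length; t++) {
    const { s, a, r } = steps[t];
    // Actor: theta += alpha * A * grad log pi, with A = G_t - V(s_t).
    const advantage = G[t] - V[s];
    const pi = softmax(theta[s]);
    for (let i = 0; i < pi.length; i++) {
      theta[s][i] += ALPHA_ACTOR * advantage * ((i === a ? 1 : 0) - pi[i]);
    }
    // Critic: TD(0) step toward r + gamma * V(s'), with V(terminal) = 0.
    const nextValue = t + 1 < steps.length ? V[steps[t + 1].s] : 0;
    V[s] += ALPHA_CRITIC * (r + GAMMA * nextValue - V[s]);
  }
}

function train(episodes, seed) {
  const rand = mulberry32(seed);
  const theta = Array.from({ length: HORIZON }, () => [0, 0]);
  const V = new Array(HORIZON).fill(0);
  for (let e = 0; e < episodes; e++) runEpisode(theta, V, rand);
  return { theta, V };
}

const result = train(2000, 7);
console.log(result.theta.map((row) => softmax(row)[1])); // P(rewarding action) per state
console.log(result.V);                                   // learned state values
```

The actor never sees a raw return on its own: every update is scaled by how much the outcome beat or missed the critic's prediction.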

Interactive · the baseline collapses the gradient variance

Five actions, hidden true rewards, policy πθ = softmax(θ) starting uniform — the same toy as lesson 01, now extended to show the baseline. Each step samples one action and applies one REINFORCE update with multiplier (r − b). With the baseline OFF (b=0) every reward is positive, so every action gets shoved up and the per-step gradient estimate thrashes — watch the gradient-variance KPI stay high and the curve stay jagged. Toggle the baseline ON (b = running average reward, a learned V stand-in) and the variance collapses while the policy converges smoothly to the best action.

REINFORCE on 5 actions — baseline OFF vs ON
Top: the policy πθ over actions (green = best true reward, orange = last sampled). Bottom: the running variance of the gradient estimate. Run with baseline OFF first — the bug is the lesson — then turn it ON.
[Interactive readouts: Steps, Baseline b, π(best action), Grad variance]
Core JS of the demo:
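// Hidden true rewards; action C (index 2) is best. π = softmax(θ).
const rewards = [0.55, 0.62, 0.90, 0.58, 0.50];
let theta = [0,0,0,0,0], bAvg = 0, useBaseline = false;

// Helper not shown in the original excerpt; a standard stable softmax is assumed.
function softmax(z){
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const sum = e.reduce((s, v) => s + v, 0);
  return e.map(v => v / sum);
}

// Helper not shown in the original excerpt; draws an index from distribution p.
function sample(p){
  let u = Math.random();
  for (let i = 0; i < p.length; i++){ u -= p[i]; if (u < 0) return i; }
  return p.length - 1;                   // round-off guard
}

// One REINFORCE step; returns ‖ĝ‖² for the variance readout.
function step(alpha){
  const pi = softmax(theta);
  const a  = sample(pi);                 // a ~ π_θ
  const r  = rewards[a];                 // observe reward
  const b  = useBaseline ? bAvg : 0;     // baseline (V stand-in) or 0
  const adv = r - b;                     // advantage estimate
  let gradNormSq = 0;
  for (let i = 0; i < theta.length; i++){
    const g = ((i===a?1:0) - pi[i]);     // ∇ log π_a = e_a − π
    const upd = adv * g;                 // (r − b) · ∇log π
    theta[i] += alpha * upd;
    gradNormSq += upd * upd;             // track ‖ĝ‖² for the variance KPI
  }
  // Running mean ≈ V; the first reward seeds it (0 marks "not yet set").
  bAvg = bAvg===0 ? r : 0.95*bAvg + 0.05*r;
  return gradNormSq;                     // fed into a running variance
}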

The lesson in one observation: the baseline does not change where the policy converges (both settings find action C); it changes how violently the gradient estimate shakes on the way there. Lower variance means you can take bigger, more confident steps without the noise throwing you off course. That is why nearly every serious policy-gradient method carries a baseline, usually a learned critic.

Where this points next

Step back and notice what lessons 02 and 03 have in common: both learn from sampled experience. Q-learning bootstraps from transitions it observed; REINFORCE averages over trajectories it rolled out. They sample because the environment's model — the transition kernel P(s'|s,a) and reward R(s,a) — is unknown. Sampling is how you cope with not knowing the MDP.

But what if you do know it — a board game with explicit rules, a simulator you can query, a learned model of the dynamics? Then you don't have to wait for experience to trickle in. You can plan: sweep the Bellman equations directly (dynamic programming) or search forward through possible futures (Monte-Carlo Tree Search). That is lesson 04, and it turns out AlphaGo is exactly the policy (03) plus the value (02) plus that search, fused.

Takeaway
When the argmax over actions breaks — continuous, huge, or genuinely stochastic — parameterize the policy πθ and ascend ∇J = 𝔼[G · ∇log πθ] (REINFORCE). Its raw form is unbiased but high-variance; subtract a state baseline b(s) — free, because it does not bias the gradient — and the best baseline is the learned value V(s), turning the multiplier into the advantage A = G − V. That fuses the policy-based actor with the value-based critic: Actor–Critic is where the fork reconverges.