
NLP (Part 1): machine translation

A translator emits one token at a time, and is graded only at the end. That is a sequential token-action MDP with a terminal, non-differentiable reward — which is exactly the shape that the lesson 10 policy gradient was built for. Today we use it to fix a bug that the usual training objective, cross-entropy, bakes in: exposure bias.

What this lesson reuses

Generation IS a sequential token-action MDP

Decode a translation and watch the loop. You hold the source sentence plus everything you have written so far; you choose the next word; you append it; repeat until you emit the end-of-sentence token. That is the agent–environment loop from the orientation, term for term:

| MDP (lesson 01) | Text generation (translation) | Symbol |
| --- | --- | --- |
| state $s_t$ | source sentence + tokens generated so far | $s_t = (x, y_{<t})$ |
| action $a_t$ | the next token to emit (one word from the vocabulary) | $a_t = y_t \in V$ |
| policy $\pi_\theta(a \mid s)$ | the model's next-token distribution | $\pi_\theta(y_t \mid x, y_{<t})$ |
| transition | deterministic: append the chosen token | $s_{t+1} = s_t \oplus a_t$ |
| reward $r$ | a sequence-level score, e.g. BLEU vs the reference | 0 until the end, then $r = \mathrm{BLEU}(y, y^*)$ |

One feature dominates everything below: the reward is terminal. You learn nothing per token. Only after the full sentence $y = (y_1, \dots, y_T)$ is complete can the metric look at it and return a single scalar. So the return attached to every token is the same end-of-episode number. This is the long-horizon, sparse-reward credit-assignment problem the spine kept warning about, in its purest form.
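The loop and its terminal reward can be sketched in a few lines. Everything here is an assumption made for illustration: `toyPolicy` is a hand-written stand-in for the model's next-token distribution, `terminalReward` is a simple position-match score rather than BLEU, and decoding is greedy so the run is deterministic.

```javascript
// Toy vocabulary (hypothetical); index 0 is the end-of-sentence token.
const VOCAB = ["<eos>", "the", "cat", "sleeps"];
const EOS = 0;

// Stand-in for pi_theta(y_t | x, y_<t): returns probabilities over VOCAB.
// It ignores x and simply prefers "the", "cat", "sleeps", "<eos>" in order.
function toyPolicy(x, prefix) {
  const target = [1, 2, 3, EOS][Math.min(prefix.length, 3)];
  return VOCAB.map((_, a) => (a === target ? 0.85 : 0.05));
}

function argmax(p) { return p.indexOf(Math.max(...p)); }

// Placeholder sequence score (NOT BLEU): fraction of positions matching the reference.
function terminalReward(y, ref) {
  const n = Math.max(y.length, ref.length);
  let hits = 0;
  for (let t = 0; t < Math.min(y.length, ref.length); t++) if (y[t] === ref[t]) hits++;
  return n === 0 ? 0 : hits / n;
}

function rollout(x, ref, policy, maxLen = 10) {
  const prefix = [];   // y_<t; the state is the pair (x, prefix)
  const rewards = [];
  for (let t = 0; t < maxLen; t++) {
    const a = argmax(policy(x, prefix));      // action a_t = y_t
    prefix.push(a);                           // transition: s_{t+1} = s_t (+) a_t
    const done = a === EOS || t === maxLen - 1;
    rewards.push(done ? terminalReward(prefix, ref) : 0);  // 0 until the end
    if (done) break;
  }
  // Undiscounted return-to-go: the same terminal number at every step.
  const returns = rewards.map((_, t) => rewards.slice(t).reduce((s, r) => s + r, 0));
  return { y: prefix, rewards, returns };
}

const episode = rollout("le chat dort", [1, 2, 3, EOS], toyPolicy);
console.log(episode.y.map(i => VOCAB[i]).join(" "));
console.log("rewards:", episode.rewards, "returns:", episode.returns);
```

Every per-step reward except the last is zero, while the return-to-go is identical at every position, which is the credit-assignment problem in miniature.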

The metric you are graded on is not the loss you train on

How is a translation model actually trained? Almost always by teacher-forced cross-entropy: maximize the log-probability of the human reference token, position by position, while feeding the ground-truth prefix at every step.

$$L_{CE}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta\big( y^*_t \mid x,\, y^*_{<t} \big)$$
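As a rough sketch, the loss above is a loop over reference positions in which the model is always fed the gold prefix. The `toyPolicy` function and the token ids are hypothetical stand-ins, not a trained translator.

```javascript
const EOS = 0;

// Stand-in for pi_theta(. | x, prefix) over a 4-token vocabulary (an assumption).
function toyPolicy(x, prefix) {
  const target = [1, 2, 3, EOS][Math.min(prefix.length, 3)];
  return [0, 1, 2, 3].map(a => (a === target ? 0.85 : 0.05));
}

// Teacher-forced cross-entropy: -sum_t log pi(y*_t | x, y*_<t).
function teacherForcedCE(x, ref, policy) {
  let loss = 0;
  for (let t = 0; t < ref.length; t++) {
    const goldPrefix = ref.slice(0, t);   // y*_<t: the TRUE prefix, never the model's own
    const p = policy(x, goldPrefix);
    loss -= Math.log(p[ref[t]]);          // log-probability of the reference token
  }
  return loss;
}

const ref = [1, 2, 3, EOS];
const loss = teacherForcedCE("le chat dort", ref, toyPolicy);
console.log("teacher-forced CE:", loss);
```

Note that the model's own outputs never enter the computation: the prefix comes from `ref.slice(0, t)` at every step.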

Note the starred prefix $y^*_{<t}$: at training time, to predict token t the model conditions on the true first t−1 tokens, never on its own. This is supervised, instructive learning: there is one correct token at each position and the gradient is certain (the orientation's "instructive, not evaluative" feedback). It is fast and stable. It also has two cracks that policy gradient was made to fill: the loss is a per-token likelihood proxy rather than the sequence metric you are graded on, and that metric, BLEU, is non-differentiable, so it cannot simply be swapped in as the loss.

Exposure bias: the train/test mismatch the loss can't see

The deeper crack is the starred prefix. During training the model only ever sees correct history. At test time there is no reference to feed — the model must condition on the tokens it just generated, some of which will be wrong. It is now in a state it never visited during training, makes a slightly worse next prediction, which makes the next state worse still, and the errors compound down the sentence. This is exposure bias.

  TRAIN (teacher forcing)            TEST (free-running)
  ----------------------             -------------------
  predict y_t  |  given y*_<t         predict y_t  |  given ŷ_<t   (its OWN tokens)
               |  always correct                  |  may contain earlier mistakes
               ▼                                   ▼
  never sees its own mistakes        one slip → off the training manifold
                                     → worse next state → errors compound

If this feels familiar, it should: it is precisely the covariate shift / compounding-error failure of behavioral cloning from lesson 17. Teacher-forced cross-entropy is behavioral cloning of the reference, and it suffers the same distribution mismatch — the model is trained on the expert's state distribution but tested on its own.

Why the loss can never warn you
Cross-entropy is computed on the ground-truth prefix, so it measures the model in states it will never actually occupy at test time. Training loss can fall happily while free-running quality stalls or degrades — the metric that matters is computed on a different state distribution than the one being optimized. The objective is structurally blind to its own deployment condition.
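A hand-built illustration of the gap, under loud assumptions: the "model" below is a fixed greedy next-token table keyed only by the previous token, constructed by hand rather than trained. It is correct whenever the previous token is a reference token, wrong at the first position, and arbitrary (it repeats itself) after a token it would never have seen as a prefix in training.

```javascript
const VOCAB = ["<eos>", "the", "a", "cat", "sleeps"];
const START = -1;
const ref = [1, 3, 4, 0];   // "the cat sleeps <eos>"

// Hypothetical greedy predictions: previous token -> next token.
const next = new Map([
  [START, 2],  // the one slip: says "a" instead of "the"
  [1, 3],      // after "the"    -> "cat"     (seen in training)
  [3, 4],      // after "cat"    -> "sleeps"  (seen in training)
  [4, 0],      // after "sleeps" -> "<eos>"   (seen in training)
  [2, 2],      // after "a"      -> "a"       (never seen as a prefix: arbitrary)
  [0, 0],
]);

// Teacher forcing: every prediction conditions on the reference prefix.
function teacherForced(ref) {
  return ref.map((_, t) => next.get(t === 0 ? START : ref[t - 1]));
}

// Free running: every prediction conditions on the model's own previous output.
function freeRunning(len) {
  const y = [];
  for (let t = 0; t < len; t++) y.push(next.get(t === 0 ? START : y[t - 1]));
  return y;
}

function accuracy(y, ref) {
  return y.filter((tok, t) => tok === ref[t]).length / ref.length;
}

const tf = teacherForced(ref);
const fr = freeRunning(ref.length);
console.log("teacher-forced:", tf.map(i => VOCAB[i]).join(" "), accuracy(tf, ref));
console.log("free-running:  ", fr.map(i => VOCAB[i]).join(" "), accuracy(fr, ref));
```

Under teacher forcing the single slip costs one position; free-running, the same slip puts the model in an unseen state and every later position is lost as well.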

The fix: REINFORCE directly on the sequence reward

Both cracks close at once if we stop imitating tokens and instead optimize the sequence-level reward of sequences the model itself generates. Define the objective as expected reward over the model's own samples — the lesson 10 objective, with the model as policy and the sentence as the trajectory:

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[\, r(y, y^*) \,\big]$$

We cannot differentiate r through y — but we do not need to. The log-derivative trick of lesson 10 moves the gradient onto the log-probability, which is differentiable, and leaves the reward as a plain scalar weight:

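$$\nabla_\theta J = \mathbb{E}_{y \sim \pi_\theta} \big[\, r(y, y^*) \cdot \nabla_\theta \log \pi_\theta(y \mid x) \,\big] \;=\; \mathbb{E} \Big[\, r(y) \cdot \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \Big]$$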

This is REINFORCE. Sample a full translation from the model, score it with BLEU, then push up the log-probability of every token in that sentence in proportion to its score. Two cracks sealed: the objective is now BLEU itself (not a likelihood proxy), and BLEU never had to be differentiated (it is just a number multiplying the gradient of the log-probability). And crucially, the model is now trained on its own samples, the very state distribution it faces at test time, so exposure bias has nowhere to hide.
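For readers who want the log-derivative step spelled out, here is a short derivation sketch. It assumes the set of candidate sentences is finite, so the expectation is a sum, and that the reward $r$ does not depend on $\theta$.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \sum_{y} \pi_\theta(y \mid x)\, r(y, y^*) \\
  &= \sum_{y} r(y, y^*)\, \nabla_\theta \pi_\theta(y \mid x) \\
  &= \sum_{y} r(y, y^*)\, \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x)
     && \text{since } \nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta \\
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[\, r(y, y^*)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big], \\[4pt]
\log \pi_\theta(y \mid x)
  &= \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})
     && \text{chain rule over tokens.}
\end{aligned}
```

The reward only ever appears as a multiplicative weight, which is why its non-differentiability does not matter.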

Self-critical sequence training = REINFORCE + the lesson 10 baseline

Raw REINFORCE on sentences is brutally high-variance, the villain of lesson 10: one scalar reward weights the gradient of a long sum of token log-probabilities. The same fix applies: subtract a baseline b that may depend on the state (here, the source x) but not on the sampled tokens, so it does not bias the gradient:

$$\nabla_\theta J = \mathbb{E} \big[\, ( r(y) - b ) \cdot \nabla_\theta \log \pi_\theta(y \mid x) \,\big]$$

Self-critical sequence training picks a baseline that needs no extra network: run the model's own greedy decode ŷ and use its score b = r(ŷ). A sampled sentence is reinforced only if it beats what the model would have said greedily — better-than-the-model's-best samples rise, worse ones fall, centered on zero, exactly the advantage A = r(y) − r(ŷ) of lesson 10. (And yes — this is the same group/self baseline idea that GRPO and RLHF reuse at LLM scale in lessons 15–16.)
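Why subtracting the greedy score leaves the gradient unbiased can be checked in a few lines. The sketch assumes $b = r(\hat{y})$ is treated as a constant for the given source $x$ (no gradient flows through the greedy decode) and that it does not depend on the sampled sentence $y$.

```latex
\begin{aligned}
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[\, b\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big]
  &= b \sum_{y} \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x) \\
  &= b \sum_{y} \nabla_\theta \pi_\theta(y \mid x) \\
  &= b\, \nabla_\theta \sum_{y} \pi_\theta(y \mid x)
   = b\, \nabla_\theta 1 = 0 .
\end{aligned}
```

So the baseline term vanishes in expectation, and only the variance of the estimate changes.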

Interactive · cross-entropy vs sequence-reward — watch exposure bias

A toy translator: source tokens map to a short reference sentence over a tiny vocabulary. Two policies train on the same data. Cross-entropy learns teacher-forced (always conditioned on the true prefix). Sequence-reward learns by REINFORCE on a BLEU-like sequence score of its own samples, with a self-critical baseline. Both are then judged the only way that matters — free-running: decode from scratch, each token conditioned on the model's own previous tokens. The slider blends the two training signals. Slide it all the way to pure cross-entropy and watch the free-running score sit below its own teacher-forced score: that gap is exposure bias.

Sequence generator: token cross-entropy ⇄ sequence reward
Top-left: per-position next-token distributions (free-running decode); the gold cell is the reference token at that position. Top-right: three scores over training — teacher-forced score (grey), free-running score (blue), and the reward-optimizing policy's free-running score (green). The gap between grey and blue for the cross-entropy model is exposure bias. Set reward weight to 0 (pure cross-entropy) to make the bug appear.
The JS that runs this widget (≈28 lines):
// θ[t][a] are logits for the token at position t; π = softmax over vocab.
// CROSS-ENTROPY: teacher-forced — push up the reference token, given true prefix.
function ceUpdate(t, lr){
  const p = softmax(theta[t]);
  for (let a = 0; a < V; a++) theta[t][a] += lr * ((a === ref[t] ? 1 : 0) - p[a]);
}
// SEQUENCE REWARD: sample a full sentence from the model's OWN tokens (free-run),
// score it, subtract the greedy decode's score (self-critical baseline), REINFORCE.
function rlUpdate(lr){
  const y = []; for (let t = 0; t < T; t++) y.push(sample(softmax(theta[t])));
  const g = []; for (let t = 0; t < T; t++) g.push(argmax(theta[t]));
  const adv = bleu(y) - bleu(g);                       // r(y) - r(ŷ): the baseline
  for (let t = 0; t < T; t++){
    const p = softmax(theta[t]);
    for (let a = 0; a < V; a++)                          // ∇log π = e_a - π
      theta[t][a] += lr * adv * ((a === y[t] ? 1 : 0) - p[a]);
  }
}
// Each train step mixes them by the reward-weight knob w:
//   w=0 → pure cross-entropy (exposure bias);  w=1 → pure sequence reward.
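The excerpt above calls helpers and globals it does not show. A minimal runnable sketch of what they might look like follows; none of it is taken from the widget's actual source. In particular, `bleu` here is a position-match placeholder rather than real BLEU, the sizes `V`, `T` and the ids in `ref` are made up, and the mixing rule in `trainStep` (splitting the learning rate by `w`) is an assumption about how the knob could work.

```javascript
const V = 4, T = 3;        // toy vocabulary size and sentence length (assumed)
const ref = [1, 2, 3];     // hypothetical reference token ids
const theta = Array.from({ length: T }, () => new Array(V).fill(0));  // logits

function softmax(z) {
  const m = Math.max(...z);                   // subtract the max for stability
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}
function argmax(z) { return z.indexOf(Math.max(...z)); }
function sample(p) {                          // draw an index with probability p[a]
  let u = Math.random();
  for (let a = 0; a < p.length; a++) { u -= p[a]; if (u < 0) return a; }
  return p.length - 1;                        // guard against rounding error
}
// Placeholder score (NOT BLEU): fraction of positions that match ref.
function bleu(y) { return y.filter((tok, t) => tok === ref[t]).length / T; }

function ceUpdate(t, lr) {                    // teacher-forced step, as in the excerpt
  const p = softmax(theta[t]);
  for (let a = 0; a < V; a++) theta[t][a] += lr * ((a === ref[t] ? 1 : 0) - p[a]);
}
function rlUpdate(lr) {                       // self-critical REINFORCE step, as in the excerpt
  const y = [], g = [];
  for (let t = 0; t < T; t++) { y.push(sample(softmax(theta[t]))); g.push(argmax(theta[t])); }
  const adv = bleu(y) - bleu(g);              // r(y) - r(greedy)
  for (let t = 0; t < T; t++) {
    const p = softmax(theta[t]);
    for (let a = 0; a < V; a++) theta[t][a] += lr * adv * ((a === y[t] ? 1 : 0) - p[a]);
  }
}
// Assumed mixing rule: w = 0 is pure cross-entropy, w = 1 is pure sequence reward.
function trainStep(w, lr) {
  for (let t = 0; t < T; t++) ceUpdate(t, (1 - w) * lr);
  rlUpdate(w * lr);
}

for (let i = 0; i < 200; i++) trainStep(0.5, 0.5);
console.log("greedy decode:", theta.map(argmax), "score:", bleu(theta.map(argmax)));
```

Because the logits in this sketch are indexed by position only, it reproduces the update rules but not the prefix dependence that lets exposure bias show up in the widget.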

What you should see. At reward weight = 0 (pure cross-entropy) the teacher-forced score (grey) climbs fast toward perfect — the loss is happy. But the free-running score (blue) lags well below it and plateaus: when the model must condition on its own tokens, an early sampling slip drops it into a position it was never trained on, and the rest of the sentence drifts. That stubborn grey–blue gap is exposure bias, and the loss cannot see it. Now drag the knob toward 1: the green curve — the policy trained on its own samples with the self-critical baseline — pulls its free-running score up to match teacher-forcing, because it was optimized in exactly the state distribution it is tested in. The pure-cross-entropy setting is the bug; the sequence-reward setting is the lesson.

Map back to the spine

Pin machine translation onto the value / policy / model map:

| Spine concept | In machine translation |
| --- | --- |
| MDP (lesson 01) | sequential token decoding: state = source + prefix, deterministic append transition |
| Policy $\pi_\theta(a \mid s)$ | the model's next-token distribution $\pi_\theta(y_t \mid x, y_{<t})$ |
| Reward $r$ | a terminal, non-differentiable sequence score (BLEU) at end-of-sentence |
| Policy gradient (lesson 10) | REINFORCE: $\nabla J = \mathbb{E}[\, r(y) \cdot \nabla \log \pi_\theta(y) \,]$ |
| Baseline → advantage (lesson 10) | self-critical: $A = r(y) - r(\hat{y}_{\text{greedy}})$ |
| The domain's own bug | exposure bias = covariate shift (lesson 17) of teacher-forced cross-entropy |
| Forward to RLHF (lesson 16) | swap BLEU for a learned human-preference reward; same machinery |

So this is pure policy-gradient RL with a terminal, non-differentiable reward. Value-based methods are awkward here (the action space is the whole vocabulary at every step), so we parameterize the policy directly and nudge it with REINFORCE — the policy branch of the fork, doing exactly what lesson 10 proved it could. The only thing standing between this lesson and full RLHF is the source of the reward: BLEU is a cheap, checkable, programmatic metric; a human-preference reward model is the same scalar with no closed form. Lesson 63 takes the next step — multi-turn dialogue, where the reward is even softer and the horizon longer.

Takeaway
Machine translation is a sequential token-action MDP whose reward (BLEU) is terminal and non-differentiable. Training with teacher-forced cross-entropy optimizes per-token likelihood — not the metric — and conditions on ground-truth prefixes the model never sees at test time, so it suffers exposure bias: feed it its own tokens and an early slip compounds, exactly the covariate shift of behavioral cloning. The fix is the lesson 10 policy gradient applied to the sequence reward, ∇J = 𝔼[(r(y) − b)·∇log πθ(y)], with a self-critical baseline — REINFORCE trained on the model's own samples — and that very same "optimize a non-differentiable sequence reward" move, with BLEU replaced by a human-preference model, is RLHF (lesson 16).