NLP (Part 1): machine translation
A translator emits one token at a time, and is graded only at the end. That is a sequential token-action MDP with a terminal, non-differentiable reward — which is exactly the shape that the lesson 10 policy gradient was built for. Today we use it to fix a bug that the usual training objective, cross-entropy, bakes in: exposure bias.
- Policy gradient — lesson 10. The proven estimator ∇J = 𝔼[ G · ∇log πθ ], the reward-to-go and the baseline. Here the "policy" is a language model, the "action" is a token, and the return is a single number handed back after the whole sentence is written. Self-critical sequence training is REINFORCE with the baseline trick of lesson 10.
- Foreshadows RLHF — lesson 16. "Optimize a sequence-level reward you cannot differentiate" is the exact move RLHF makes — only there the reward is a learned human-preference model instead of BLEU. Get the mechanism here on a clean, checkable metric; meet the human-sourced version in lesson 16.
Generation IS a sequential token-action MDP
Decode a translation and watch the loop. You hold the source sentence plus everything you have written so far; you choose the next word; you append it; repeat until you emit the end-of-sentence token. That is the agent–environment loop from the orientation, term for term:
| MDP (lesson 01) | Text generation (translation) | Symbol |
|---|---|---|
| state s_t | source sentence + tokens generated so far | s_t = (x, y_<t) |
| action a_t | the next token to emit (one word from the vocabulary) | a_t = y_t ∈ V |
| policy πθ(a \| s) | the model's next-token distribution | πθ(y_t \| x, y_<t) |
| transition | deterministic: append the chosen token | s_{t+1} = s_t ⊕ a_t |
| reward r | a sequence-level score, e.g. BLEU vs the reference | 0 until the end, then r = BLEU(y, y*) |
One feature dominates everything below: the reward is terminal. You learn nothing per token. Only after the full sentence y = (y_1 … y_T) is complete can the metric look at it and return a single scalar. So the return attached to every token is the same end-of-episode number. This is the long-horizon, sparse-reward credit-assignment problem the spine kept warning about, in its purest form.
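The loop and its terminal reward can be sketched in a few lines. This is a minimal illustration, not a real translator: `VOCAB`, `toy_policy` (a uniform stand-in for πθ) and `toy_reward` (a crude stand-in for BLEU) are all hypothetical names invented for the example.

```python
import random

VOCAB = ["the", "cat", "sat", "<eos>"]  # hypothetical toy vocabulary

def toy_policy(source, prefix):
    """Stand-in for pi_theta(y_t | x, y_<t): uniform over the vocabulary."""
    return [1.0 / len(VOCAB)] * len(VOCAB)

def toy_reward(output, reference):
    """Stand-in for BLEU: fraction of output tokens that occur in the reference."""
    if not output:
        return 0.0
    return sum(tok in reference for tok in output) / len(output)

def rollout(source, reference, max_len=10, seed=0):
    """Run one episode; return the generated tokens and the per-step rewards."""
    rng = random.Random(seed)
    prefix, rewards = [], []          # state s_t = (source, prefix)
    while True:
        probs = toy_policy(source, prefix)
        token = rng.choices(VOCAB, weights=probs)[0]   # action a_t
        if token != "<eos>":
            prefix.append(token)      # deterministic transition: append
        done = token == "<eos>" or len(prefix) >= max_len
        if done:
            # The only non-zero reward arrives once the sentence is finished.
            rewards.append(toy_reward(prefix, reference))
            break
        rewards.append(0.0)           # no per-token feedback
    return prefix, rewards

tokens, rewards = rollout(["le", "chat"], ["the", "cat"])
print(tokens, rewards)
```

Every entry of `rewards` except the last is zero, so the undiscounted return seen by each token is the same final number.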
The metric you are graded on is not the loss you train on
How is a translation model actually trained? Almost always by teacher-forced cross-entropy: maximize the log-probability of the human reference token, position by position, while feeding the ground-truth prefix at every step.
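In the notation of the MDP table, the teacher-forced loss can be written as below. This is a reconstruction from the surrounding definitions, assuming a reference y* of length T; the label L_CE is chosen here for convenience.

```latex
\mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta\!\left(y^*_t \mid x,\, y^*_{<t}\right)
```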
Note the starred prefix y*_<t: at training time, to predict token t the model conditions on the true first t−1 tokens, never on its own. This is supervised, instructive learning: there is one correct token at each position and the gradient is certain (the orientation's "instructive, not evaluative" feedback). It is fast and stable. It also has two cracks that policy gradient was made to fill:
- It optimizes the wrong objective. Per-token likelihood is not BLEU. A translation can put high probability on every reference word yet score poorly as a sentence (wrong word order, a dropped negation), and two outputs with identical cross-entropy can have very different BLEU. You are graded on a sequence-level metric you never directly optimized.
- BLEU is non-differentiable. It counts n-gram overlaps and clips them — there is no ∂BLEU/∂θ to backprop. You cannot simply "make the loss be BLEU" and call gradient descent.
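Both cracks show up in a small sketch of the counting at the heart of BLEU. The code below implements only the clipped (modified) n-gram precision, not full BLEU: it omits the brevity penalty and the geometric mean over n-gram orders. The function names and example sentences are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "the cat is not on the mat".split()
reversed_order = "mat the on not is cat the".split()  # right words, wrong order
repeated = ["the"] * 4                                # gaming the unigram count

print(clipped_precision(reversed_order, reference, 1))  # 1.0
print(clipped_precision(reversed_order, reference, 2))  # 0.0
print(clipped_precision(repeated, reference, 1))        # 0.5
```

The score is built from integer counts and `min` over discrete tokens, so nothing in it carries a gradient back to the model parameters. The reversed sentence also shows how getting every word right can still score zero once word order matters.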
Exposure bias: the train/test mismatch the loss can't see
The deeper crack is the starred prefix. During training the model only ever sees correct history. At test time there is no reference to feed — the model must condition on the tokens it just generated, some of which will be wrong. It is now in a state it never visited during training, makes a slightly worse next prediction, which makes the next state worse still, and the errors compound down the sentence. This is exposure bias.
| TRAIN (teacher forcing) | TEST (free-running) |
|---|---|
| predict y_t given y*_<t | predict y_t given ŷ_<t (its own tokens) |
| prefix is always correct | prefix may contain earlier mistakes |
| never sees its own mistakes | one slip → off the training manifold → worse next state → errors compound |
If this feels familiar, it should: it is precisely the covariate shift / compounding-error failure of behavioral cloning from lesson 17. Teacher-forced cross-entropy is behavioral cloning of the reference, and it suffers the same distribution mismatch — the model is trained on the expert's state distribution but tested on its own.
The fix: REINFORCE directly on the sequence reward
Both cracks close at once if we stop imitating tokens and instead optimize the sequence-level reward of sequences the model itself generates. Define the objective as expected reward over the model's own samples — the lesson 10 objective, with the model as policy and the sentence as the trajectory:
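In the notation of the MDP table, with r(y) = BLEU(y, y*), that objective can be written as follows. This is a reconstruction from the surrounding text; the symbol J follows the ∇J of lesson 10.

```latex
J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, r(y) \,\big],
\qquad r(y) = \mathrm{BLEU}(y, y^*)
```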
We cannot differentiate r through y — but we do not need to. The log-derivative trick of lesson 10 moves the gradient onto the log-probability, which is differentiable, and leaves the reward as a plain scalar weight:
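Spelled out, with the sentence log-probability expanded into its per-token sum, the gradient reads as below. This is a reconstruction consistent with the estimator ∇J = 𝔼[ r(y) · ∇log πθ(y) ] in the recap table, assuming a sampled sentence of length T.

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right) \right]
```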
This is REINFORCE. Sample a full translation from the model, score it with BLEU, then push up the log-probability of every token in that sentence in proportion to its score. Two cracks sealed: the objective is now BLEU itself (not a likelihood proxy), and BLEU never had to be differentiated (it is just a number multiplying the gradient of the log-probability). And crucially, the model is now trained on its own samples, drawn from the very state distribution it faces at test time, so exposure bias has nowhere to hide.
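The claim that the reward never needs a derivative can be checked numerically. The sketch below uses a deliberately tiny setting, a one-token "sentence" drawn from a softmax over three actions, which is a simplification of the sequence case. It compares the REINFORCE expectation, computed exactly by enumerating actions, with a finite-difference gradient of the expected reward. All names and numbers are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    """J(theta) = sum_a pi(a) * r(a)."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def reinforce_gradient(logits, rewards):
    """Exact E[ r(a) * grad log pi(a) ], enumerating every action.
    The rewards are used only as plain numbers."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            # d log pi(a) / d logit_k = 1[a == k] - pi(k) for a softmax policy
            dlogp = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += p_a * r_a * dlogp
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central-difference estimate of dJ/dlogits, as an independent check."""
    grad = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

logits = [0.2, -0.5, 1.0]
rewards = [0.0, 0.3, 1.0]   # arbitrary scores, e.g. from a black-box metric
print(reinforce_gradient(logits, rewards))
print(finite_difference_gradient(logits, rewards))
```

The two printed vectors should agree to within numerical error, even though `reinforce_gradient` never differentiates `rewards`.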
Self-critical sequence training = REINFORCE + the lesson 10 baseline
Raw REINFORCE on sentences is brutally high-variance, the villain of lesson 10: one scalar reward weights a long sum of token log-prob gradients. The same fix applies. Subtract a baseline b that does not depend on the sampled tokens, so it leaves the gradient unbiased:
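With a baseline the estimator takes the form below. The second line sketches why subtracting b leaves the expectation unchanged, under the assumption that b does not depend on the sampled sentence y. Both lines are reconstructed from the surrounding definitions.

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ \big(r(y) - b\big)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]

\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ b\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
= b \sum_{y} \nabla_\theta \pi_\theta(y \mid x)
= b\, \nabla_\theta 1 = 0
```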
Self-critical sequence training picks a baseline that needs no extra network: run the model's own greedy decode ŷ and use its score b = r(ŷ). A sampled sentence is reinforced only if it beats what the model would have said greedily: samples that score above the greedy decode rise, worse ones fall, and the weights sit roughly around zero. This is exactly the advantage A = r(y) − r(ŷ) of lesson 10. (And yes, this is the same group/self baseline idea that GRPO and RLHF reuse at LLM scale in lessons 15–16.)
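A minimal sketch of one self-critical update, under loud simplifying assumptions: the "model" is a table of per-position logits that ignores the prefix (a real translator would condition on it), and `reward` is a position-match score standing in for BLEU. All function names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(logits):
    """Most likely token at each position (first index wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def sample_decode(logits, rng):
    """Draw one token per position from the current policy."""
    return [rng.choices(range(len(row)), weights=softmax(row))[0] for row in logits]

def reward(output, reference):
    """Stand-in for BLEU: fraction of positions matching the reference."""
    return sum(o == r for o, r in zip(output, reference)) / len(reference)

def scst_update(logits, sample, reference, lr=0.5):
    """One self-critical step, in place. Returns A = r(sample) - r(greedy)."""
    advantage = reward(sample, reference) - reward(greedy_decode(logits), reference)
    for row, token in zip(logits, sample):
        probs = softmax(row)
        for k in range(len(row)):
            # grad of log pi(token) w.r.t. logit k is 1[k == token] - pi(k)
            row[k] += lr * advantage * ((1.0 if k == token else 0.0) - probs[k])
    return advantage

def train(reference, vocab_size=4, steps=500, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * vocab_size for _ in reference]
    for _ in range(steps):
        scst_update(logits, sample_decode(logits, rng), reference)
    return logits

reference = [2, 0, 3]                     # token ids of a toy reference
trained = train(reference)
print(greedy_decode(trained), reward(greedy_decode(trained), reference))
```

A sample that ties the greedy decode produces a zero advantage and therefore no update, which is the self-critical behaviour described above: only samples that differ in score from the model's own greedy output move the policy.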
Interactive · cross-entropy vs sequence-reward — watch exposure bias
A toy translator: source tokens map to a short reference sentence over a tiny vocabulary. Two policies train on the same data. Cross-entropy learns teacher-forced (always conditioned on the true prefix). Sequence-reward learns by REINFORCE on a BLEU-like sequence score of its own samples, with a self-critical baseline. Both are then judged the only way that matters — free-running: decode from scratch, each token conditioned on the model's own previous tokens. The slider blends the two training signals. Slide it all the way to pure cross-entropy and watch the free-running score sit below its own teacher-forced score: that gap is exposure bias.
What you should see. At reward weight = 0 (pure cross-entropy) the teacher-forced score (grey) climbs fast toward perfect — the loss is happy. But the free-running score (blue) lags well below it and plateaus: when the model must condition on its own tokens, an early sampling slip drops it into a position it was never trained on, and the rest of the sentence drifts. That stubborn grey–blue gap is exposure bias, and the loss cannot see it. Now drag the knob toward 1: the green curve — the policy trained on its own samples with the self-critical baseline — pulls its free-running score up to match teacher-forcing, because it was optimized in exactly the state distribution it is tested in. The pure-cross-entropy setting is the bug; the sequence-reward setting is the lesson.
Map back to the spine
Pin machine translation onto the value / policy / model map:
| Spine concept | In machine translation |
|---|---|
| MDP (lesson 01) | sequential token decoding — state = source + prefix, deterministic append transition |
| Policy πθ(a \| s) | the model's next-token distribution πθ(y_t \| x, y_<t) |
| Reward r | a terminal, non-differentiable sequence score (BLEU) at end-of-sentence |
| Policy gradient (lesson 10) | REINFORCE: ∇J = 𝔼[ r(y) · ∇log πθ(y) ] |
| Baseline → advantage (lesson 10) | self-critical: A = r(y) − r(ŷ_greedy) |
| The domain's own bug | exposure bias = covariate shift (lesson 17) of teacher-forced cross-entropy |
| Forward to RLHF (lesson 16) | swap BLEU for a learned human-preference reward — same machinery |
So this is pure policy-gradient RL with a terminal, non-differentiable reward. Value-based methods are awkward here (the action space is the whole vocabulary at every step), so we parameterize the policy directly and nudge it with REINFORCE — the policy branch of the fork, doing exactly what lesson 10 proved it could. The only thing standing between this lesson and full RLHF is the source of the reward: BLEU is a cheap, checkable, programmatic metric; a human-preference reward model is the same scalar with no closed form. Lesson 63 takes the next step — multi-turn dialogue, where the reward is even softer and the horizon longer.