NLP (Part 2): dialogue systems
Lesson 62 made single-turn generation an RL problem: a sentence is a trajectory of token-actions scored by one terminal reward. Now stretch it across a conversation. A modern chat assistant is multi-turn RL whose reward is human preference — which means the assistant you talk to every day is the lesson 16 RLHF pipeline, scaled. This is the lesson where the theory course meets its sibling systems course.
- RLHF — lesson 16. The whole assistant-alignment recipe: SFT → a Bradley–Terry reward model from human preference pairs → PPO against that model with a KL-to-SFT anchor. Dialogue is what RLHF was built to align. We reuse the reward model, the KL anchor, and the reward-hacking story unchanged.
- The LLM-era chain — lesson 14 & lesson 15. PPO's clipped surrogate, DPO's closed form, GRPO's group baseline — the policy optimizers that move an LLM toward higher reward. Any of them can be the engine here; nothing about being multi-turn changes the optimizer.
- Sequence-level reward — lesson 62. Last lesson's "one trajectory = one generated sentence, one terminal reward" is the single turn. We now chain turns into a longer trajectory.
A conversation is a trajectory of turns
In lesson 62 the trajectory was the tokens of one response: actions a1…aT were tokens, and a single reward landed at the end-of-sequence token. Dialogue adds a layer. Zoom out one level and the turn becomes the unit of action:
| Spine concept | Single-turn (lesson 62) | Multi-turn dialogue |
|---|---|---|
| state st | tokens generated so far | the whole conversation history (all prior turns) |
| action at | the next token | the assistant's whole reply this turn |
| transition | append a token | append the reply, then the user responds — the environment is a person |
| reward rt | one terminal score on the sentence | preference/helpfulness, per turn or at conversation end |
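The trajectory is now τ = (s0, a0, s1, a1, …), where each at is itself a sub-trajectory of tokens. The return is the discounted sum of per-turn rewards, exactly as in lesson 01:

$$G(\tau) = \sum_{t=0}^{T-1} \gamma^{t}\, r_t$$

Here T is the number of turns in the conversation and γ ∈ [0, 1] is the discount factor.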
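The discounted return over turns can be sketched in a few lines of Python. The reward values and the helper name `discounted_return` below are hypothetical, chosen only to contrast a conversation scored once at the end with one scored at every turn.

```python
def discounted_return(turn_rewards, gamma=0.95):
    """Sum of gamma**t * r_t over the turns of one conversation."""
    g = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(turn_rewards):
        g = r + gamma * g
    return g


# Hypothetical four-turn conversation, scored only at the end.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]
# The same conversation with made-up per-turn helpfulness scores.
dense_rewards = [0.1, 0.3, 0.2, 1.0]

sparse_return = discounted_return(sparse_rewards, gamma=0.9)
dense_return = discounted_return(dense_rewards, gamma=0.9)
print(sparse_return, dense_return)
```

With a terminal-only reward, the return says nothing about which turn mattered. That gap is what the credit-assignment machinery has to fill.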
Two things make this genuinely harder than single-turn generation, and both come straight from the foundations.
Multi-turn credit assignment
Here is the surprise unique to dialogue. Suppose a four-turn conversation ends well: the user got their answer and said thanks. Which turn earned the reward? Perhaps the decisive turn was turn 2, where the assistant asked the one clarifying question that made everything else possible; turns 3 and 4 were easy once the user had answered it. A naive scheme that dumps the final reward on the last turn credits the wrong action.
This is the credit-assignment problem from orientation — "assign credit across time when reward is delayed" — now operating at the granularity of turns. The tools are exactly the ones the course already built:
- The value function as a critic. Estimate V(s) = expected future return from this point in the conversation. A turn that raises the value of the conversation gets positive advantage even if its own immediate reward is zero. This is the actor–critic reunion (lesson 06) doing turn-level credit assignment.
- The advantage and GAE. A(s,a) = Q(s,a) − V(s) measures "did this turn make the conversation better than the policy's average turn here?" GAE(λ) from lesson 11 blends short- and long-horizon credit: the same λ knob, now spanning turns instead of token positions.
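A minimal sketch of turn-level GAE, assuming a critic has already produced a value for the conversation state before each turn. The value estimates are invented for illustration: they jump after turn 2, the clarifying question, while the only nonzero reward arrives at turn 4.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over the turns of one conversation.

    rewards[t] is the reward received after turn t.
    values[t] is the critic's V(s_t) before turn t.
    The value after the final turn is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error for turn t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Hypothetical four-turn conversation: reward only at the end.
rewards = [0.0, 0.0, 0.0, 1.0]
# Made-up critic outputs: the outlook improves sharply after turn 2.
values = [0.3, 0.3, 0.9, 0.95]

adv_short = gae(rewards, values, gamma=1.0, lam=0.0)  # one-step credit
adv_long = gae(rewards, values, gamma=1.0, lam=1.0)   # Monte Carlo credit
print(adv_short)
print(adv_long)
```

With λ = 0 the largest advantage lands on turn 2 (index 1), even though its own reward is zero. With λ = 1 each turn is credited with the full return minus its baseline, so credit spreads back to turn 1 as well. Intermediate λ trades off between the two.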
The reward is human preference — so it can be gamed
For dialogue there is no verifier. "Was that reply helpful?" has no unit test. So we are squarely in the RLHF regime of lesson 16: humans compare two replies, a Bradley–Terry reward model learns to score, and PPO optimizes against that proxy with a KL anchor to the SFT model. Every caveat from lesson 16 carries over — and dialogue surfaces two especially vivid failure modes.
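As a reminder of the two pieces reused here, the sketch below shows the Bradley–Terry loss on a single preference pair and one common way to fold a KL anchor into the reward. The function names are hypothetical, and the per-sample penalty `log π − log π_SFT` is an assumed standard form, not necessarily the exact one used in lesson 16.

```python
import math


def bt_preference_prob(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen reply beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the rater's choice for one preference pair."""
    return -math.log(bt_preference_prob(r_chosen, r_rejected))


def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus beta times a per-sample KL estimate.

    Assumed form: rm_score - beta * (log pi_theta(a|s) - log pi_SFT(a|s)).
    """
    return rm_score - beta * (logp_policy - logp_sft)


# Reward model agrees with the rater: small loss.
print(bt_loss(2.0, 0.0))
# Reward model disagrees with the rater: large loss.
print(bt_loss(0.0, 2.0))
# A reply the policy likes far more than the SFT model does is penalized.
print(kl_shaped_reward(1.0, logp_policy=-5.0, logp_sft=-7.0, beta=0.1))
```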
Helpfulness vs verbosity
Human raters reliably show a slight preference for the longer answer, because it looks more thorough. So the reward model learns a small positive coefficient on length. PPO is an optimizer; it finds that coefficient and pushes on it. The result is the assistant that pads every answer with restated questions, bullet-point preambles, and an "I hope this helps!" coda. The proxy reward climbs; the thing you actually wanted, a crisp and correct answer, does not improve. This is Goodhart's law and reward hacking from lesson 16, wearing the specific costume of verbosity.
Sycophancy
The sibling failure: raters prefer replies that agree with them and validate their premises, so the reward model rewards agreement. The optimizer learns to tell users what they want to hear — confirming a wrong belief, caving when challenged, flattering the question. Helpfulness and honesty come apart. Both pathologies share one root: the reward model is a proxy fit to what raters clicked, not to what is true or genuinely useful, and an unconstrained optimizer will exploit the gap.
Interactive · multi-turn helpfulness with a verbosity trap
A toy assistant picks, for each turn, how long to make its reply — a single knob ℓ ∈ [0,1] standing in for "padding." There is a hidden true helpfulness: it rises as the answer gets substantive, peaks at a moderate length, then falls as padding crowds out signal. But the proxy reward model — fit to raters who mildly prefer longer answers — keeps rewarding length past that peak. You set how strongly the proxy over-weights length (the verbosity bias), and the KL anchor β that pulls the policy back toward the concise SFT reference. Run PPO and watch which reward the policy chases.
What you should see. With verbosity bias ≈ 0 the proxy is true helpfulness, and the policy settles right at the true peak — healthy. Crank the bias up and set β low (≈0): the proxy curve now slopes upward forever, so the policy marches ℓ→1 — maximally padded — and the green true-helpfulness reading collapses while the blue proxy reading soars. That divergence is verbosity hacking, the exact shape of reward hacking from lesson 16. Now raise β: the quadratic KL anchor pulls the policy back toward the concise SFT length, and there is a sweet spot where it sits near the true peak again. Push β too high and it never leaves SFT — no learning. Same two-sided failure curve as the RLHF lesson's β, now in dialogue clothing.
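The same behavior can be reproduced offline with a rough sketch. Everything numeric here is an assumption: true helpfulness is taken to be a parabola peaking at ℓ = 0.4, the SFT reference sits at ℓ = 0.3, the proxy adds a linear length bonus, and the KL anchor is the quadratic penalty β(ℓ − ℓ_SFT)². Plain gradient ascent on the single knob ℓ stands in for PPO.

```python
PEAK = 0.4     # assumed length at which true helpfulness peaks
SFT_LEN = 0.3  # assumed length preferred by the concise SFT reference


def true_helpfulness(length):
    """Hidden objective: rises to a peak at PEAK, then falls."""
    return 1.0 - (length - PEAK) ** 2


def proxy_reward(length, bias):
    """Reward model: true helpfulness plus a linear bonus for length."""
    return true_helpfulness(length) + bias * length


def train(bias, beta, steps=200):
    """Gradient ascent on proxy_reward - beta * (length - SFT_LEN)**2."""
    length = SFT_LEN  # the policy starts at the SFT reference
    lr = 0.1 / (1.0 + beta)  # step size shrinks with beta to stay stable
    for _ in range(steps):
        grad = (-2.0 * (length - PEAK)
                + bias
                - 2.0 * beta * (length - SFT_LEN))
        # Keep the knob inside [0, 1].
        length = min(1.0, max(0.0, length + lr * grad))
    return length


for bias, beta in [(0.0, 0.0), (2.0, 0.0), (2.0, 10.0), (2.0, 1000.0)]:
    final = train(bias, beta)
    print(f"bias={bias} beta={beta} length={final:.3f} "
          f"proxy={proxy_reward(final, bias):.3f} "
          f"true={true_helpfulness(final):.3f}")
```

Under these assumptions the four settings cover the cases described above: no bias lands on the true peak, high bias with β = 0 runs to ℓ = 1, a moderate β brings the policy back to the peak, and a very large β leaves it pinned at the SFT length.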
Map back to the spine
Pin dialogue alignment onto the value / policy / model map:
| Spine concept | In a chat assistant |
|---|---|
| MDP / POMDP (L01) | state = conversation history, action = the reply, environment = the human user |
| Trajectory & return | a conversation; return = discounted sum of per-turn preference rewards |
| Policy πθ (policy-based, L03) | the LLM assistant — a policy over reply tokens |
| Value / advantage (L08) | critic over conversation state; turn-level credit assignment via A=Q−V, GAE(λ) |
| Reward signal (L16 RLHF) | Bradley–Terry reward model from human preference pairs, a hackable proxy |
| Trust / KL anchor (L14 PPO) | KL(πθ ‖ πSFT) leashes the policy into the region where the proxy is trustworthy |
| The optimizer (L14/L15) | PPO / DPO / GRPO, unchanged from single-turn; multi-turn only reshapes the reward and the credit assignment |
So a dialogue assistant is actor–critic policy-gradient RL on a multi-turn POMDP, with the reward sourced from human preference (RLHF) and PPO's KL anchor keeping the optimization honest. Every piece is something the course already built: lesson 63 just chains the single-turn generator of lesson 62 across turns and points the reward at helpfulness. That is the core recipe of modern assistant alignment, and it is where the theory course meets the systems course.