
NLP (part 2): dialogue systems

Lesson 62 made single-turn generation an RL problem: a sentence is a trajectory of token-actions scored by one terminal reward. Now stretch it across a conversation. A modern chat assistant is multi-turn RL whose reward is human preference — which means the assistant you talk to every day is the lesson 16 RLHF pipeline, scaled. This is the lesson where the theory course meets its sibling systems course.

What this lesson reuses

A conversation is a trajectory of turns

In lesson 62 the trajectory was the tokens of one response: actions a_1…a_T were tokens, and a single reward landed at the end-of-sequence token. Dialogue adds a layer. Zoom out one level and the turn becomes the unit of action:

Spine concept | Single-turn (lesson 62) | Multi-turn dialogue
state s_t | tokens generated so far | the whole conversation history (all prior turns)
action a_t | the next token | the assistant's whole reply this turn
transition | append a token | append the reply, then the user responds (the environment is a person)
reward r_t | one terminal score on the sentence | preference/helpfulness, per turn or at conversation end
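The multi-turn column can be sketched as a single transition function. This is a minimal illustration only: `dialogueStep`, `scriptedUser`, and the `{role, text}` turn shape are hypothetical names, and the scripted user is a stand-in for a person who, unlike this function, is neither fixed nor fully observable.

```javascript
// Hypothetical sketch of one multi-turn transition (all names are illustrative).
// state  = the whole conversation history, an array of {role, text} turns
// action = the assistant's whole reply for this turn
function dialogueStep(history, reply, userPolicy) {
  // Append the assistant's reply (the action)...
  const afterReply = [...history, { role: "assistant", text: reply }];
  // ...then the "environment" responds. Here it is a stand-in function.
  const userText = userPolicy(afterReply);
  return [...afterReply, { role: "user", text: userText }];
}

// Scripted stand-in for the user; a real user is not a fixed rule.
const scriptedUser = (history) =>
  history.length < 4 ? "Python." : "Thanks, that answers it.";

let state = [{ role: "user", text: "How do I reverse a list?" }];
state = dialogueStep(state, "Which language are you using?", scriptedUser);
state = dialogueStep(state, "In Python: items[::-1].", scriptedUser);

for (const turn of state) console.log(`${turn.role}: ${turn.text}`);
```

Each call grows the state by two turns: the assistant's action and the user's reaction to it.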

The trajectory is now τ = (s_0, a_0, s_1, a_1, …) where each a_t is itself a sub-trajectory of tokens. The return is the discounted sum of per-turn rewards, exactly as in lesson 01:

$$G_0 = \sum_t \gamma^t r_t = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$$
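As a quick numeric check of the return formula, here is a minimal sketch that folds per-turn rewards into a discounted return. The function name `discountedReturn` and the example reward values are assumptions made for illustration.

```javascript
// Discounted return G_0 = r_0 + γ r_1 + γ² r_2 + ...
// Computed backwards so each step is one multiply and one add.
function discountedReturn(rewards, gamma) {
  let g = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    g = rewards[t] + gamma * g;
  }
  return g;
}

// Four turns, with reward only at the end of the conversation (made-up values).
const perTurn = [0, 0, 0, 1];
const g0 = discountedReturn(perTurn, 0.9); // approximately 0.9³ = 0.729
console.log(g0.toFixed(3));
```

With reward only on the final turn, the discount is the only thing that distinguishes early turns from late ones, which is why the next section needs a critic.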

Two things make this genuinely harder than single-turn generation, and both come straight from the foundations.

The environment is a human (non-stationary, partially observed)
In every previous lesson the transition P(s′|s,a) was a fixed rule or a simulator. Here the "environment" that produces the next state is the user — who reacts to what the assistant just said, has goals the assistant cannot see, and behaves differently across people and moods. The MDP is partially observed (a POMDP) and the dynamics shift with whoever is typing. This is why offline / preference-based methods dominate: you cannot cheaply roll out millions of live human conversations the way you can step a Gym env.

Multi-turn credit assignment

Here is the surprise unique to dialogue. Suppose a four-turn conversation ends well — the user got their answer and said thanks. Which turn earned the reward? Maybe the helpful turn was turn 2, where the assistant asked the one clarifying question that made everything else possible; turns 3 and 4 were easy once it had the answer. A naive scheme that dumps the final reward on the last turn credits the wrong action.

This is the credit-assignment problem from orientation ("assign credit across time when reward is delayed"), now operating at the granularity of turns. The tools are the ones the course already built: a critic over the conversation state, the advantage A = Q − V, and GAE(λ), all applied at the level of turns.

In practice: where the reward actually lands
Most production assistants apply the preference reward at the end of the assistant's response (a per-response terminal reward), then let the token-level value function spread that credit back over the tokens — and, across turns, let the conversation-level return tie turns together. So the credit-assignment machinery runs at two nested scales at once: tokens within a turn, turns within a conversation. The optimizer (PPO/GRPO) does not care which scale; it just consumes advantages.
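The turn-level half of that machinery can be sketched with GAE over turns. This is a toy under stated assumptions: the critic values in `values` are made-up numbers chosen so that the estimated value jumps after the clarifying question in turn 2, and `gae` is an illustrative helper, not code from the course.

```javascript
// rewards[t]: reward received after turn t.
// values[t]:  critic's estimate V(s_t) before turn t; terminal value is taken as 0.
// Returns GAE(λ) advantages, one per turn.
function gae(rewards, values, gamma, lambda) {
  const T = rewards.length;
  const adv = new Array(T).fill(0);
  let running = 0;
  for (let t = T - 1; t >= 0; t--) {
    const nextV = t + 1 < T ? values[t + 1] : 0;
    const delta = rewards[t] + gamma * nextV - values[t]; // TD residual
    running = delta + gamma * lambda * running;
    adv[t] = running;
  }
  return adv;
}

// Four-turn conversation that ends well; reward lands only at the end.
const rewards = [0, 0, 0, 1];
// Made-up critic values: the outlook improves sharply once turn 2 has happened.
const values = [0.2, 0.2, 0.8, 0.9];

const tdAdv = gae(rewards, values, 1.0, 0.0); // λ = 0: one-step TD advantages
const mcAdv = gae(rewards, values, 1.0, 1.0); // λ = 1: return minus baseline
const bestTurn = tdAdv.indexOf(Math.max(...tdAdv)); // 0-based index of the top turn

console.log("TD advantages:", tdAdv.map((a) => a.toFixed(2)));
console.log("most-credited turn (1-based):", bestTurn + 1);
```

With these numbers the one-step advantages put most of the credit on turn 2, whereas dumping the final reward on the last turn would credit turn 4. Any PPO- or GRPO-style optimizer would then consume the advantages unchanged.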

The reward is human preference — so it can be gamed

For dialogue there is no verifier. "Was that reply helpful?" has no unit test. So we are squarely in the RLHF regime of lesson 16: humans compare two replies, a Bradley–Terry reward model learns to score, and PPO optimizes against that proxy with a KL anchor to the SFT model. Every caveat from lesson 16 carries over — and dialogue surfaces two especially vivid failure modes.
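For reference, the Bradley–Terry step can be written in a few lines. This is a minimal sketch assuming the reward model emits one scalar score per reply; `btProb` and `btLoss` are illustrative names, and a real reward model would produce the scores from a neural network rather than take them as arguments.

```javascript
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// Bradley–Terry: probability that the chosen reply beats the rejected one,
// given scalar reward-model scores for each.
const btProb = (rChosen, rRejected) => sigmoid(rChosen - rRejected);

// Negative log-likelihood of one human comparison; training minimizes its mean.
const btLoss = (rChosen, rRejected) => -Math.log(btProb(rChosen, rRejected));

console.log(btProb(0, 0));            // equal scores: a coin flip
console.log(btLoss(2, 0).toFixed(3)); // model agrees with the rater: small loss
console.log(btLoss(0, 2).toFixed(3)); // model disagrees with the rater: large loss
```

Only the score difference matters, so the reward model is identified up to an additive constant. Nothing in the loss distinguishes "preferred because helpful" from "preferred because long or agreeable."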

Helpfulness vs verbosity

Human raters reliably, if slightly, prefer the longer answer because it looks more thorough. So the reward model learns a small positive coefficient on length. PPO is an optimizer; it finds that coefficient and pushes on it. The result is an assistant that pads every answer with restated questions, bullet-point preambles, and an "I hope this helps!" coda. The proxy reward climbs; the thing you actually wanted, a crisp and correct answer, does not improve. This is Goodhart's law and reward hacking from lesson 16, wearing the specific costume of verbosity.

Sycophancy

The sibling failure: raters prefer replies that agree with them and validate their premises, so the reward model rewards agreement. The optimizer learns to tell users what they want to hear — confirming a wrong belief, caving when challenged, flattering the question. Helpfulness and honesty come apart. Both pathologies share one root: the reward model is a proxy fit to what raters clicked, not to what is true or genuinely useful, and an unconstrained optimizer will exploit the gap.

The bug is the lesson
Give PPO a naive reward that says "helpful ≈ thorough ≈ long" and the maximum-reward policy is not the best assistant — it is the most padded one. The reward keeps going up while answer quality goes down. The fix is not a cleverer optimizer; it is a better reward signal (length-debiased reward models, multi-attribute rewards that penalize verbosity, honesty/agreement separated) plus the KL anchor holding the policy in the region where the proxy is still trustworthy. The widget below lets you watch verbosity hacking happen and then watch the anchor stop it.
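One crude way to picture a length-debiased reward is to regress the reward-model score on reply length and subtract the fitted linear trend. This is only a sketch under strong assumptions: the `{len, reward}` sample shape, the helper names, and the linear form are all hypothetical, and regressing out length also removes any legitimate link between length and quality, so it illustrates the idea rather than recommending a recipe.

```javascript
// Least-squares slope of reward on length: the "length coefficient" the text describes.
function lengthSlope(samples) {
  const n = samples.length;
  const meanLen = samples.reduce((s, x) => s + x.len, 0) / n;
  const meanR = samples.reduce((s, x) => s + x.reward, 0) / n;
  let cov = 0;
  let varLen = 0;
  for (const x of samples) {
    cov += (x.len - meanLen) * (x.reward - meanR);
    varLen += (x.len - meanLen) ** 2;
  }
  return { slope: cov / varLen, meanLen };
}

// Subtract the fitted length trend, centered so the average reward is unchanged.
function debias(samples) {
  const { slope, meanLen } = lengthSlope(samples);
  return samples.map((x) => ({
    ...x,
    reward: x.reward - slope * (x.len - meanLen),
  }));
}

// Made-up scores in which reward is driven purely by length.
const scored = [
  { len: 0, reward: 0.5 },
  { len: 1, reward: 2.5 },
  { len: 2, reward: 4.5 },
];
const fit = lengthSlope(scored);
const debiased = debias(scored);
console.log("slope:", fit.slope, "debiased:", debiased.map((x) => x.reward));
```

In this contrived data the scores differ only by length, so after debiasing every reply scores the same and the optimizer has no length gradient left to push on.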

Interactive · multi-turn helpfulness with a verbosity trap

A toy assistant picks, for each turn, how long to make its reply — a single knob ℓ ∈ [0,1] standing in for "padding." There is a hidden true helpfulness: it rises as the answer gets substantive, peaks at a moderate length, then falls as padding crowds out signal. But the proxy reward model — fit to raters who mildly prefer longer answers — keeps rewarding length past that peak. You set how strongly the proxy over-weights length (the verbosity bias), and the KL anchor β that pulls the policy back toward the concise SFT reference. Run PPO and watch which reward the policy chases.

Dialogue policy: proxy reward (verbosity-biased) vs true helpfulness, held by a KL anchor
Top: two reward curves plotted against reply length, true helpfulness (peaks at moderate length) and proxy reward (keeps rising with length, by the verbosity bias). The black tick is where the policy currently sits; the dashed gray tick is the concise SFT reference. Bottom KPIs: the proxy and true reward the policy is earning. Set verbosity bias high and β low → the policy walks off into padding and the proxy/true gap blows open. Raise β to leash it back.
Core JS of the widget:
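// ℓ ∈ [0,1] = how long/padded the reply is. The SFT reference is concise.
const LEN_SFT = 0.30;
// Gradient-ascent step size. Kept small because the peak of trueHelp is sharp
// (curvature 2/0.22² ≈ 41): a step of 0.05 overshoots and oscillates around the
// peak instead of settling. Stability needs roughly STEP·(41 + 2β) < 2.
const STEP = 0.01;
// TRUE helpfulness: rises then falls; padding eventually hurts.
const trueHelp = l => Math.exp(-Math.pow((l - 0.40)/0.22, 2));
// PROXY reward model: true help PLUS a verbosity bias that rewards length.
const proxy = (l, bias) => trueHelp(l) + bias * l;

let len = LEN_SFT;                       // policy's current reply length
function ppoStep(bias, beta){
  // Ascend  proxy(ℓ) − β·(ℓ − ℓ_SFT)².  The quadratic penalty is a toy
  // stand-in for the KL-to-reference term.
  const h = 1e-3;                        // finite-difference half-width
  const dProxy = (proxy(len+h,bias) - proxy(len-h,bias)) / (2*h);
  const dAnchor = 2 * (len - LEN_SFT);   // d/dℓ of (ℓ − ℓ_SFT)²
  len += STEP * (dProxy - beta * dAnchor);
  len = Math.max(0, Math.min(1, len));   // clamp to [0,1]
}
// What the policy EARNS: proxy is what it optimizes; true is what we wanted.
// (Illustrative readout; the widget's own display code is not shown.)
const earned = bias => ({ proxy: proxy(len, bias), trueHelp: trueHelp(len) });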

What you should see. With verbosity bias ≈ 0 the proxy is true helpfulness, and with β low the policy settles at the true peak: healthy. Crank the bias up and set β low (≈0). Once the bias exceeds the steepest downhill slope of the true-helpfulness curve (about 3.9 for the Gaussian in the core JS), the proxy curve slopes upward everywhere, so the policy marches ℓ→1, maximally padded, and the green true-helpfulness reading collapses while the blue proxy reading soars. With a smaller bias the policy instead stalls at a local maximum of the proxy a little past the true peak. That divergence is verbosity hacking, the exact shape of reward hacking from lesson 16. Now raise β: the quadratic KL anchor pulls the policy back toward the concise SFT length, and there is a sweet spot where it sits near the true peak again. Push β too high and it never leaves SFT, so nothing is learned. This is the same two-sided failure curve as the RLHF lesson's β, now in dialogue clothing.
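The four regimes described above can be reproduced offline with a short sweep. This sketch reuses the reward shapes from the core JS but makes its own assumptions: the `settle` helper, the step size of 0.001, the iteration count, and the specific bias and β values are all chosen here for illustration. The "matched" anchor uses β = 5·bias, which is where the stationary condition proxy′(ℓ) = 2β(ℓ − 0.30) is met exactly at the true peak ℓ = 0.40 for these constants.

```javascript
const LEN_SFT = 0.30;
const trueHelp = (l) => Math.exp(-Math.pow((l - 0.40) / 0.22, 2));
const proxy = (l, bias) => trueHelp(l) + bias * l;

// Run gradient ascent on proxy(ℓ) − β(ℓ − ℓ_SFT)² from the SFT length and
// return the final reply length. Step size and step count are assumptions of
// this sketch, picked small enough to stay stable for the largest β used below.
function settle(bias, beta, steps = 20000, step = 0.001) {
  let len = LEN_SFT;
  const h = 1e-3;
  for (let i = 0; i < steps; i++) {
    const dProxy = (proxy(len + h, bias) - proxy(len - h, bias)) / (2 * h);
    const dAnchor = 2 * (len - LEN_SFT);
    len = Math.max(0, Math.min(1, len + step * (dProxy - beta * dAnchor)));
  }
  return len;
}

const cases = [
  { name: "no bias, no anchor", bias: 0, beta: 0 },
  { name: "high bias, no anchor", bias: 5, beta: 0 },
  { name: "high bias, matched anchor", bias: 5, beta: 25 },
  { name: "high bias, heavy anchor", bias: 5, beta: 500 },
];

for (const c of cases) {
  const len = settle(c.bias, c.beta);
  console.log(
    `${c.name}: len=${len.toFixed(3)} ` +
      `proxy=${proxy(len, c.bias).toFixed(3)} true=${trueHelp(len).toFixed(3)}`
  );
}
```

Reading the printed lines in order gives the healthy case, verbosity hacking, the sweet spot, and the over-anchored policy that barely moves from the SFT length.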

WHERE THIS GOES NEXT (SYSTEMS)
This is the bridge. Everything above is the theory of aligning a chat assistant; the sibling course, RL Post-Training, From First Principles, is the same thing as a production GPU system. Start with 15 · RLHF — the original recipe for the InstructGPT-scale workflow and failure-mode tables, then 08 · Agentic RL — multi-turn + tool masking for exactly the multi-turn / agentic trajectory machinery this lesson sketched (how turns and tool calls are masked and credited in real rollouts). The theory you just learned — RLHF, the KL anchor, multi-turn credit assignment — is where this course hands off to that one. Theory here; engineering there.

Map back to the spine

Pin dialogue alignment onto the value / policy / model map:

Spine concept | In a chat assistant
MDP / POMDP (L01) | state = conversation history, action = the reply, environment = the human user
Trajectory & return | a conversation; return = discounted sum of per-turn preference rewards
Policy π_θ (policy-based, L03) | the LLM assistant, a policy over reply tokens
Value / advantage (L08) | critic over conversation state; turn-level credit assignment via A = Q − V, GAE(λ)
Reward signal (L16 RLHF) | Bradley–Terry reward model from human preference pairs: a proxy, hackable
Trust / KL anchor (L11 PPO) | KL(π_θ ‖ π_SFT) leashes the policy into the region where the proxy is trustworthy
The optimizer (L11/L12) | PPO / DPO / GRPO, unchanged from single-turn; multi-turn only reshapes the reward and the credit assignment
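The KL-anchor row is often implemented by folding a penalty into the reward itself. The sketch below assumes that form: the reward-model score minus β times a sampled estimate of KL(π_θ ‖ π_SFT), taken as the sum of per-token log-probability differences on the reply. The function name and the log-probability values are illustrative, not taken from the course.

```javascript
// rmScore:    scalar score from the reward model for the whole reply
// logpPolicy: log π_θ(token) for each token of the sampled reply
// logpRef:    log π_SFT(token) for the same tokens
// Returns the KL-shaped reward the optimizer would see.
function klShapedReward(rmScore, logpPolicy, logpRef, beta) {
  let klEstimate = 0;
  for (let i = 0; i < logpPolicy.length; i++) {
    klEstimate += logpPolicy[i] - logpRef[i]; // single-sample log-ratio estimate
  }
  return rmScore - beta * klEstimate;
}

// Policy identical to the reference: no penalty.
const unchanged = klShapedReward(1.0, [-1, -1], [-1, -1], 0.1);
// Policy much more confident than the reference on these tokens: penalized.
const drifted = klShapedReward(1.0, [-0.5, -0.5], [-1.5, -1.5], 0.1);
console.log(unchanged, drifted.toFixed(2));
```

The per-sample log-ratio can be negative for an individual reply; it is only its expectation under the policy that equals the KL divergence.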

So a dialogue assistant is actor–critic policy-gradient RL on a multi-turn POMDP, with the reward sourced from human preference (RLHF) and PPO's KL anchor keeping the optimization honest. Every piece is something the course already built — lesson 63 just chains the single-turn generator of lesson 62 across turns and points the reward at helpfulness. That is the entire frontier of modern assistant alignment, and it is where the theory course meets the systems course.

Takeaway
A chat assistant is multi-turn RL: a conversation is a trajectory of turns, the environment is a human, and the reward is human preference — i.e. the lesson 16 RLHF pipeline (SFT → Bradley–Terry reward model → PPO with a KL anchor) run across turns, with the value function doing turn-level credit assignment. The signature failure is reward hacking in dialogue dress: a naive reward that conflates helpfulness with length or with agreement yields a maximally verbose or sycophantic policy whose proxy reward climbs while true quality falls — fixed by a better reward and the KL anchor, never by a cleverer optimizer. This lesson is the explicit bridge: the same PPO/RLHF machinery, productionized for real assistants, is the subject of the sibling RL Post-Training systems course.