Supervised Fine-Tuning

One mask line, one chat template, and a text completer starts following instructions.

The gap pretraining leaves behind

At the end of lesson 2 we had a model that has learned P(text): feed it any prefix and it produces a plausible continuation. That is enormously useful, but it is the wrong shape for what we actually want, which is a function P(response | prompt). The model has no notion of request versus answer; it only has a notion of what comes next in text that looks like its training corpus.

The canonical failure is concrete and slightly comic. Type What is 2+2? into a pretrained model and watch it continue:

prompt:     What is 2+2?
generated:  What is 3+3? What is 4+4? What is 5+5? ...

Nothing about that output is wrong, exactly. It is in-distribution. The model has seen many strings that look like the start of an arithmetic worksheet, and a worksheet continues with more questions, not with the answers. The pretrained policy is doing precisely what it was optimized to do — maximizing log P(text) on a string that happens to be the opening of a worksheet — and that objective never said anything about answering anyone. The fact that the corpus also contains question-answer text does not help here: in a worksheet you do not insert the answer mid-quiz. The model's continuation is contextually correct, behaviorally useless.

The same model is also perfectly capable of producing the answer, in the right framing. Prompt it with Q: What is 2+2?\nA: and you will likely get 4. The knowledge is in the weights. What is missing is a reliable dispatcher — something that takes "this is a user request" and routes it to "produce a response" rather than "produce more of the same kind of text". Installing that dispatcher is what SFT does.

The one-sentence reframing
Pretraining gives you a distribution over text. SFT does not replace it; it carves out a particular conditional slice — P(response | prompt) for the subset of "text" that happens to be formatted as a user/assistant turn — and pushes gradient only at that slice.

The fix in one sentence

Restrict the loss to response positions only, on (prompt, response) pairs.

That is the entire idea. We do not change the architecture, the optimizer, the vocabulary, the loss function, the tokenizer, the attention pattern, the positional encoding, the weight init, or the training loop's outer shape. We do not introduce a new gradient estimator, a new sampling protocol, or a new model on the side. We change two things: the data (now formatted as prompt/response pairs) and which positions count toward the loss (only the response). Everything else carries over from lesson 2 verbatim.

This is worth dwelling on because almost every later technique in this series — CoT, DPO, RLVR — preserves the same loss machinery and changes one specific other thing. The pattern of "the architecture is fixed, point the loss differently" is the whole pedagogical arc.

The chat template — a token-only protocol

The model has exactly one input channel: tokens. So if we want to tell it "this part is the user's request and that part is your response", we must say it using tokens. We need a delimiter scheme — a tiny protocol — embedded in the text itself.

In 01_sft.py the scheme is three single-character specials:

'<'   USER_START      — start of the user's message
'>'   ASSISTANT_START — point at which the model should begin responding
'#'   END_OF_TURN     — stop marker

Any unambiguous delimiter scheme works equally well. Real models use multi-character special tokens with distinctive byte patterns to avoid collision with normal text: for example <|im_start|>user, <|im_start|>assistant, <|im_end|> in the ChatML family, or the more elaborate Llama 3 template with header tokens and role names. The choice between schemes matters for production (collision resistance, tokenizer efficiency, conditional formatting for tools and system prompts), but it does not matter for understanding the mechanism. What matters is that the scheme is unambiguous and applied consistently, so the model can learn it by repeated exposure.

A single formatted example, for the toy task of reversing a word (a five-letter word here; the toy itself uses four-letter words):

< h e l l o > o l l e h #

The prompt is everything up to and including >; the response is everything the model should produce, including the terminating #. At inference time we feed the model the prompt portion only, then decode greedily until it emits # or we hit a length cap.
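The template can be sketched in a few lines. This is a minimal illustration, assuming a char-level vocabulary of 26 lowercase letters plus the three specials; the helper names (format_example, encode, decode, STOI, ITOS) are hypothetical and not taken from 01_sft.py.

```python
import string

USER_START, ASSISTANT_START, END_OF_TURN = "<", ">", "#"

# Char-level vocabulary: 26 lowercase letters + 3 specials = 29 entries.
VOCAB = list(string.ascii_lowercase) + [USER_START, ASSISTANT_START, END_OF_TURN]
STOI = {ch: i for i, ch in enumerate(VOCAB)}
ITOS = {i: ch for ch, i in STOI.items()}


def format_example(word):
    """Return (prompt, response) strings for the word-reversal task."""
    prompt = USER_START + word + ASSISTANT_START
    response = word[::-1] + END_OF_TURN
    return prompt, response


def encode(text):
    """Map each character to its integer id."""
    return [STOI[ch] for ch in text]


def decode(ids):
    """Inverse of encode."""
    return "".join(ITOS[i] for i in ids)


prompt, response = format_example("hello")
ids = encode(prompt + response)
P = len(prompt)  # prompt length in tokens, counting '<' and '>'
print(prompt + response)  # <hello>olleh#
print(P, len(ids))        # 7 13
```

At inference time only encode(prompt) would be fed to the model; the response half exists only in training data.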

Design note: why single-char specials in the toy
The toy uses single characters because the tokenizer is char-level and the vocab has 29 entries (26 letters + 3 specials). In a BPE world you would not pick characters that already appear in natural text — that's why production templates use multi-byte markers with otherwise-impossible sequences. The pedagogical point is that the scheme is a convention the model learns by repeated exposure, not a hardware feature.

The loss

Pretraining minimized the next-token cross-entropy summed over every position in a sequence. SFT minimizes the same quantity summed only over response positions:

\[
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim D} \sum_{t \in \text{response}} \log p_\theta\left(x_{t+1} \mid x_{\le t}\right)
\]

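Walking through the parts: θ is the model's parameters; D is the dataset of (prompt, response) pairs; the sum runs over the positions t whose target x_{t+1} is a response token; and log pθ(x_{t+1} | x≤t) is the same next-token log-probability that pretraining used.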

A subtle but important point: prompt positions still run the forward pass. They have to, because each response token attends to every preceding position, including all of the prompt. If you somehow skipped the prompt at forward time, the response tokens would have nothing to condition on. What we skip is the loss term at those positions: prompt positions contribute zero to the scalar loss, so no gradient originates from predicting prompt tokens. (Gradient from response positions still flows back through the prompt's hidden states via attention.) The model sees the prompt, but is not asked to predict it.

Another way to read the same equation: pretraining's gradient is summed over every next-token prediction in the sequence; SFT's gradient is summed only over predictions that step into the response. The total number of gradient terms per example drops — typically by 60–90% — but each term is now pointed at exactly the thing we want the model to learn.
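The claim that masked positions contribute no loss and no gradient can be checked numerically. The sketch below is an illustration under stated assumptions: random logits stand in for a real model's output, the token ids are random, and the sizes V, P, R are arbitrary. It uses reduction="sum" to match the summed form of the formula; the F.cross_entropy default is a mean over the non-ignored targets.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
IGNORE_INDEX = -100
V, P, R = 29, 7, 6                 # vocab size, prompt length, response length
T = P + R

s = torch.randint(0, V, (T,))      # stand-in token ids for one example
y = s[1:].clone()                  # shifted targets
y[: P - 1] = IGNORE_INDEX          # mask the prompt targets

# Stand-in for model output: one row of logits per input position.
logits = torch.randn(T - 1, V, requires_grad=True)

# Summed cross-entropy over the unmasked targets only.
loss = F.cross_entropy(logits, y, ignore_index=IGNORE_INDEX, reduction="sum")

# The same quantity computed by hand over response targets only.
logp = F.log_softmax(logits, dim=-1)
resp = torch.arange(P - 1, T - 1)
manual = -logp[resp, s[1:][resp]].sum()

loss.backward()
prompt_grad = logits.grad[: P - 1].abs().sum().item()    # rows with masked targets
response_grad = logits.grad[P - 1 :].abs().sum().item()  # rows with trained targets
print(abs(loss.item() - manual.item()) < 1e-4, prompt_grad, response_grad > 0)
```

Here the logits are a leaf tensor, so the zero rows are the whole story. In a real transformer the gradient at the prompt logits is likewise zero, but the prompt's hidden states still receive gradient through attention from response positions.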

The implementation: one line

In code, the entire SFT-specific behavior collapses into two things: a target tensor whose prompt positions are set to IGNORE_INDEX, and a call to F.cross_entropy with ignore_index=IGNORE_INDEX. From 01_sft.py:

s = torch.tensor(ids, dtype=torch.long)    # full sequence, length T
x = s[:-1]                                  # (T-1,) — model input
y = s[1:].clone()                           # (T-1,) — shifted targets
y[: P - 1] = IGNORE_INDEX                   # mask prompt; IGNORE_INDEX = -100 (the F.cross_entropy default)

# later, in the training step:
loss = F.cross_entropy(
    logits.view(-1, V),
    y.view(-1),
    ignore_index=IGNORE_INDEX,
)

The interesting line is y[: P - 1] = IGNORE_INDEX. Why P - 1 and not P?

It's the shift. The sequence s has length T = P + R where P is prompt length and R is response length. We compute x = s[:-1] (length T-1) and y = s[1:] (length T-1). At position i of x, the model is asked to predict s[i+1], which is stored at y[i]. The first response token is s[P], and that token is the target at y[P-1]. So the indices we want to train on are y[P-1], y[P], … , y[T-2]. The indices we want to mask are y[0], … , y[P-2] — i.e. y[:P-1].

Easy off-by-one: if you write y[:P] you also mask the very first response prediction, which is the one where the model goes from "I'm reading the prompt" to "now I'm producing the answer" — arguably the single most important transition to train. Conversely, if you write y[:P-2] you train on one position of prompt prediction, which leaks a tiny bit of "learn to generate user text" into the loss. The shift-by-one is genuinely worth drawing on paper the first time you implement this.

Shift-by-1 target alignment: each x[i] predicts y[i] = s[i+1].

i     0  1  2  3  4  5  6  7  8  9  10 11
x[i]  <  h  e  l  l  o  >  o  l  l  e  h
y[i]  h  e  l  l  o  >  o  l  l  e  h  #

y[0 .. P-2] ← IGNORE_INDEX; y[P-1 .. T-2] ← trained. Here P = 7 (prompt = "<hello>") and T = 13, so y[:6] is masked and y[6..11] is trained.
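The same alignment can be reproduced in plain Python as a sanity check on the off-by-one. This sketch keeps the tokens as characters for readability rather than integer ids; the variable names are illustrative.

```python
IGNORE_INDEX = -100

text = "<hello>olleh#"
P = text.index(">") + 1          # prompt length, counting '<' and '>'
s = list(text)                   # char-level tokens
x = s[:-1]                       # model input, length T-1
y = s[1:]                        # shifted targets, length T-1

# Apply the y[:P-1] mask described above.
masked_y = [IGNORE_INDEX if i < P - 1 else tok for i, tok in enumerate(y)]

# Collect the (index, input, target) triples that actually receive loss.
trained = [(i, x[i], y[i]) for i in range(len(y)) if masked_y[i] != IGNORE_INDEX]

for i, inp, tgt in trained:
    print(f"x[{i}]={inp!r} -> y[{i}]={tgt!r}")
```

The first trained pair is the input > predicting the first response character, which is the transition the y[:P] mistake would drop.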

Why loss masking matters — the ablation

It is tempting to ask: is masking really necessary? What if we just trained on the whole sequence as if it were pretraining? The answer is "you get a much worse model", and the reason is worth being explicit about.

Without masking, the loss at a prompt position pushes the model to predict the next token of the user's message given the previous tokens of the user's message. Concretely, at the second input position of <hello>olleh# (index 1, which holds h), the unmasked loss would penalize the model whenever p(e | <, h) is low. That gradient is telling the model "after a user types <h, the next thing should be e", which is to say, you are partially training a model that generates user input.

Equivalently: the unmasked variant is "more pretraining on slightly-weirdly-formatted text". The bytes of the prompt come from whatever distribution generated user messages in the dataset; the loss at those positions tries to fit that distribution. None of that improves the model's response ability; in fact, since the model has finite capacity and training has a finite step budget, every gradient spent on prompt prediction is gradient not spent on the actual task.

01_sft.py runs this ablation explicitly: train two models on the same data with the same optimizer, same number of steps, same seed — the only diff is whether mask_prompt is True or False. The accuracy gap is large and reliable. The widget below shows the qualitative shape of what happens.

[Interactive widget: loss mask visualizer. Each chip is a token at training time, with a bar showing its gradient contribution; a toggle switches the prompt mask on and off. The probabilities shown are hardcoded and illustrative. With the mask on, prompt-position predictions are gibberish, which is fine because they are ignored. With the mask off, the model is forced to fit prompt predictions and response accuracy drops (illustrative: high when masked, low when unmasked).]

Right-pad, not left-pad

If your batch contains variable-length examples, you need padding to make them rectangular. There are two ways to do it: right-pad (real tokens first, then PAD) or left-pad (PAD first, then real tokens). For training, you must right-pad. For inference with a batched generator, people sometimes left-pad. Mixing these up silently breaks training.

The reason is causal attention. The model at position t attends to positions 0 … t; it cannot see the future. If you left-pad, the prompt is preceded by a sequence of PAD tokens. The first real prompt token now has only PAD tokens in its left context. The hidden state at that position is computed by attending to PAD-embedded vectors — not nothing, not noise, but specifically the embeddings for the PAD token. The model has to learn to ignore these, which it can do given a pad-mask, but the masking has to be wired through both attention and the loss.

What causal attention sees at the position of >:

right-pad:  < c a t > t a c # P P    attends to: < c a t > (the prompt; correct)
left-pad:   P P < c a t > t a c #    attends to: P P < c a t > (PAD embeddings pollute the prefix; an attention mask is needed)

For SFT training the simplest correct answer is: right-pad and ignore-index the pad targets so the loss skips them. Causal attention naturally hides everything after position t, so trailing PADs cannot leak backward into any real token's context. That is exactly what we want.
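A minimal sketch of that recipe, using plain Python lists so the index arithmetic stays visible. The function name collate_right_pad, the PAD id of 0, and the example token ids are all hypothetical; in a real pipeline the returned lists would be converted to tensors.

```python
IGNORE_INDEX = -100
PAD_ID = 0  # hypothetical PAD token id


def collate_right_pad(examples):
    """examples: list of (ids, P) pairs, where P is the prompt length.

    Returns (inputs, targets) as rectangular lists of lists.
    """
    max_len = max(len(ids) for ids, _ in examples) - 1    # length after the shift
    inputs, targets = [], []
    for ids, P in examples:
        x = ids[:-1]
        y = [IGNORE_INDEX] * (P - 1) + ids[1:][P - 1:]    # prompt mask
        n_pad = max_len - len(x)
        inputs.append(x + [PAD_ID] * n_pad)               # real tokens first
        targets.append(y + [IGNORE_INDEX] * n_pad)        # PAD targets are ignored
    return inputs, targets


batch = [
    ([10, 11, 12, 13, 20, 13, 12, 11, 21], 5),   # P=5, R=4
    ([10, 11, 12, 20, 12, 11, 21], 4),           # P=4, R=3
]
inputs, targets = collate_right_pad(batch)
for row_x, row_y in zip(inputs, targets):
    print(row_x, row_y)
```

Because the PADs trail the real tokens, no real position ever attends to them under causal attention, and the IGNORE_INDEX targets keep them out of the loss.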

Inference is a different setting because at generation time the prompts in a batch have different lengths and you want the model to start generating from the end of each prompt simultaneously. Left-padding aligns the prompt ends, which makes batched decoding much simpler — but it forces you to pair the left-pad with an attention mask that hides the PAD tokens. Frameworks like vLLM and HF generate handle this for you; if you are rolling your own batched decoder, this is one of the standard tripwires.
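For a hand-rolled batched decoder, the left-pad and its attention mask might be built together, as in the sketch below. The helper name left_pad and the PAD id are hypothetical; the 1-for-real, 0-for-PAD mask convention is an assumption chosen to match common practice, and how the mask is consumed depends on the model implementation.

```python
def left_pad(prompts, pad_id):
    """Left-pad token-id lists to a common length.

    Returns (ids, attention_mask); the mask is 1 for real tokens, 0 for PAD.
    """
    max_len = max(len(p) for p in prompts)
    ids, mask = [], []
    for p in prompts:
        n_pad = max_len - len(p)
        ids.append([pad_id] * n_pad + list(p))
        mask.append([0] * n_pad + [1] * len(p))
    return ids, mask


PAD_ID = 0  # hypothetical PAD token id
ids, mask = left_pad([[5, 6, 7, 8, 9], [5, 6, 9]], PAD_ID)
# Every row now ends with its own final prompt token (9 here),
# so generation can start at the same column for the whole batch.
print(ids)
print(mask)
```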

The toy in 01_sft.py sidesteps the whole issue by making every example exactly the same length — fixed four-letter words give a fixed sequence length, so no padding is needed. Real datasets are not so cooperative; the right approach there is bucketing by length plus right-pad-with-mask within a bucket.

P(text) → P(response | prompt), pictured

SFT does not throw away pretraining; it carves a slice out of it. Pretraining gave the model a distribution over all text. SFT picks out the (vast minority) subset of strings that happen to be formatted as user/assistant turns, and it uses prompt-masking to point the loss only at the response side of those strings. Everything else the model knows — grammar, vocabulary, world facts, the shape of natural English — is left intact.

[Figure: nested regions. The outer region is the universe of text that P(text) models (books, code, web, worksheets, dialogues, ...). Inside it, the SFT data is a tiny subset of the corpus: chat-formatted strings only (prompt + response, with delimiters). Within those strings, the prompt mask lands gradient on the response positions. SFT = restrict to the conditional slice and point gradient at the response side.]

Stated more carefully: at the level of probability, SFT is finding parameters that put high mass on a particular conditional distribution. By the definition of conditional probability, P(response | prompt) = P(prompt, response) / P(prompt), or in log form, log P(prompt, response) = log P(prompt) + log P(response | prompt). Pretraining was implicitly fitting the joint on all text. SFT restricts the optimization to chat-formatted text and, by virtue of the mask, keeps only the log P(response | prompt) term. The log P(prompt) term is exactly what we are refusing to fit, because fitting it would push the model to learn how to generate prompts, which is opposite to the goal.
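Written out with the indices from the masking section (sequence s of length T, prompt length P), the autoregressive factorization splits into exactly the two pieces the mask separates. This is a worked restatement under the assumption that the first token s_0 is always the user-start delimiter and is conditioned on rather than predicted:

```latex
\begin{aligned}
\log p_\theta(s_1, \dots, s_{T-1} \mid s_0)
  &= \sum_{i=0}^{T-2} \log p_\theta(s_{i+1} \mid s_{\le i}) \\
  &= \underbrace{\sum_{i=0}^{P-2} \log p_\theta(s_{i+1} \mid s_{\le i})}_{\text{prompt terms: masked, } y[0 \,..\, P-2]}
   \;+\;
     \underbrace{\sum_{i=P-1}^{T-2} \log p_\theta(s_{i+1} \mid s_{\le i})}_{\log p_\theta(\text{response} \mid \text{prompt})\text{: trained, } y[P-1 \,..\, T-2]}
\end{aligned}
```

Pretraining on this string would maximize both sums; the prompt mask drops the first and keeps the second.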

Why we keep πref-free here

In DPO and RLVR (later lessons) we will introduce a frozen reference policy πref — typically a copy of the model's weights at the start of post-training. The training loss then includes a KL term KL(πθ ‖ πref) that anchors the trained policy to the reference, preventing it from drifting too far during reward-driven updates.

SFT has no such term. There is no πref, no anchor, no KL penalty. Why?

Because SFT does not have a runaway-drift failure mode. The loss is bounded below by zero and well-conditioned: cross-entropy on a fixed, supervised dataset of (prompt, response) pairs cannot exploit a reward signal because there is no reward signal; every gradient term is matching a specific labeled next token. The "worst case" of SFT going wrong is overfitting to the training distribution, which is qualitatively different from policy collapse onto a reward exploit.

More to the point: drift is the point of SFT. We want the model to move away from the pure-text-completer behavior toward the instruction-follower behavior. Anchoring to πref = pretrained-model would defeat the purpose. The pretrained model is where we are leaving from, not what we are pulling back to.

The reference re-enters in DPO and RLVR for a specific reason: those losses involve a reward (explicit or implicit). Reward-driven losses can drive the policy to degenerate places — repeated tokens, prompt-injection-style exploits, mode collapse onto a single trick — and the KL anchor is what makes the math finite and the behavior stable. SFT has no reward, so no anchor.

What SFT cannot do — the gap that DPO will fill

SFT needs a single gold response per prompt. That is its entire contract: show me an instruction and the correct answer, repeatedly, and I will teach the model to produce that answer. Three consequences follow.

First, SFT cannot consume preferences. If you have raters who saw two model responses A and B and said "A is better than B", there is no place to put that information in the SFT loss. You could try to teach on A and ignore B, but then you have thrown away the relative judgment — you do not know whether A is great or merely the lesser evil. SFT is purely absolute, not relative.

Second, SFT overfits on tasks with multiple acceptable answers. "Write a haiku about autumn" has thousands of correct responses. The SFT dataset will have one. The model will learn to produce that one — and slight variations of it — and will be penalized whenever it produces a different but equally valid alternative. This is the well-known "imitation collapse" failure mode of pure SFT.

Third, SFT cannot improve past the labeler. The model's ceiling is the response set the demonstrators wrote. If a demonstrator wrote a slightly suboptimal answer, that suboptimality is now the training target. There is no mechanism for the model to discover that a different answer would have been better.

DPO addresses the first; RLVR addresses the third (and partially the second). We will come back to this.

Pretraining vs SFT in one line

The compressed version
Pretraining learns P(text). SFT narrows the objective to P(response | prompt) for the subset of text that happens to be formatted as user/assistant turns. All the knowledge already lived in the weights; SFT installs a reliable dispatcher.

The corollary is that SFT's compute requirements are dramatically smaller than pretraining's. Pretraining is teaching everything — grammar, syntax, vocabulary, world knowledge, code, math. SFT is only teaching the routing layer: how to recognize "this is a request" and respond accordingly. You can SFT a 7B model usefully in hours on hardware that would take months to pretrain it on. The asymmetry is by design.

Trade-offs at every design choice

Token budget: 60–90% of positions get zero gradient

In our toy, a four-letter reversal example has P=6 prompt tokens (e.g. <abcd>) and R=5 response tokens (e.g. dcba#). After the shift-by-one, we have T-1 = 10 training positions, of which only 5 receive gradient. That's 50% utilization, and in this toy the prompt is short. In a realistic chat fine-tune the prompt might be 200 tokens and the response 50 tokens, giving you about 20% utilization (50 of 249 positions). You are throwing away roughly 80% of your gradient terms.
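The arithmetic generalizes to a one-line formula: R trained targets out of T-1 = P+R-1 shifted positions. A small helper (the name utilization is illustrative) makes it easy to check other prompt/response ratios:

```python
def utilization(P, R):
    """Fraction of the T-1 shifted positions whose target is a response token."""
    return R / (P + R - 1)


toy = utilization(P=6, R=5)        # 5 / 10
chat = utilization(P=200, R=50)    # 50 / 249
print(f"{toy:.2f} {chat:.3f}")     # 0.50 0.201
```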

This is accepted, deliberately, because the alternative — training on the prompt tokens — actively harms the model as the ablation shows. Empty gradient terms cost a forward pass's worth of compute but produce zero (and zero is better than negative). In practice this is one reason SFT runs are typically shorter than pretraining: you need fewer optimizer steps when each step is more on-task.

If you want to recover that utilization, the right move is packing: concatenate multiple examples into a single training sequence (separated by EOS or document boundaries) so that within one fixed-length batch row you are processing several prompt-response pairs at once. This is what production fine-tunes do. The mechanics are unchanged — you mask each prompt's positions in each packed example — but the bookkeeping is fiddlier.
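A sketch of the loss-mask side of packing, assuming each example arrives as (ids, P) and already ends with its own end-of-turn token. The function name pack and the token ids are hypothetical. This covers only the target mask; whether to also block attention across example boundaries is a separate design choice that this sketch does not address.

```python
IGNORE_INDEX = -100


def pack(examples):
    """examples: list of (ids, P). Concatenate them into one stream and build
    shifted targets in which only response tokens are trained."""
    stream, is_response = [], []
    for ids, P in examples:
        stream.extend(ids)
        is_response.extend([False] * P + [True] * (len(ids) - P))
    x = stream[:-1]
    # Target i is stream[i+1]; keep it only if that token is a response token.
    y = [tok if keep else IGNORE_INDEX
         for tok, keep in zip(stream[1:], is_response[1:])]
    return x, y


ex1 = ([1, 2, 3, 9, 3, 2, 8], 4)   # stand-in for "<ab>ba#"
ex2 = ([1, 4, 9, 4, 8], 3)         # stand-in for "<c>c#"
x, y = pack([ex1, ex2])
print(y)
```

Marking tokens rather than positions handles the boundary automatically: the prediction that steps from the first example's end-of-turn into the second example's prompt is masked like any other prompt target.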

Single-turn vs multi-turn templates

Our toy template handles exactly one user turn and one assistant turn. Real chats alternate: user, assistant, user, assistant, ... A correct multi-turn template extends the same idea — wrap each turn in its delimiters, mask every position that is not inside an assistant turn — but you now have several response regions in a single sequence and the mask has to thread through all of them.

The simplest implementation is to compute one boolean mask over the full sequence: True at positions whose target is inside any assistant turn, False elsewhere. Apply IGNORE_INDEX to all False positions. Conceptually identical to the single-turn case, mechanically slightly more code.
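As an illustration, the mask can be built with a single pass over the tokens. The sketch assumes a multi-turn format that simply concatenates single-turn blocks using the toy's three delimiters; that format and the helper name assistant_token_mask are assumptions, since the toy itself only handles one turn.

```python
IGNORE_INDEX = -100
USER_START, ASSISTANT_START, END_OF_TURN = "<", ">", "#"


def assistant_token_mask(tokens):
    """True for tokens the assistant produces: everything after '>' up to
    and including the closing '#'."""
    mask, in_assistant = [], False
    for tok in tokens:
        if tok == USER_START:
            in_assistant = False
        mask.append(in_assistant)
        if tok == ASSISTANT_START:
            in_assistant = True
        elif tok == END_OF_TURN:
            in_assistant = False
    return mask


tokens = list("<ab>ba#<cd>dc#")            # two user/assistant exchanges
mask = assistant_token_mask(tokens)

# Shifted targets: keep a target only if the token it names is assistant-produced.
y = [tok if keep else IGNORE_INDEX for tok, keep in zip(tokens[1:], mask[1:])]
print(y)
```

Both assistant turns receive loss, and every user token and delimiter outside them is ignored.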

One subtle decision in multi-turn SFT is whether earlier assistant turns contribute to the loss when training on later turns. Almost always: yes. The model needs to learn what an "assistant turn following the previous turns" looks like, and that includes producing whatever the conversation history says it produced.

Same-length vs variable-length batches

The toy uses fixed-length examples to keep batching trivial. Real datasets do not, and you now have a choice: pad to the longest example in the batch, or bucket-then-pad within buckets, or pack as described above. Padding wastes compute proportional to the length-variance of your batch; packing wastes none but requires per-example mask construction. The middle road — length-bucketed batches with modest padding — is what most production fine-tuners use because it is simple and the overhead is small.

If you do pad: right-pad, and mask the PAD targets with IGNORE_INDEX in addition to the prompt masking. A target is trained only if it is both a response target and not PAD; in other words, the two keep-masks compose by AND.

Learning rate and steps

SFT typically uses a smaller learning rate than pretraining (often up to 10x smaller) because the model is already in a good place and you do not want to overwrite generic knowledge. The toy uses lr=3e-4 for both because the model is tiny and the task is narrow; in a production run you might set the SFT learning rate around 1e-5 for a 7B model and an order of magnitude smaller for larger checkpoints. Fewer steps too: you are not learning a new language, just installing a behavior pattern.

Forward link to CoT

Lesson 4 is going to feel anticlimactic, and that is the point. The next stage, chain-of-thought SFT, uses the same loss, the same mask, the same optimizer, the same architecture. The only thing that changes is what the response text contains. Instead of an example like:

<3+4+5>12#

we will train on examples like:

<3+4+5>{3+4=7;7+5=12}12#

and watch the model learn to write its reasoning out loud before producing the final answer. The mask still ignores the prompt; the model still emits up to #; the cross-entropy loss is still summed over response positions only. The reasoning trace just lives inside the response.
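A quick check of that claim on the two example strings: splitting at the assistant-start delimiter gives the same prompt and the same P for both, so the mask line does not change; only the response grows. The helper name split_example is illustrative.

```python
def split_example(text):
    """Split a formatted example into (prompt, response) at the first '>'."""
    P = text.index(">") + 1
    return text[:P], text[P:]


plain_prompt, plain_response = split_example("<3+4+5>12#")
cot_prompt, cot_response = split_example("<3+4+5>{3+4=7;7+5=12}12#")

print(plain_prompt == cot_prompt, len(plain_prompt))   # True 7
print(len(plain_response), len(cot_response))          # 3 17
```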

Why this works is a separate and beautiful question — it has to do with the fact that each emitted token is a new forward pass, so longer responses give the model more serial compute. We will get there.

Takeaway
SFT is the smallest possible change that turns a text completer into an instruction follower. One mask line. One chat template. The architecture, loss shape, optimizer, and vocabulary are inherited from pretraining unchanged. The model's knowledge does not change — what changes is the model's routing: where it points its probability mass when the input is shaped like a request. Everything later in this series will preserve this loss machinery and change exactly one other thing.