The data, by regime
Before you build a pipeline you need to know exactly what it must produce. Each of the three post-training regimes defines a different unit of data with a different schema, a different loss mask, and different constraints on where it comes from and how it is generated.
SFT: the (prompt, response) pair
Supervised fine-tuning teaches the model to imitate a target response. The unit is a (prompt, response) pair. In practice almost nothing is stored as a flat string; instead the record is a list of chat messages, each with a role and content, that a chat template (Jinja2, typically) renders into a single token sequence before training.
SFT record (multi-turn)
┌─────────────────────────────────────────────────┐
│ messages: [ │
│ { role: "system", content: "You are …" } │
│ { role: "user", content: "Q1" } │
│ { role: "assistant", content: "A1" } ← LOSS │
│ { role: "user", content: "Q2" } │
│ { role: "assistant", content: "A2" } ← LOSS │
│ ] │
└─────────────────────────────────────────────────┘
Loss is computed ONLY on assistant tokens.
System + user tokens are masked out (label = -100).
Two properties dominate SFT data quality:
- Turn depth. A single exchange (
user → assistant) is a degenerate case. Real instruction data includes multi-turn dialogues, which forces the model to track context. Single-turn data creates single-turn models. - Response quality. The model learns to match the distribution of responses it sees. One bad-faith, over-hedged, or factually wrong response in the training set is worth far less than zero — it actively teaches the wrong behaviour. Correctness and stylistic consistency matter; volume is secondary.
label = -100 convention signals PyTorch's cross-entropy to skip those positions when accumulating the loss. The pipeline stage that tokenizes the record must emit both input_ids and a matching labels tensor with all non-assistant positions zeroed out. Getting this wrong (masking the wrong span, or masking nothing) is silent — the loss still runs, just on the wrong tokens. Lesson 07 covers this in full; keep it in mind as you design the schema.
Preference: the (prompt, chosen, rejected) triple
Preference learning — DPO, IPO, KTO, reward modelling for RLHF — does not need a single "correct" response. It needs a comparison: given the same prompt, which response is better? The unit is a (prompt, chosen, rejected) triple, where both chosen and rejected are full response texts (or message lists).
Preference record ┌─────────────────────────────────────────────────┐ │ prompt: "Explain gradient descent" │ │ chosen: "Gradient descent is an iterative …" │ │ rejected: "It's basically just going downhill" │ │ source: "human" | "ai" | "synthetic" │ │ score_chosen: 4.2 (optional, continuous) │ │ score_rejected: 1.8 (optional) │ └─────────────────────────────────────────────────┘
Three things to know about the source of preference labels:
- Human annotation (RLHF). Expensive, slow, hard to scale, but high-signal on subtle quality dimensions (tone, safety, factuality). Requires careful annotation guidelines and inter-annotator agreement checks.
- RLAIF (AI feedback). A stronger LLM plays the role of annotator. Faster and cheaper, but biases the data toward the judge model's own stylistic preferences — if the judge likes verbose answers, the dataset will systematically prefer verbose answers.
- Length bias. Annotators (human and AI) consistently prefer longer responses, all else equal. This is one of the most well-documented artefacts in preference data: a 500-token response beats a 100-token response on the same question, even when the short one is more accurate. Pipelines should log response lengths and check that the chosen/rejected split is not confounded with length.
RL / RLVR: the (prompt, verifier) seed
Reinforcement learning over verifiable rewards (RLVR) — the approach behind DeepSeek-R1, GRPO, and similar methods — looks like the simplest dataset: just a prompt and a way to score the model's answer. But it is architecturally the most different from SFT/preference, because the responses do not live in the dataset at all.
RL / RLVR record (the "seed") ┌─────────────────────────────────────────────────┐ │ prompt: "If 3x + 7 = 22, what is x?" │ │ answer: "5" ← gold, for verifier │ │ verifier: "exact_match" | "python_eval" | … │ │ domain: "math/algebra" │ │ difficulty: 0.42 │ └─────────────────────────────────────────────────┘ Responses are NOT stored here. The policy generates them LIVE during training.
At training time the policy samples K responses per prompt, a verifier scores each one, and the scores become rewards that update the policy. The "dataset" is really a prompt bank — a curated collection of seeds that are diverse, appropriately difficult, and unambiguously verifiable. The actual training data is the online rollout stream, not the bank.
One schema, three projections
All three record types share enough structure that it pays to define a single canonical schema and treat SFT, preference, and RL as projections — subsets of the shared fields. This matters for storage, for quality gates that run across all regimes, and for auditability.
| Field | SFT | Preference | RL |
|---|---|---|---|
id | required | required | required |
source | required | required | required |
messages / prompt | messages[] | prompt string | prompt string |
response | last assistant turn | — | — (generated live) |
chosen / rejected | — | required | — |
answer / verifier | — | — | required |
lang | recommended | recommended | recommended |
difficulty | optional | optional | recommended |
token_count | recommended | recommended | optional |
license | required | required | required |
origin | required | required | required |
created_at | recommended | recommended | recommended |
Storing all records in the same Parquet table (with null values for unused fields per regime) lets you run one dedup pass, one quality gate, and one provenance audit across the whole corpus before splitting into regime-specific gold tables. The regime split is a late, cheap operation; the cleaning is early and shared.
Regime comparison
| Property | SFT | Preference | RL / RLVR |
|---|---|---|---|
| Unit | (prompt, response) | (prompt, chosen, rejected) | (prompt, verifier/answer) |
| Typical source | Human demos, distillation, synthetic | Human or AI pairwise judgments | Curated problem sets (math, code, logic) |
| Loss masked on | All non-assistant tokens | Implicitly via DPO loss; chosen/rejected responses both used | No static loss — reward signal from verifier |
| Static vs dynamic | Static — built once, trained on | Static — built once, trained on | Dynamic — prompt bank is static; responses generated live |
| Scale sensitivity | High quality > high volume | Tie rate + length bias are the main hazards | Prompt diversity + verifier quality dominate |
| Key pipeline concern | Loss-mask correctness; multi-turn formatting | Deduplication must keep pairs together; length-bias audit | Prompt bank must be deduped; verifier must be deterministic |
Interactive · record schema explorer
Toggle the regime to see the JSON-ish structure of one training record and which spans are masked for the loss (shown in colour). This is the shape your pipeline must emit into the gold layer.
The dataset is the product
A phrase that recurs in practitioner writing on post-training: "the dataset is the product." It means that for the same model architecture and training recipe, the data is the primary lever on output quality — and that the relationship is not linear with volume.
- Small-and-clean beats large-and-noisy. LIMA (2023) showed that 1 000 carefully curated SFT examples could match or exceed the instruction-following quality of models trained on orders-of-magnitude more data. The bottleneck was not volume; it was the quality distribution of the response side.
- Diversity dominates at the margin. Once you have enough examples of a format or task type, adding more of the same hurts more than it helps (overfitting to the represented modes, under-representing others). The marginal value of a new example is its informational distance from examples already in the set — which is exactly why dedup and diversity sampling (lesson 06, lesson 08) are first-class pipeline stages, not afterthoughts.
- Difficulty is a dial. For RL / RLVR, the expected gradient signal from a prompt is maximised at intermediate difficulty: if the model already gets it right every time the reward is always 1 and there's nothing to learn; if it never gets it right the reward is always 0 and there's still nothing to learn. The pipeline must track and actively curate prompt difficulty — another reason
difficultyis a first-class schema field.