The data, by regime

Before you build a pipeline you need to know exactly what it must produce. Each of the three post-training regimes defines a different unit of data with a different schema, a different loss mask, and different constraints on where it comes from and how it is generated.

Where we are

Lesson 00 (orientation) introduced the three regimes and the medallion shape. This lesson zooms into the record level: what does one training example actually look like for each regime, what fields does it carry, and which tokens are masked for the loss. Everything downstream — ingestion, storage, dedup, tokenization — is shaped by these units.

SFT: the (prompt, response) pair

Supervised fine-tuning teaches the model to imitate a target response. The unit is a (prompt, response) pair. In practice almost nothing is stored as a flat string; instead the record is a list of chat messages, each with a role and content, that a chat template (Jinja2, typically) renders into a single token sequence before training.

  SFT record (multi-turn)
  ┌─────────────────────────────────────────────────┐
  │ messages: [                                      │
  │   { role: "system",    content: "You are …" }   │
  │   { role: "user",      content: "Q1" }          │
  │   { role: "assistant", content: "A1" }  ← LOSS  │
  │   { role: "user",      content: "Q2" }          │
  │   { role: "assistant", content: "A2" }  ← LOSS  │
  │ ]                                                │
  └─────────────────────────────────────────────────┘
  Loss is computed ONLY on assistant tokens.
  System + user tokens are masked out (label = -100).

Two properties dominate SFT data quality:

Turn depth. A single exchange (user → assistant) is a degenerate case. Real instruction data includes multi-turn dialogues, which forces the model to track context. Single-turn data creates single-turn models.
Response quality. The model learns to match the distribution of responses it sees. One bad-faith, over-hedged, or factually wrong response in the training set is worth far less than zero — it actively teaches the wrong behaviour. Correctness and stylistic consistency matter; volume is secondary.

Forward reference: loss masking (lesson 07)

The label = -100 convention signals PyTorch's cross-entropy to skip those positions when accumulating the loss. The pipeline stage that tokenizes the record must emit both input_ids and a matching labels tensor with all non-assistant positions zeroed out. Getting this wrong (masking the wrong span, or masking nothing) is silent — the loss still runs, just on the wrong tokens. Lesson 07 covers this in full; keep it in mind as you design the schema.

Preference: the (prompt, chosen, rejected) triple

Preference learning — DPO, IPO, KTO, reward modelling for RLHF — does not need a single "correct" response. It needs a comparison: given the same prompt, which response is better? The unit is a (prompt, chosen, rejected) triple, where both chosen and rejected are full response texts (or message lists).

  Preference record
  ┌─────────────────────────────────────────────────┐
  │ prompt:   "Explain gradient descent"             │
  │ chosen:   "Gradient descent is an iterative …"  │
  │ rejected: "It's basically just going downhill"   │
  │ source:   "human" | "ai" | "synthetic"           │
  │ score_chosen:   4.2   (optional, continuous)     │
  │ score_rejected: 1.8   (optional)                 │
  └─────────────────────────────────────────────────┘

Three things to know about the source of preference labels:

Human annotation (RLHF). Expensive, slow, hard to scale, but high-signal on subtle quality dimensions (tone, safety, factuality). Requires careful annotation guidelines and inter-annotator agreement checks.
RLAIF (AI feedback). A stronger LLM plays the role of annotator. Faster and cheaper, but biases the data toward the judge model's own stylistic preferences — if the judge likes verbose answers, the dataset will systematically prefer verbose answers.
Length bias. Annotators (human and AI) consistently prefer longer responses, all else equal. This is one of the most well-documented artefacts in preference data: a 500-token response beats a 100-token response on the same question, even when the short one is more accurate. Pipelines should log response lengths and check that the chosen/rejected split is not confounded with length.

Ties and the collapsed pair problem

When a rater cannot distinguish two responses, the pair is a tie. Ties are not neutral — including them as training signal (randomly assigning chosen/rejected) adds label noise. Exclude ties or use a model that handles soft labels. Similarly, if your pipeline deduplicates by prompt and accidentally drops one side of a pair, the remaining record is corrupted and must be removed entirely.

RL / RLVR: the (prompt, verifier) seed

Reinforcement learning over verifiable rewards (RLVR) — the approach behind DeepSeek-R1, GRPO, and similar methods — looks like the simplest dataset: just a prompt and a way to score the model's answer. But it is architecturally the most different from SFT/preference, because the responses do not live in the dataset at all.

  RL / RLVR record (the "seed")
  ┌─────────────────────────────────────────────────┐
  │ prompt:    "If 3x + 7 = 22, what is x?"         │
  │ answer:    "5"          ← gold, for verifier     │
  │ verifier:  "exact_match" | "python_eval" | …    │
  │ domain:    "math/algebra"                        │
  │ difficulty: 0.42                                 │
  └─────────────────────────────────────────────────┘
  Responses are NOT stored here.
  The policy generates them LIVE during training.

At training time the policy samples K responses per prompt, a verifier scores each one, and the scores become rewards that update the policy. The "dataset" is really a prompt bank — a curated collection of seeds that are diverse, appropriately difficult, and unambiguously verifiable. The actual training data is the online rollout stream, not the bank.

Why a static RL dataset doesn't exist in the SFT sense

You cannot pre-generate the RL training responses and store them. The point of RL is that the responses change as the policy improves — on-policy data is required. Serving pre-generated responses would be behavioural cloning back to SFT. The data-engineering implication: the RL pipeline must be a streaming loop, not a batch job. Lesson 10 (the RL online dataplane) covers this architecture in full.

One schema, three projections

All three record types share enough structure that it pays to define a single canonical schema and treat SFT, preference, and RL as projections — subsets of the shared fields. This matters for storage, for quality gates that run across all regimes, and for auditability.

Field	SFT	Preference	RL
`id`	required	required	required
`source`	required	required	required
`messages` / `prompt`	`messages[]`	`prompt` string	`prompt` string
`response`	last assistant turn	—	— (generated live)
`chosen` / `rejected`	—	required	—
`answer` / `verifier`	—	—	required
`lang`	recommended	recommended	recommended
`difficulty`	optional	optional	recommended
`token_count`	recommended	recommended	optional
`license`	required	required	required
`origin`	required	required	required
`created_at`	recommended	recommended	recommended

Storing all records in the same Parquet table (with null values for unused fields per regime) lets you run one dedup pass, one quality gate, and one provenance audit across the whole corpus before splitting into regime-specific gold tables. The regime split is a late, cheap operation; the cleaning is early and shared.

Regime comparison

Property	SFT	Preference	RL / RLVR
Unit	(prompt, response)	(prompt, chosen, rejected)	(prompt, verifier/answer)
Typical source	Human demos, distillation, synthetic	Human or AI pairwise judgments	Curated problem sets (math, code, logic)
Loss masked on	All non-assistant tokens	Implicitly via DPO loss; chosen/rejected responses both used	No static loss — reward signal from verifier
Static vs dynamic	Static — built once, trained on	Static — built once, trained on	Dynamic — prompt bank is static; responses generated live
Scale sensitivity	High quality > high volume	Tie rate + length bias are the main hazards	Prompt diversity + verifier quality dominate
Key pipeline concern	Loss-mask correctness; multi-turn formatting	Deduplication must keep pairs together; length-bias audit	Prompt bank must be deduped; verifier must be deterministic

Interactive · record schema explorer

Toggle the regime to see the JSON-ish structure of one training record and which spans are masked for the loss (shown in colour). This is the shape your pipeline must emit into the gold layer.

The dataset is the product

A phrase that recurs in practitioner writing on post-training: "the dataset is the product." It means that for the same model architecture and training recipe, the data is the primary lever on output quality — and that the relationship is not linear with volume.

Small-and-clean beats large-and-noisy. LIMA (2023) showed that 1 000 carefully curated SFT examples could match or exceed the instruction-following quality of models trained on orders-of-magnitude more data. The bottleneck was not volume; it was the quality distribution of the response side.
Diversity dominates at the margin. Once you have enough examples of a format or task type, adding more of the same hurts more than it helps (overfitting to the represented modes, under-representing others). The marginal value of a new example is its informational distance from examples already in the set — which is exactly why dedup and diversity sampling (lesson 06, lesson 08) are first-class pipeline stages, not afterthoughts.
Difficulty is a dial. For RL / RLVR, the expected gradient signal from a prompt is maximised at intermediate difficulty: if the model already gets it right every time the reward is always 1 and there's nothing to learn; if it never gets it right the reward is always 0 and there's still nothing to learn. The pipeline must track and actively curate prompt difficulty — another reason difficulty is a first-class schema field.

Cross-link: the curation and signal angle

The orthogonal question — which examples to keep to maximise gradient signal — is answered in the companion RL series: data curation and pipelines (lesson 18a). This series answers how to build the pipeline that processes and serves those decisions. Read them together.

Takeaway — what to carry to lesson 02

Three regimes, three record shapes, one canonical schema. SFT and preference are static datasets; RL is a prompt bank fed to a live generation loop. The loss mask is not a training detail — it is a pipeline output that must be correct at the schema level. Lesson 02 introduces the ETL / ELT skeleton that turns raw versions of these records (bronze) into clean, validated, training-ready gold — the same shape for all three regimes.