The data, by regime

Before you build a pipeline you need to know exactly what it must produce. Each of the three post-training regimes defines a different unit of data with a different schema, a different loss mask, and different constraints on where it comes from and how it is generated.

Where we are
Lesson 00 (orientation) introduced the three regimes and the medallion shape. This lesson zooms into the record level: what does one training example actually look like for each regime, what fields does it carry, and which tokens are masked for the loss. Everything downstream — ingestion, storage, dedup, tokenization — is shaped by these units.

SFT: the (prompt, response) pair

Supervised fine-tuning teaches the model to imitate a target response. The unit is a (prompt, response) pair. In practice almost nothing is stored as a flat string; instead the record is a list of chat messages, each with a role and content, that a chat template (Jinja2, typically) renders into a single token sequence before training.

  SFT record (multi-turn)
  ┌─────────────────────────────────────────────────┐
  │ messages: [                                      │
  │   { role: "system",    content: "You are …" }   │
  │   { role: "user",      content: "Q1" }          │
  │   { role: "assistant", content: "A1" }  ← LOSS  │
  │   { role: "user",      content: "Q2" }          │
  │   { role: "assistant", content: "A2" }  ← LOSS  │
  │ ]                                                │
  └─────────────────────────────────────────────────┘
  Loss is computed ONLY on assistant tokens.
  System + user tokens are masked out (label = -100).

Two properties dominate SFT data quality:

Forward reference: loss masking (lesson 07)
The label = -100 convention works because -100 is the default ignore_index of PyTorch's cross-entropy loss, so those positions are skipped when the loss is accumulated. The pipeline stage that tokenizes the record must emit both input_ids and a matching labels tensor with every non-assistant position set to -100 (not zeroed: 0 is a valid token id and would be trained on). Getting this wrong (masking the wrong span, or masking nothing) is silent: the loss still runs, just on the wrong tokens. Lesson 07 covers this in full; keep it in mind as you design the schema.
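As a minimal sketch of what that tokenization stage has to emit, the snippet below builds input_ids and labels turn by turn and adds a cheap gate for the two silent failure modes. The tokenizer (toy_tokenize), the role-header template, and the function names are hypothetical stand-ins for illustration; a real pipeline would use the model's own tokenizer and chat template.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_tokenize(text: str) -> list[int]:
    """Hypothetical stand-in tokenizer: one non-negative id per whitespace-separated word."""
    return [sum(word.encode("utf-8")) % 50_000 for word in text.split()]


def build_example(messages: list[dict]) -> dict:
    """Tokenize turn by turn and keep labels only for assistant content."""
    input_ids: list[int] = []
    labels: list[int] = []
    for msg in messages:
        # Hypothetical template: a role header token, then the content tokens.
        header = toy_tokenize(f"<|{msg['role']}|>")
        body = toy_tokenize(msg["content"])
        input_ids += header + body
        if msg["role"] == "assistant":
            # The header is masked; the assistant's own tokens are supervised.
            labels += [IGNORE_INDEX] * len(header) + body
        else:
            # System and user turns are fully masked.
            labels += [IGNORE_INDEX] * (len(header) + len(body))
    assert len(input_ids) == len(labels)
    return {"input_ids": input_ids, "labels": labels}


def check_mask(example: dict) -> None:
    """Pipeline gate: fail loudly on the two silent masking failures."""
    labels = example["labels"]
    supervised = sum(label != IGNORE_INDEX for label in labels)
    if supervised == 0:
        raise ValueError("everything is masked: no assistant tokens to learn from")
    if supervised == len(labels):
        raise ValueError("nothing is masked: prompt tokens would be trained on")
```

Running check_mask on every record before it reaches the gold layer turns a silent training problem into a loud pipeline failure. It does not prove the mask covers the right span, only that it is neither empty nor total.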

Preference: the (prompt, chosen, rejected) triple

Preference learning (DPO, IPO, reward modelling for RLHF) does not need a single "correct" response. It needs a comparison: given the same prompt, which response is better? The unit is a (prompt, chosen, rejected) triple, where both chosen and rejected are full response texts (or message lists). KTO is a partial exception: it trains on unpaired responses labelled desirable or undesirable, so a triple can be split into two such records.

  Preference record
  ┌─────────────────────────────────────────────────┐
  │ prompt:   "Explain gradient descent"             │
  │ chosen:   "Gradient descent is an iterative …"  │
  │ rejected: "It's basically just going downhill"   │
  │ source:   "human" | "ai" | "synthetic"           │
  │ score_chosen:   4.2   (optional, continuous)     │
  │ score_rejected: 1.8   (optional)                 │
  └─────────────────────────────────────────────────┘

Three things to know about the source of preference labels:

Ties and the collapsed pair problem
When a rater cannot distinguish two responses, the pair is a tie. Ties are not neutral: forcing them into the training set by randomly assigning chosen and rejected adds label noise. Exclude ties, or use a loss that accepts soft labels. Similarly, if your pipeline deduplicates by prompt and accidentally drops one side of a pair, the remaining record is corrupted and must be removed entirely.
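A filter for ties and collapsed pairs might look like the sketch below. The field names follow the preference record shown earlier; the PreferencePair class, the is_usable and clean_pairs helpers, and the min_margin threshold are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: Optional[str]
    rejected: Optional[str]
    score_chosen: Optional[float] = None
    score_rejected: Optional[float] = None


def is_usable(pair: PreferencePair, min_margin: float = 0.0) -> bool:
    """Return False for collapsed pairs and for ties."""
    # Collapsed pair: one side was dropped upstream or is empty.
    if not pair.chosen or not pair.rejected:
        return False
    # Identical responses carry no preference signal.
    if pair.chosen.strip() == pair.rejected.strip():
        return False
    # Score tie (or inverted scores), checked only when both scores exist.
    if pair.score_chosen is not None and pair.score_rejected is not None:
        if pair.score_chosen - pair.score_rejected <= min_margin:
            return False
    return True


def clean_pairs(pairs: list[PreferencePair]) -> list[PreferencePair]:
    """Keep only pairs that carry a usable comparison."""
    return [p for p in pairs if is_usable(p)]
```

Raising min_margin above zero also drops near-ties, which trades volume for cleaner labels. Whether that trade is worth it depends on how noisy the raters are.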

RL / RLVR: the (prompt, verifier) seed

Reinforcement learning with verifiable rewards (RLVR), the approach used to train DeepSeek-R1 with policy-gradient algorithms such as GRPO, looks like the simplest dataset: just a prompt and a way to score the model's answer. But it is architecturally the most different from SFT and preference data, because the responses do not live in the dataset at all.

  RL / RLVR record (the "seed")
  ┌─────────────────────────────────────────────────┐
  │ prompt:    "If 3x + 7 = 22, what is x?"         │
  │ answer:    "5"          ← gold, for verifier     │
  │ verifier:  "exact_match" | "python_eval" | …    │
  │ domain:    "math/algebra"                        │
  │ difficulty: 0.42                                 │
  └─────────────────────────────────────────────────┘
  Responses are NOT stored here.
  The policy generates them LIVE during training.

At training time the policy samples K responses per prompt, a verifier scores each one, and the scores become rewards that update the policy. The "dataset" is really a prompt bank — a curated collection of seeds that are diverse, appropriately difficult, and unambiguously verifiable. The actual training data is the online rollout stream, not the bank.
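The sample-and-score step can be sketched as follows. The seed fields match the RL record above; the VERIFIERS registry, the toy_policy stand-in, and the rollout function are hypothetical, and a real system would sample from the model rather than from a fixed list.

```python
import random
from typing import Callable


def exact_match(response: str, answer: str) -> float:
    """Deterministic verifier: 1.0 if the response equals the gold answer."""
    return 1.0 if response.strip() == answer.strip() else 0.0


# Hypothetical registry keyed by the seed's `verifier` field.
VERIFIERS: dict[str, Callable[[str, str], float]] = {"exact_match": exact_match}


def toy_policy(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for sampling one response from the model."""
    return rng.choice(["5", "4", "15"])


def rollout(seed: dict, k: int, rng: random.Random) -> list[dict]:
    """Sample k responses for one seed and attach a verifier reward to each."""
    verify = VERIFIERS[seed["verifier"]]
    results = []
    for _ in range(k):
        response = toy_policy(seed["prompt"], rng)
        results.append({
            "prompt": seed["prompt"],
            "response": response,
            "reward": verify(response, seed["answer"]),
        })
    return results


seed = {
    "prompt": "If 3x + 7 = 22, what is x?",
    "answer": "5",
    "verifier": "exact_match",
}
rollouts = rollout(seed, k=8, rng=random.Random(0))
```

Note that the stored seed never changes; only the rollouts list does, and it is rebuilt every time the policy is sampled.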

Why a static RL dataset doesn't exist in the SFT sense
You cannot pre-generate the RL training responses and store them. The point of RL is that the responses change as the policy improves, so on-policy data is required. Training on pre-generated responses would collapse back into behavioural cloning, which is SFT. The data-engineering implication: the RL pipeline must be a streaming loop, not a batch job. Lesson 10 (the RL online dataplane) covers this architecture in full.
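To make the streaming shape concrete, here is a toy loop in which each batch is generated lazily by whatever version of the policy exists at that moment. ToyPolicy, rollout_stream, and train are hypothetical names, and the "update" is just a version counter standing in for a gradient step.

```python
from typing import Iterator


class ToyPolicy:
    """Hypothetical stand-in for a model whose weights change between steps."""

    def __init__(self) -> None:
        self.version = 0

    def generate(self, prompt: str) -> str:
        return f"response from policy v{self.version} to: {prompt}"

    def update(self, batch: list[dict]) -> None:
        # A real trainer would take a gradient step on the rewards here.
        self.version += 1


def rollout_stream(
    prompt_bank: list[str], policy: ToyPolicy, batch_size: int
) -> Iterator[list[dict]]:
    """Yield batches lazily, so each one comes from the current policy."""
    i = 0
    while True:
        prompts = [prompt_bank[(i + j) % len(prompt_bank)] for j in range(batch_size)]
        i += batch_size
        yield [
            {
                "prompt": p,
                "response": policy.generate(p),
                "policy_version": policy.version,
            }
            for p in prompts
        ]


def train(prompt_bank: list[str], steps: int, batch_size: int = 2) -> list[int]:
    """Run a few steps and record which policy version produced each batch."""
    policy = ToyPolicy()
    stream = rollout_stream(prompt_bank, policy, batch_size)
    versions = []
    for _ in range(steps):
        batch = next(stream)  # generated now, not ahead of time
        versions.append(batch[0]["policy_version"])
        policy.update(batch)
    return versions
```

Tagging every rollout with policy_version is a design choice worth keeping in a real system: it lets the trainer detect and discard batches that were generated by a stale policy.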

One schema, three projections

All three record types share enough structure that it pays to define a single canonical schema and treat SFT, preference, and RL as projections — subsets of the shared fields. This matters for storage, for quality gates that run across all regimes, and for auditability.

  Field              SFT                  Preference     RL
  ─────────────────  ───────────────────  ─────────────  ──────────────────
  id                 required             required       required
  source             required             required       required
  messages / prompt  messages[]           prompt string  prompt string
  response           last assistant turn  -              - (generated live)
  chosen / rejected  -                    required       -
  answer / verifier  -                    -              required
  lang               recommended          recommended    recommended
  difficulty         optional             optional       recommended
  token_count        recommended          recommended    optional
  license            required             required       required
  origin             required             required       required
  created_at         recommended          recommended    recommended

Storing all records in the same Parquet table (with null values for unused fields per regime) lets you run one dedup pass, one quality gate, and one provenance audit across the whole corpus before splitting into regime-specific gold tables. The regime split is a late, cheap operation; the cleaning is early and shared.
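One way to picture the late regime split is a single record type plus a projection function, sketched below with plain dataclasses rather than Parquet. The regime discriminator column, the CanonicalRecord class, and the project helper are assumptions made for this example; only a subset of the shared fields is shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CanonicalRecord:
    # Shared, required fields for every regime.
    id: str
    source: str
    license: str
    origin: str
    # Hypothetical discriminator column: "sft" | "preference" | "rl".
    regime: str
    # Regime-specific fields, left as None (null) when unused.
    messages: Optional[list] = None
    prompt: Optional[str] = None
    chosen: Optional[str] = None
    rejected: Optional[str] = None
    answer: Optional[str] = None
    verifier: Optional[str] = None


SHARED = ("id", "source", "license", "origin")
REQUIRED_BY_REGIME = {
    "sft": ("messages",),
    "preference": ("prompt", "chosen", "rejected"),
    "rl": ("prompt", "answer", "verifier"),
}


def project(rec: CanonicalRecord) -> dict:
    """Project a canonical record onto its regime and check required fields."""
    fields = SHARED + REQUIRED_BY_REGIME[rec.regime]
    row = {name: getattr(rec, name) for name in fields}
    missing = [name for name, value in row.items() if value is None]
    if missing:
        raise ValueError(f"record {rec.id} is missing {missing} for regime {rec.regime}")
    return row
```

Because project only selects and validates, it can run after dedup, quality gates, and the provenance audit have all been applied to the shared table.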

Regime comparison

Unit
  SFT:         (prompt, response)
  Preference:  (prompt, chosen, rejected)
  RL / RLVR:   (prompt, verifier/answer)

Typical source
  SFT:         Human demos, distillation, synthetic
  Preference:  Human or AI pairwise judgments
  RL / RLVR:   Curated problem sets (math, code, logic)

Loss masked on
  SFT:         All non-assistant tokens
  Preference:  Implicitly via DPO loss; chosen/rejected responses both used
  RL / RLVR:   No static loss; reward signal comes from the verifier

Static vs dynamic
  SFT:         Static: built once, trained on
  Preference:  Static: built once, trained on
  RL / RLVR:   Dynamic: prompt bank is static; responses generated live

Scale sensitivity
  SFT:         High quality > high volume
  Preference:  Tie rate + length bias are the main hazards
  RL / RLVR:   Prompt diversity + verifier quality dominate

Key pipeline concern
  SFT:         Loss-mask correctness; multi-turn formatting
  Preference:  Deduplication must keep pairs together; length-bias audit
  RL / RLVR:   Prompt bank must be deduped; verifier must be deterministic

The dataset is the product

A phrase that recurs in practitioner writing on post-training: "the dataset is the product." It means that for the same model architecture and training recipe, the data is the primary lever on output quality — and that the relationship is not linear with volume.

Cross-link: the curation and signal angle
The orthogonal question — which examples to keep to maximise gradient signal — is answered in the companion RL series: data curation and pipelines (lesson 18a). This series answers how to build the pipeline that processes and serves those decisions. Read them together.

Takeaway: what to carry to lesson 02
Three regimes, three record shapes, one canonical schema. SFT and preference are static datasets; RL is a prompt bank fed to a live generation loop. The loss mask is not a training detail — it is a pipeline output that must be correct at the schema level. Lesson 02 introduces the ETL / ELT skeleton that turns raw versions of these records (bronze) into clean, validated, training-ready gold — the same shape for all three regimes.