Tokenization & packing
Lesson 06 gave us a clean, deduplicated silver dataset. This lesson is the gold-layer transform: run a tokenizer over every example, apply loss masks, then pack the resulting token arrays into dense, context-length sequences the trainer consumes with zero wasted compute.
Tokenization as a pipeline stage
Tokenization is a narrow / map operation (lesson 05's taxonomy): each example is converted independently, with no cross-example state. That means it parallelizes perfectly across partitions and workers — double the cluster, halve the time, no shuffle required.
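As a minimal sketch of why the map shape matters, the snippet below tokenizes partitions in parallel and checks that the result matches a serial run. The byte-level `toy_tokenize` is a stand-in assumption, not a real tokenizer, and the function names and thread pool are illustrative only; a real pipeline would use its own engine's partitioned map.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_tokenize(text: str) -> list[int]:
    """Stand-in tokenizer (assumption): one token id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def tokenize_partition(partition: list[str]) -> list[list[int]]:
    """Map step: each example is tokenized on its own, with no shared state."""
    return [toy_tokenize(text) for text in partition]


def tokenize_dataset(partitions: list[list[str]], workers: int = 4) -> list[list[list[int]]]:
    """Run the map over partitions in parallel; output order matches input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tokenize_partition, partitions))


partitions = [["fix this bug", "hello"], ["the answer is bar"]]
parallel = tokenize_dataset(partitions)
serial = [tokenize_partition(p) for p in partitions]
# No cross-example state, so parallel and serial runs agree exactly.
assert parallel == serial
```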
The catch is determinism. A tokenizer is not just a function — it is a specific version of a vocabulary file plus encoding logic. If you pin neither, two runs of the gold layer can produce different token-id arrays from the same text, silently breaking reproducibility (lesson 02). The rule:
- Pin the tokenizer artifact by hash, not by name. tokenizer == "Llama-3" is ambiguous; sha256 == a3f… is not.
- Record the tokenizer version in the Parquet metadata of every gold file, so any reader can verify what produced the file without reading the pipeline code.
- Special tokens (<|im_start|>, <|eot_id|>, etc.) are part of the tokenizer contract. A mismatch here corrupts the chat template silently.
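One way to act on these rules, sketched with the standard library only: hash the tokenizer artifact, build a small metadata record, and verify it before reading. The key names (`tokenizer_sha256`, `special_tokens`) and the helper names are hypothetical; in a real pipeline this dictionary would be attached as file-level key-value metadata by whatever Parquet writer is in use.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gold_file_metadata(tokenizer_path: Path, special_tokens: list[str]) -> dict[str, str]:
    """Hypothetical metadata record to store alongside every gold file."""
    return {
        "tokenizer_sha256": sha256_of_file(tokenizer_path),
        "special_tokens": json.dumps(special_tokens),
    }


def verify_tokenizer(metadata: dict[str, str], tokenizer_path: Path) -> bool:
    """True only if the local artifact is byte-identical to the one recorded."""
    return metadata["tokenizer_sha256"] == sha256_of_file(tokenizer_path)


# Demo with a throwaway file standing in for the tokenizer artifact.
with tempfile.TemporaryDirectory() as tmp:
    tok_path = Path(tmp) / "tokenizer.json"
    tok_path.write_bytes(b'{"vocab": {"a": 0}}')
    meta = gold_file_metadata(tok_path, ["<|im_start|>", "<|eot_id|>"])
    ok = verify_tokenizer(meta, tok_path)
    tok_path.write_bytes(b'{"vocab": {"a": 1}}')  # silent vocabulary change
    changed_ok = verify_tokenizer(meta, tok_path)
```

The point of the design is that a renamed or silently updated artifact fails verification even when its name is unchanged.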
In Parquet the output is an integer array column (token_ids: list<int32>) and a boolean / uint8 column (loss_mask: list<uint8>). Both compress very well under Zstd — integer sequences have low entropy and high repetition. Storing them columnar means the DataLoader reads only these two columns (not the raw text) at train time, which is the column-projection benefit from lesson 04.
Loss masking
Each token carries a binary loss mask: 1 means "compute cross-entropy loss here", 0 means "skip this token." The mask is not an afterthought — it determines what the model actually learns.
| Regime | What gets mask = 1 | What gets mask = 0 |
|---|---|---|
| SFT | Assistant / response tokens | System prompt, user turn, padding |
| Preference (DPO) | Response tokens in both chosen and rejected | Shared prompt prefix, padding |
| Tool-use / agentic | Model-generated text and function calls | Tool observation / environment text the model did not generate |
Packing (below) changes the attention mask and position IDs across document boundaries — it does not change these per-document loss-mask rules. Boundary / BOS tokens introduced by packing remain unmasked only if they would have been unmasked in the original example.
For SFT the loss mask is derived from the chat template structure: everything between <|im_start|>system / <|im_start|>user and the corresponding <|im_end|> is masked; only the assistant span is unmasked. For agentic / tool-use data, observation tokens — text produced by the environment, not by the model — must be masked too. Those are the "steps the model did not take" and training on them as if they were model output corrupts the policy. See the agentic RL lesson for how this interacts with multi-turn rollout structure.
SFT example · token sequence with mask:
[SYS] You are a coding assistant. [/SYS] [USER] Fix this bug. [/USER] [ASST] Here is the fix: ... [/ASST]
──────────────── mask = 0 ───────────────────────────────────────── ───── mask = 1 ──────────────────
Agentic example:
[ASST] <tool_call>search("foo")</tool_call> [OBS] {"result":"bar"} [/OBS] [ASST] The answer is bar.
──────────── mask = 1 ───────────────────── ──────── mask = 0 ──────────── ─────── mask = 1 ──────
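The two diagrams above can be reproduced with a small helper. This is a sketch under the assumption that each example has already been split into role-tagged token spans; the `Segment` type, the role names, and the token ids are illustrative, and a real implementation would derive spans from the chat template.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    role: str             # "system", "user", "assistant", or "observation"
    token_ids: list[int]


# Only tokens the model itself generated contribute to the loss.
TRAINABLE_ROLES = {"assistant"}


def build_loss_mask(segments: list[Segment]) -> tuple[list[int], list[int]]:
    """Flatten segments into token ids plus a parallel 0/1 loss mask."""
    token_ids: list[int] = []
    loss_mask: list[int] = []
    for seg in segments:
        flag = 1 if seg.role in TRAINABLE_ROLES else 0
        token_ids.extend(seg.token_ids)
        loss_mask.extend([flag] * len(seg.token_ids))
    return token_ids, loss_mask


# Agentic example: tool call, environment observation, final answer.
agentic = [
    Segment("assistant", [11, 12, 13]),
    Segment("observation", [21, 22]),
    Segment("assistant", [31, 32]),
]
ids, mask = build_loss_mask(agentic)
assert mask == [1, 1, 1, 0, 0, 1, 1]
```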
Preference data tokenizes both branches (chosen, rejected) as separate sequences and applies the same prompt-masking rule to both. A common implementation error is to forget to mask the shared prefix in the rejected branch — the model then receives gradient signal telling it to reduce the log-prob of the prompt itself.
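A minimal sketch of the preference case, assuming the prompt and both responses are already tokenized. Building both branches through the same helper makes it hard to forget the prefix mask on the rejected side; the function and field names are hypothetical.

```python
def mask_preference_pair(
    prompt_ids: list[int],
    chosen_ids: list[int],
    rejected_ids: list[int],
) -> tuple[dict[str, list[int]], dict[str, list[int]]]:
    """Return (chosen, rejected) sequences with the shared prompt masked in both."""

    def build(response_ids: list[int]) -> dict[str, list[int]]:
        return {
            "token_ids": prompt_ids + response_ids,
            # Prompt tokens get 0, response tokens get 1, in both branches.
            "loss_mask": [0] * len(prompt_ids) + [1] * len(response_ids),
        }

    return build(chosen_ids), build(rejected_ids)


chosen, rejected = mask_preference_pair([1, 2, 3], [10, 11], [20, 21, 22])
# The shared prefix carries no loss in either branch.
assert chosen["loss_mask"][:3] == rejected["loss_mask"][:3] == [0, 0, 0]
```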
Sequence packing: eliminating padding waste
Most post-training examples are short. An SFT example might be 200–800 tokens; a math chain-of-thought might be 2 000. But the model's context window L is 4 096, 8 192, or 131 072 tokens. If you allocate one sequence per example and pad to L, the padding fraction is:
padding waste = 1 − mean_length / L
At L = 8192 and mean length 512, you waste about 94% of every sequence (1 − 512 / 8192 = 93.75%). That means roughly 94% of the GPU's attention FLOPs and memory bandwidth serve padding tokens, which contribute zero gradient. You are paying full training cost to process padding.
Sequence packing solves this by concatenating multiple examples end-to-end into one length-L sequence:
Padded (one example per sequence, L = 32):
seq 1: [doc_A ─ 10 tok ─][padding ── 22 tok ────────────────]
seq 2: [doc_B ── 14 tok ────][padding ──── 18 tok ────────────]
seq 3: [doc_C ── 8 tok ─][padding ──── 24 tok ─────────────────]
Packed (multiple examples per sequence, L = 32):
seq 1: [doc_A ─ 10 tok ─][doc_B ── 14 tok ────][doc_D ─ 8 tok ─]
seq 2: [doc_C ── 8 tok ─][doc_E ── 12 tok ─────][doc_F ─ 7 tok][pad 5]
seq 3: [doc_G ── 20 tok ─────────][doc_H ─ 12 tok ──────────────]
Packing raises utilization from mean_length / L toward ~100%. Assembling sequences is a bin-packing problem: each example is an item of size len_i, each sequence is a bin of capacity L. First-fit decreasing (FFD) sorts examples longest-first and greedily assigns each to the first bin that still fits. It is a heuristic, not an exact solver, but on large datasets of mostly short examples it gets close to optimal utilization, and with an index over open bins it runs in O(n log n) time, dominated by the sort. It is the standard approach. Best-fit decreasing gets marginally better fill at higher CPU cost and is rarely worth it at scale.
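First-fit decreasing can be sketched in a few lines. This version is an illustration, not a production packer: it scans open bins linearly, which is quadratic in the worst case, whereas an index over remaining capacity would bring it to O(n log n). The function name and the choice to reject over-length examples rather than truncate them are assumptions.

```python
def first_fit_decreasing(lengths: list[int], capacity: int) -> list[list[int]]:
    """Pack examples into bins; returns bins as lists of example indices."""
    # Visit examples longest-first.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: list[list[int]] = []
    free: list[int] = []  # remaining capacity of each open bin
    for i in order:
        n = lengths[i]
        if n > capacity:
            raise ValueError(f"example {i} has {n} tokens, exceeds capacity {capacity}")
        for b, room in enumerate(free):
            if n <= room:  # first bin that still fits
                bins[b].append(i)
                free[b] -= n
                break
        else:  # no open bin had room: open a new one
            bins.append([i])
            free.append(capacity - n)
    return bins


# Lengths of doc_A..doc_H from the diagram above, with L = 32.
doc_lengths = [10, 14, 8, 8, 12, 7, 20, 12]
packed = first_fit_decreasing(doc_lengths, capacity=32)
utilization = sum(doc_lengths) / (len(packed) * 32)
```

On these eight toy lengths the greedy rule opens four bins, one more than the hand-packed diagram, which is a reminder that FFD is a heuristic. The gap matters far less when thousands of short examples are available to fill leftover space.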
Packing introduces a correctness hazard: cross-document attention. With a plain causal mask over the packed sequence, token i in document B attends to and is predicted given document A's context, which is wrong: it inflates log-probabilities for tokens that happen to follow a specific unrelated document, and injects spurious gradient signal. The fix is a block-diagonal attention mask (sometimes called "packed attention" or "document attention"): each document attends only to itself. Paired with per-document position IDs that reset to 0 at each document boundary (so positional encodings are locally coherent), this is the correct packing implementation. Flash Attention's varlen interface (cu_seqlens) implements this exactly. Without it, packing is not a free lunch; it is a training bug at scale.
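To make the three pieces concrete, here is a pure-Python sketch that derives cumulative sequence offsets, per-document position IDs, and a block-diagonal causal mask from the document lengths in one packed sequence. The helper names are hypothetical. A real trainer would build tensors and pass the offsets to its attention kernel rather than materialize a dense mask.

```python
from itertools import accumulate


def cu_seqlens(doc_lens: list[int]) -> list[int]:
    """Cumulative offsets: document d spans [cu[d], cu[d + 1])."""
    return [0] + list(accumulate(doc_lens))


def position_ids(doc_lens: list[int]) -> list[int]:
    """Positions restart at 0 at every document boundary."""
    return [p for n in doc_lens for p in range(n)]


def block_causal_mask(doc_lens: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j:
    same document, and j is not in the future."""
    doc_of = [d for d, n in enumerate(doc_lens) for _ in range(n)]
    total = len(doc_of)
    return [
        [doc_of[i] == doc_of[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


doc_lens = [3, 2]  # two documents packed into one 5-token sequence
offsets = cu_seqlens(doc_lens)
positions = position_ids(doc_lens)
mask = block_causal_mask(doc_lens)
# The first token of the second document sees only itself.
assert mask[3] == [False, False, False, True, False]
```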
For RL, packing shows up again in rollout batches: variable-length trajectories are packed into fixed-size batches for the value / advantage computation. The same correctness constraints apply — see RL lesson 22b on long-tail rollout handling.
The throughput math
Define:
- N = number of examples in the dataset
- L = context length (sequence length)
- μ = mean example length in tokens
Without packing, the number of training sequences is N (one per example), total tokens processed = N × L, of which N × (L − μ) are padding. Padding efficiency = μ / L.
With packing, the number of sequences shrinks to roughly N × μ / L (approximately — bin-packing incurs small overhead). Total real tokens processed ≈ N × μ. The speedup from packing is approximately L / μ — the inverse of the padding efficiency. At L = 8192 and μ = 512 that is a 16× reduction in sequences and thus a ~16× throughput gain for identical gradient signal.
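The arithmetic above can be checked with a short helper. It uses the idealized formulas from this section, so the packed sequence count is a lower bound and the speedup an upper bound; the dataset size of one million examples is an arbitrary illustration, and the function and key names are hypothetical.

```python
import math


def packing_stats(n_examples: int, context_len: int, mean_len: float) -> dict[str, float]:
    """Idealized padded-vs-packed comparison using N, L, and mu as defined above."""
    return {
        "padding_efficiency": mean_len / context_len,       # mu / L
        "padding_waste": 1 - mean_len / context_len,        # 1 - mu / L
        "padded_sequences": n_examples,                     # one per example
        # Perfect packing; real bin-packing needs slightly more sequences.
        "packed_sequences_lower_bound": math.ceil(n_examples * mean_len / context_len),
        "speedup_upper_bound": context_len / mean_len,      # L / mu
    }


stats = packing_stats(n_examples=1_000_000, context_len=8192, mean_len=512)
assert stats["speedup_upper_bound"] == 16.0
assert stats["padding_waste"] == 0.9375
```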
Interactive · packing-efficiency simulator
Set the context length, mean example length, and variance. Compare padded (one example per sequence) vs packed (bin-packed to fill the context window). Observe the padding waste, effective tokens, and the throughput multiplier.