Tokenization & packing

Lesson 06 gave us a clean, deduplicated silver dataset. This lesson is the gold-layer transform: run a tokenizer over every example, apply loss masks, then pack the resulting token arrays into dense, context-length sequences the trainer consumes with zero wasted compute.

Where we are

Silver is text. Gold is token-id arrays stored as typed integer columns in Parquet (lesson 04's columnar layout), ready for the DataLoader to mmap directly without re-tokenizing at train time. This lesson builds that transformation.

Tokenization as a pipeline stage

Tokenization is a narrow / map operation (lesson 05's taxonomy): each example is converted independently, with no cross-example state. That means it parallelizes perfectly across partitions and workers — double the cluster, halve the time, no shuffle required.

The catch is determinism. A tokenizer is not just a function — it is a specific version of a vocabulary file plus encoding logic. If you pin neither, two runs of the gold layer can produce different token-id arrays from the same text, silently breaking reproducibility (lesson 02). The rule:

Pin the tokenizer artifact by hash, not by name. tokenizer == "Llama-3" is ambiguous; sha256 == a3f… is not.
Record the tokenizer version in the Parquet metadata of every gold file — so any reader can verify what produced the file without reading the pipeline code.
Special tokens (<|im_start|>, <|eot_id|>, etc.) are part of the tokenizer contract. A mismatch here corrupts the chat template silently.
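The first two rules above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed implementation: the function names, the metadata keys, and the stand-in tokenizer file are all hypothetical, and how the returned dictionary is attached to a Parquet file depends on the writer you use.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large tokenizer files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tokenizer(path: Path, expected_sha256: str) -> dict[str, str]:
    """Fail fast on a hash mismatch; return metadata to attach to each gold file."""
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise ValueError(f"tokenizer hash mismatch: expected {expected_sha256}, got {actual}")
    # Hypothetical metadata keys; use whatever naming your pipeline standardizes on.
    return {"tokenizer_file": path.name, "tokenizer_sha256": actual}


# Usage with a stand-in file (a real run would point at the tokenizer artifact).
with tempfile.TemporaryDirectory() as tmp:
    tok_path = Path(tmp) / "tokenizer.json"
    tok_path.write_bytes(b'{"vocab": {"hello": 0}}')
    pinned = sha256_of_file(tok_path)  # computed once, then committed to config
    gold_metadata = verify_tokenizer(tok_path, pinned)
```

Hashing the artifact rather than trusting its name means a silently updated vocabulary file fails the run instead of producing a different gold layer.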

In Parquet the output is an integer array column (token_ids: list<int32>) and a boolean / uint8 column (loss_mask: list<uint8>). Both compress well under Zstd: token ids occupy only a small part of the int32 range and common ids repeat often, and loss masks are long runs of 0s and 1s. Storing them columnar means the DataLoader reads only these two columns (not the raw text) at train time, which is the column-projection benefit from lesson 04.

Loss masking

Each token carries a binary loss mask: 1 means "compute cross-entropy loss here", 0 means "skip this token." The mask is not an afterthought — it determines what the model actually learns.

Regime	What gets mask = 1	What gets mask = 0
SFT	Assistant / response tokens	System prompt, user turn, padding
Preference (DPO)	Response tokens in both chosen and rejected	Shared prompt prefix, padding
Tool-use / agentic	Model-generated text and function calls	Tool observation / environment text the model did not generate

Packing (below) changes the attention mask and position IDs across document boundaries; it does not change these per-document loss-mask rules. A boundary or BOS token that sits at a document seam after packing gets mask = 1 only if it would have had mask = 1 in the original, unpacked example.

For SFT the loss mask is derived from the chat template structure: every token from <|im_start|>system or <|im_start|>user through the corresponding <|im_end|> gets mask = 0, and only the assistant span gets mask = 1. For agentic / tool-use data, observation tokens (text produced by the environment, not by the model) must get mask = 0 too. They are the "steps the model did not take", and training on them as if they were model output corrupts the policy. See the agentic RL lesson for how this interacts with multi-turn rollout structure.

SFT example · token sequence with mask:

  [SYS] You are a coding assistant.  [/SYS] [USER] Fix this bug. [/USER] [ASST] Here is the fix: ... [/ASST]
   ──────────────── mask = 0 ─────────────────────────────────────────   ───── mask = 1 ──────────────────

Agentic example:

  [ASST] <tool_call>search("foo")</tool_call>   [OBS] {"result":"bar"}  [/OBS]  [ASST] The answer is bar.
   ──────────── mask = 1 ─────────────────────    ──────── mask = 0 ────────────   ─────── mask = 1 ──────
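The two diagrams above follow one rule: tokens from roles the model generated get mask = 1, everything else gets 0. A minimal sketch of that rule follows, assuming the example has already been split into (role, text) segments. The role names, `build_loss_mask`, and the word-level `toy_encode` are hypothetical stand-ins; a real pipeline would use the pinned tokenizer and would also have to decide how template delimiter tokens are masked.

```python
from typing import Callable

# Roles whose tokens the model itself generated; everything else gets mask = 0.
TRAINABLE_ROLES = {"assistant"}


def build_loss_mask(
    segments: list[tuple[str, str]],
    encode: Callable[[str], list[int]],
) -> tuple[list[int], list[int]]:
    """Tokenize (role, text) segments and emit a parallel 0/1 loss mask."""
    token_ids: list[int] = []
    loss_mask: list[int] = []
    for role, text in segments:
        ids = encode(text)
        token_ids.extend(ids)
        loss_mask.extend([1 if role in TRAINABLE_ROLES else 0] * len(ids))
    return token_ids, loss_mask


def toy_encode(text: str) -> list[int]:
    """Stand-in tokenizer: one id per whitespace-separated word (NOT a real tokenizer)."""
    return [sum(word.encode()) % 1000 for word in text.split()]


agentic = [
    ("user", "What is foo?"),
    ("assistant", '<tool_call>search("foo")</tool_call>'),
    ("observation", '{"result":"bar"}'),
    ("assistant", "The answer is bar."),
]
ids, mask = build_loss_mask(agentic, toy_encode)
print(mask)  # [0, 0, 0, 1, 0, 1, 1, 1, 1]
```

Building the mask segment by segment, at the same time as tokenizing, keeps the two arrays the same length by construction.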

Preference data tokenizes both branches (chosen, rejected) as separate sequences and applies the same prompt-masking rule to both. A common implementation error is to forget to mask the shared prefix in the rejected branch — the model then receives gradient signal telling it to reduce the log-prob of the prompt itself.
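One way to avoid the rejected-branch error is to build both branches through the same helper, so the prompt mask cannot differ between them. The sketch below assumes the prompt and both responses are already tokenized; the function names and the toy ids are illustrative, not taken from any particular library.

```python
def build_preference_pair(
    prompt_ids: list[int],
    chosen_ids: list[int],
    rejected_ids: list[int],
) -> dict[str, tuple[list[int], list[int]]]:
    """Build (token_ids, loss_mask) for both branches with the prompt masked in each."""

    def one_branch(response_ids: list[int]) -> tuple[list[int], list[int]]:
        token_ids = prompt_ids + response_ids
        loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
        return token_ids, loss_mask

    return {"chosen": one_branch(chosen_ids), "rejected": one_branch(rejected_ids)}


def prefix_is_masked(pair: dict[str, tuple[list[int], list[int]]], prompt_len: int) -> bool:
    """Guard against the common bug: any mask = 1 inside the shared prompt prefix."""
    return all(sum(mask[:prompt_len]) == 0 for _, mask in pair.values())


# Toy ids for illustration only.
pair = build_preference_pair(prompt_ids=[5, 6, 7], chosen_ids=[11, 12], rejected_ids=[21, 22, 23])
assert prefix_is_masked(pair, prompt_len=3)
```

A check like `prefix_is_masked` is cheap enough to run on every pair before it is written to the gold layer.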

Sequence packing: eliminating padding waste

Most post-training examples are short. An SFT example might be 200–800 tokens; a math chain-of-thought might be 2 000. But the model's context window L is 4 096, 8 192, or 131 072 tokens. If you allocate one sequence per example and pad to L, the padding fraction is:

padding waste = 1 − mean_length / L

At L = 8192 and mean length 512, the waste is 1 − 512/8192 ≈ 94% of every sequence. That means about 94% of the GPU's FLOPs and memory bandwidth serve padding tokens, which contribute zero gradient. You are paying full training cost to process padding.

Sequence packing solves this by concatenating multiple examples end-to-end into one length-L sequence:

Padded (one example per sequence, L = 32):

  seq 1: [doc_A ─ 10 tok ─][padding ── 22 tok ────────────────]
  seq 2: [doc_B ── 14 tok ────][padding ──── 18 tok ────────────]
  seq 3: [doc_C ── 8 tok ─][padding ──── 24 tok ─────────────────]

Packed (multiple examples per sequence, L = 32):

  seq 1: [doc_A ─ 10 tok ─][doc_B ── 14 tok ────][doc_D ─ 8 tok ─]
  seq 2: [doc_C ── 8 tok ─][doc_E ── 12 tok ─────][doc_F ─ 7 tok][pad 5]
  seq 3: [doc_G ── 20 tok ─────────][doc_H ─ 12 tok ──────────────]

Packing raises utilization from mean_length / L toward ~100%. Assembling sequences is a bin-packing problem: each example is an item of size len_i, each sequence is a bin of capacity L. First-fit decreasing (FFD) sorts examples longest-first, then greedily assigns each to the first bin that still fits. It achieves near-optimal utilization in O(N log N) time when the open bins are indexed by remaining capacity (a naive scan over bins is quadratic in the worst case) and is the standard approach. Best-fit decreasing gets marginally better fill at higher CPU cost and is rarely worth it at scale.
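A minimal sketch of first-fit decreasing, assuming example lengths are already known and no example exceeds the context length. It scans open bins linearly, which is simple but can be quadratic in the worst case; a production packer would index bins by remaining capacity. The function name and the toy lengths are illustrative.

```python
def first_fit_decreasing(lengths: list[int], capacity: int) -> list[list[int]]:
    """Pack example indices into bins of `capacity` tokens using first-fit decreasing."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: list[list[int]] = []  # example indices per packed sequence
    free: list[int] = []        # remaining capacity per packed sequence
    for i in order:
        n = lengths[i]
        if n > capacity:
            raise ValueError(f"example {i} has {n} tokens, more than capacity {capacity}")
        for b, room in enumerate(free):
            if n <= room:       # first bin that still fits
                bins[b].append(i)
                free[b] -= n
                break
        else:                   # no open bin fits: start a new one
            bins.append([i])
            free.append(capacity - n)
    return bins


lengths = [20, 12, 14, 10, 8, 8, 7, 5]  # toy token counts, 84 tokens in total
bins = first_fit_decreasing(lengths, capacity=32)
utilization = sum(lengths) / (len(bins) * 32)
print(bins)         # [[0, 1], [2, 3, 4], [5, 6, 7]]
print(utilization)  # 0.875
```

FFD is a heuristic, so it can occasionally open one more bin than an optimal packing would; on realistic length distributions with many short examples the gap is usually small.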

The cross-document attention bug

Packing without a matching attention mask silently corrupts training. A standard causal attention mask lets every token attend to all preceding tokens in the sequence. After packing, "preceding tokens" includes the previous document, which is semantically unrelated. A token in document B then attends to document A and is predicted conditioned on A's context. That is wrong: the loss now rewards predictions that depend on whichever unrelated document happened to be packed first, which injects spurious gradient signal.

The fix is a block-diagonal causal attention mask (sometimes called "packed attention" or "document attention"): each token attends only to earlier tokens in its own document. Paired with per-document position IDs that reset to 0 at each document boundary (so positional encodings are locally coherent), this is the correct packing implementation. FlashAttention's variable-length (varlen) interface takes cumulative sequence lengths (cu_seqlens) and applies exactly this per-document masking without materializing a full L × L mask. Without it, packing is not a free lunch; it is a training bug at scale.
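The three pieces of packing metadata can be derived from the per-document lengths of one packed sequence. The sketch below uses plain Python lists and builds a dense mask purely for illustration; the function names are hypothetical, trailing padding is not handled, and a real trainer would pass cumulative lengths to a varlen attention kernel rather than build an L × L mask.

```python
from itertools import accumulate


def packed_layout(doc_lengths: list[int]) -> tuple[list[int], list[int], list[int]]:
    """Derive cu_seqlens, per-token document ids, and resetting position ids."""
    cu_seqlens = [0] + list(accumulate(doc_lengths))
    doc_ids = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    position_ids = [p for n in doc_lengths for p in range(n)]
    return cu_seqlens, doc_ids, position_ids


def block_causal_mask(doc_ids: list[int]) -> list[list[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    n = len(doc_ids)
    return [[doc_ids[q] == doc_ids[k] and k <= q for k in range(n)] for q in range(n)]


# Two documents of 3 and 2 tokens packed into one sequence.
cu_seqlens, doc_ids, position_ids = packed_layout([3, 2])
attn_mask = block_causal_mask(doc_ids)
print(cu_seqlens)    # [0, 3, 5]
print(position_ids)  # [0, 1, 2, 0, 1]
print(attn_mask[3])  # [False, False, False, True, False]
```

The last printed row is the first token of the second document: it sees only itself, not the three tokens of the document packed before it.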

For RL, packing shows up again in rollout batches: variable-length trajectories are packed into fixed-size batches for the value / advantage computation. The same correctness constraints apply — see RL lesson 22b on long-tail rollout handling.

The throughput math

Define:

N = number of examples in the dataset
L = context length (sequence length)
μ = mean example length in tokens

Without packing, the number of training sequences is N (one per example), total tokens processed = N × L, of which N × (L − μ) are padding. Padding efficiency = μ / L.

With packing, the number of sequences shrinks to roughly N × μ / L (approximately — bin-packing incurs small overhead). Total real tokens processed ≈ N × μ. The speedup from packing is approximately L / μ — the inverse of the padding efficiency. At L = 8192 and μ = 512 that is a 16× reduction in sequences and thus a ~16× throughput gain for identical gradient signal.
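The arithmetic above can be checked with a few lines. This is a back-of-the-envelope estimate under the same idealization as the text: it treats the packed sequence count as the lower bound N × μ / L and ignores bin-packing overhead. The function name and the dataset size of one million examples are hypothetical.

```python
import math


def packing_estimate(n_examples: int, context_len: int, mean_len: float) -> dict[str, float]:
    """Closed-form padded vs packed comparison, ignoring bin-packing overhead."""
    padded_sequences = n_examples  # one sequence per example
    packed_sequences = math.ceil(n_examples * mean_len / context_len)  # lower bound
    return {
        "padding_waste": 1 - mean_len / context_len,
        "padded_sequences": padded_sequences,
        "packed_sequences": packed_sequences,
        "speedup": padded_sequences / packed_sequences,
    }


est = packing_estimate(n_examples=1_000_000, context_len=8192, mean_len=512)
print(est["padding_waste"])     # 0.9375
print(est["packed_sequences"])  # 62500
print(est["speedup"])           # 16.0
```

Real packers land slightly below this bound, because the last bin and any poorly fitting examples leave some slack.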

Interactive · packing-efficiency simulator

Set the context length, mean example length, and variance. Compare padded (one example per sequence) vs packed (bin-packed to fill the context window). Observe the padding waste, effective tokens, and the throughput multiplier.

Takeaway

What to carry to lesson 08

The gold layer is not just "run a tokenizer." It is: pin the tokenizer version for determinism, apply correct loss masks per regime (SFT masks prompt; tool-use masks observations; preference masks shared prefix in both branches), and pack with a block-diagonal attention mask or you silently corrupt training. The throughput gain from packing can be 10–30×, which is why every serious post-training pipeline does it. Lesson 08 asks the next question: before these gold batches reach the trainer, how do you validate that the token arrays, masks, and lengths are actually correct?