Tokenization — BPE, WordPiece, SentencePiece, byte-level
The interface between text and the model. A bad tokenizer caps every downstream metric — context length, multilingual fairness, RL credit assignment, math accuracy. Most "weird" LLM behaviour (counting r's in "strawberry", failing arithmetic) is a tokenizer artifact.
Why tokenize at all — the three options nobody picks
| Unit | Vocabulary size | Sequence length per word | Why nobody uses it |
|---|---|---|---|
| Characters | ~100 (ASCII) | 5–10× longer | Sequence length blows up. Self-attention is O(N²) — 5× more tokens is 25× more attention. Plus character-level models train slower because they have to re-learn morphology from scratch. |
| Whole words | ~1M (unbounded for inflected languages) | ~1 per word | OOV (out-of-vocabulary) words are unrepresentable. "covid" didn't exist in 2019 vocabs. Plus the embedding table is huge. |
| Bytes (raw UTF-8) | 256 | 4–6× longer (UTF-8 multibyte) | Same length problem as chars. Used by ByT5; works but expensive. |
Subword tokenization is the compromise: a vocabulary of 32k–200k tokens, each a string of 1–10 characters. Common words are one token; rare words split into multiple. Three near-equivalent algorithms produce this vocabulary:
BPE — Byte Pair Encoding, the greedy merge algorithm
Sennrich et al. 2016, adapted from a compression algorithm. The training procedure:
- Start with characters. Split every word in the training corpus into characters. Add a special end-of-word marker (some BPE variants).
- Count all adjacent pairs. "the cat" → ("t", "h"), ("h", "e"), ("c", "a"), ("a", "t"). Count across the corpus.
- Merge the most common pair into a new token. Add to vocab.
- Repeat until vocab reaches target size (e.g. 50k).
Inference: greedy left-to-right merge using the trained merge table. "tokenization" might split as "token" + "ization" if those merges existed at training.
GPT-2's tokenizer is a "byte-level BPE": operates on UTF-8 bytes, not characters. This makes it lossless (any byte sequence is representable) without needing an explicit "unknown" token.
WordPiece — slightly different objective
Used by BERT and most encoder models. Same idea (subword merges) but the merge criterion is different:
| Algorithm | Merge criterion |
|---|---|
| BPE | Most frequent adjacent pair. |
| WordPiece | Pair that maximises the likelihood of the corpus under a unigram language model. Roughly: freq(AB) / (freq(A) · freq(B)) — gives a "PMI"-like score. |
Practically near-identical results. WordPiece's PMI flavour gives slightly more semantic subwords (it favours pairs where the two pieces co-occur more than independently). Vocab sizes 30k–50k.
SentencePiece — the framework, not the algorithm
SentencePiece is a tokenizer library, not a different algorithm. It can train BPE or Unigram tokenizers (Kudo 2018 Unigram is the third subword algorithm: optimises a unigram LM with EM, prunes the vocab to target size). Key differences from earlier BPE implementations:
- Operates on raw text, no pre-tokenisation. No "split on whitespace first" step. Handles languages without spaces (Chinese, Japanese, Thai) cleanly.
- Spaces are part of tokens. Internally uses "▁" (U+2581) to represent space. " hello" becomes "▁hello".
- Byte fallback. Tokens that can't be reached by any merge fall back to byte-level. No OOV token, always lossless.
LLaMA 1 / LLaMA 2 / Mistral / Gemma use SentencePiece. LLaMA 3 switched to a 128k-vocab tiktoken-based BPE (the same library OpenAI uses). The differences are subtle in practice — the algorithm is BPE in both cases — but the byte-handling and the regex pre-tokeniser differ.
The whole comparison
| Tokenizer | Vocab | Algorithm | Pre-tokenisation | Used by |
|---|---|---|---|---|
| GPT-2 BPE | 50k | BPE | regex on words, byte-level merges | GPT-2, GPT-3 (r50k / p50k) |
| cl100k BPE | 100k | BPE | updated regex, better digit handling | GPT-3.5, GPT-4, GPT-4-turbo |
| WordPiece | 30k | Likelihood-based merge | whitespace + punctuation | BERT, DistilBERT |
| SentencePiece BPE | 32k–128k | BPE | None (raw text) | LLaMA, Mistral |
| SentencePiece Unigram | 32k | Unigram LM + EM | None | T5, ALBERT, mBART |
| o200k BPE | 200k | BPE, multilingual-balanced | regex | GPT-4o, GPT-4o-mini; LLaMA-3 uses a tiktoken-based 128k variant |
| Byte (ByT5) | 256 | None — raw bytes | None | ByT5, mT5-byte |
Why tokenization breaks math and code — the load-bearing failure mode
Three concrete examples that show up in every base LLM:
- "What is 4283 + 7918?" The number "4283" might tokenize as ["4", "28", "3"] or ["428", "3"] depending on training-data frequency. The model must learn that "428" + "79" + carry = something — a task it has no symmetric inductive bias for. Result: arithmetic fails for numbers it hasn't memorised.
- "How many r's in 'strawberry'?" "strawberry" tokenises (in GPT-4's tokenizer) as a single token: "strawberry". The model doesn't see individual characters; it sees one opaque ID. Asking for character-level structure requires the model to have memorised every word's spelling — which it hasn't, especially for less common words.
- Code indentation. Python uses 4 spaces; the tokenizer might or might not have " " (4 spaces) as a single token. If not, the model emits 4 separate space tokens for indentation, doubling token count for code. tiktoken's o200k handles this; many older tokenizers don't.
Multilingual fairness — tokens-per-word as the metric
A tokenizer trained on English gives English 1–2 tokens per word. For Hindi or Chinese, the same tokenizer gives 4–6× more tokens per equivalent meaning unit. This means:
- Same-cost API call yields 4–6× fewer Chinese sentences than English.
- Same context window holds 4–6× less Chinese content.
- Generation is 4–6× slower in token count (and therefore latency, since latency ∝ tokens).
The fix is a multilingual-balanced training corpus or a larger vocab. GPT-4's tokenizer increased from 100k → 200k tokens partly to add Chinese, Japanese, Korean multi-character tokens. Llama 3's 128k vocab is similarly motivated.
RL credit assignment at the subword level
From the RL track: in RLHF or RLVR, you assign a reward to a generated sequence, then compute policy gradients per-token. The reward / advantage gets distributed over the tokens that produced the output.
If "thank you" tokenises as one token ["thankyou"] in one model and three tokens ["thank", " you"] in another, the credit distribution is different. You can change the model's preferences by changing the tokenization, with no other change.
This is a real pitfall in RL post-training. The mitigations are:
- Use the same tokenizer through pretraining, SFT, and RL. Don't swap.
- Use sequence-level rewards (final-token only) when possible, to avoid token-count effects.
- Normalise per-token advantages by sequence length (Dr.GRPO's bias correction, see RL/lessons/14).
Interactive · tokenization in action
The interview probes
- Why is BPE/SentencePiece preferred over characters? Sequence length. Attention is O(N²); 5× more tokens is 25× more compute. Also: subwords carry morphological information ("walking" = "walk" + "ing") that characters do not.
- Why is BPE preferred over whole words? No OOV. Any word can be split into subwords already in the vocab. Compositionality: "ungoogleable" = "un" + "google" + "able" — the model can compose without having seen the exact word.
- Why does GPT-4 do arithmetic better than GPT-2? Mostly: training data and CoT, but also tokenization. GPT-4's tokenizer represents 3-digit numbers as single tokens up to some range, which is enough memorisation surface for short arithmetic. GPT-2 splits arbitrarily and has no consistent structure.
- If you double the vocabulary size, what changes? (a) Embedding table doubles in size — significant for small models. (b) Output softmax doubles → slightly higher per-token compute. (c) Tokens per sentence decreases (more rare words become 1 token). (d) Per-token learning signal becomes weaker (each token seen less often). Trade-off: vocab size ~32–128k is empirically the sweet spot.
- What's a "tokenizer-free" model? Operates on bytes or characters directly. ByT5 (T5 on bytes) and CharFormer are examples. They're more robust to noise (typos, unicode) and remove the language-fairness problem, but cost 4–6× more compute per "word". Niche.
Where things go subtly wrong
| Bug | Symptom | Diagnosis |
|---|---|---|
| Tokenizer mismatch between train and inference | Model gives garbage on simple prompts. | Loaded a model with the wrong tokenizer. Always pair the tokenizer with the model checkpoint. |
| BOS / EOS handling | Sequences don't terminate; or model can't be prompted past the first sentence. | BOS/EOS are special tokens with reserved IDs. Forgetting to add them or adding the wrong ones at fine-tune vs pretrain causes silent corruption. |
| Leading-space tokenization | "the" and " the" are scored very differently. | Most BPE tokenizers treat the leading space as part of the token (the Ġ trick). When constructing prompts, include the leading space if you want the next word to be the "mid-sentence" variant. |
| UTF-8 split mid-character | Generation produces an invalid UTF-8 byte sequence. | Byte-level BPE can stop mid-character. Decoders should buffer until a valid UTF-8 codepoint forms. Most libraries handle this; check on streaming output. |
| Padding tokens leaking into loss | Loss curves look healthy but model is bad at long sequences. | The padding token's loss should be masked. If not, the model "learns" to predict padding wherever the data ends, which interferes with EOS. |
Interview prompts you should be ready for
- "Why does the model fail at 'how many r's in strawberry'?" ("strawberry" is a single token in most modern tokenizers; the model never saw it as a sequence of characters. To answer reliably, the model would have to memorise the spelling of every word — which it hasn't for the long tail. Fixes: tool use (delegate to a string-length function), CoT (have the model spell out the word), or character-level fine-tuning.)
- "Walk me through BPE training in 60 seconds." (1. Initialise vocab = all bytes/characters. 2. Tokenise corpus. 3. Count adjacent token pairs. 4. Merge most frequent pair → new token. 5. Repeat until target vocab size. Inference: apply the same merges greedily left-to-right.)
- "Why does the vocab size matter for small models?" (Embedding params = V × d. For d=2048 and V=128k, that's 262M — a quarter of a 1B model. Reducing V to 32k saves 200M params that can go into the body. Beyond ~30B model size, vocab size is a rounding error.)
- "How would you tokenize for a code-focused model?" (Make sure indentation tokens (4 spaces, 8 spaces) and common code constructs ('def', 'function', '() {{ }}') are single tokens. Train on a corpus with code well-represented so frequent code patterns get merged.)
- "Two LLMs have the same architecture and same training data but different tokenizers. Same total compute. Which is better?" (The one with shorter sequences for the same content — typically the larger vocab — because attention is O(N²). But: per-token learning signal is weaker with rare tokens. The sweet spot is V≈100k–200k for modern English+multilingual LLMs.)
- "You're building an LLM for a low-resource language. Tokenizer concerns?" (English-trained tokenizers will give 4–6× tokens per word in low-resource languages. Train a balanced or oversample-the-low-resource-language tokenizer. Or use byte-level fallback. Or fine-tune the tokenizer (rare; usually retrain).)