deep_learning / 07 · tokenization lesson 7 / 12

Tokenization — BPE, WordPiece, SentencePiece, byte-level

The interface between text and the model. A bad tokenizer caps every downstream metric — context length, multilingual fairness, RL credit assignment, math accuracy. Most "weird" LLM behaviour (counting r's in "strawberry", failing arithmetic) is a tokenizer artifact.

Why tokenize at all — the three options nobody picks

UnitVocabulary sizeSequence length per wordWhy nobody uses it
Characters~100 (ASCII)5–10× longerSequence length blows up. Self-attention is O(N²) — 5× more tokens is 25× more attention. Plus character-level models train slower because they have to re-learn morphology from scratch.
Whole words~1M (unbounded for inflected languages)~1 per wordOOV (out-of-vocabulary) words are unrepresentable. "covid" didn't exist in 2019 vocabs. Plus the embedding table is huge.
Bytes (raw UTF-8)2564–6× longer (UTF-8 multibyte)Same length problem as chars. Used by ByT5; works but expensive.

Subword tokenization is the compromise: a vocabulary of 32k–200k tokens, each a string of 1–10 characters. Common words are one token; rare words split into multiple. Three near-equivalent algorithms produce this vocabulary:

BPE — Byte Pair Encoding, the greedy merge algorithm

Sennrich et al. 2016, adapted from a compression algorithm. The training procedure:

  1. Start with characters. Split every word in the training corpus into characters. Add a special end-of-word marker (some BPE variants).
  2. Count all adjacent pairs. "the cat" → ("t", "h"), ("h", "e"), ("c", "a"), ("a", "t"). Count across the corpus.
  3. Merge the most common pair into a new token. Add to vocab.
  4. Repeat until vocab reaches target size (e.g. 50k).

Inference: greedy left-to-right merge using the trained merge table. "tokenization" might split as "token" + "ization" if those merges existed at training.

GPT-2's tokenizer is a "byte-level BPE": operates on UTF-8 bytes, not characters. This makes it lossless (any byte sequence is representable) without needing an explicit "unknown" token.

The Ġ trick
GPT-2 prefixes the first character of each word with "Ġ" (a unicode dot-character used as a space-marker) so that "the" at the start of a sentence is a different token from " the" mid-sentence. This is critical for autoregressive generation: tokenizing "T h e c a t" naïvely would lose the space. The Ġ keeps "the" (start-of-sentence) and "Ġthe" (mid-sentence) as distinct tokens.

WordPiece — slightly different objective

Used by BERT and most encoder models. Same idea (subword merges) but the merge criterion is different:

AlgorithmMerge criterion
BPEMost frequent adjacent pair.
WordPiecePair that maximises the likelihood of the corpus under a unigram language model. Roughly: freq(AB) / (freq(A) · freq(B)) — gives a "PMI"-like score.

Practically near-identical results. WordPiece's PMI flavour gives slightly more semantic subwords (it favours pairs where the two pieces co-occur more than independently). Vocab sizes 30k–50k.

SentencePiece — the framework, not the algorithm

SentencePiece is a tokenizer library, not a different algorithm. It can train BPE or Unigram tokenizers (Kudo 2018 Unigram is the third subword algorithm: optimises a unigram LM with EM, prunes the vocab to target size). Key differences from earlier BPE implementations:

  1. Operates on raw text, no pre-tokenisation. No "split on whitespace first" step. Handles languages without spaces (Chinese, Japanese, Thai) cleanly.
  2. Spaces are part of tokens. Internally uses "▁" (U+2581) to represent space. " hello" becomes "▁hello".
  3. Byte fallback. Tokens that can't be reached by any merge fall back to byte-level. No OOV token, always lossless.

LLaMA 1 / LLaMA 2 / Mistral / Gemma use SentencePiece. LLaMA 3 switched to a 128k-vocab tiktoken-based BPE (the same library OpenAI uses). The differences are subtle in practice — the algorithm is BPE in both cases — but the byte-handling and the regex pre-tokeniser differ.

The whole comparison

TokenizerVocabAlgorithmPre-tokenisationUsed by
GPT-2 BPE50kBPEregex on words, byte-level mergesGPT-2, GPT-3 (r50k / p50k)
cl100k BPE100kBPEupdated regex, better digit handlingGPT-3.5, GPT-4, GPT-4-turbo
WordPiece30kLikelihood-based mergewhitespace + punctuationBERT, DistilBERT
SentencePiece BPE32k–128kBPENone (raw text)LLaMA, Mistral
SentencePiece Unigram32kUnigram LM + EMNoneT5, ALBERT, mBART
o200k BPE200kBPE, multilingual-balancedregexGPT-4o, GPT-4o-mini; LLaMA-3 uses a tiktoken-based 128k variant
Byte (ByT5)256None — raw bytesNoneByT5, mT5-byte

Why tokenization breaks math and code — the load-bearing failure mode

Three concrete examples that show up in every base LLM:

  1. "What is 4283 + 7918?" The number "4283" might tokenize as ["4", "28", "3"] or ["428", "3"] depending on training-data frequency. The model must learn that "428" + "79" + carry = something — a task it has no symmetric inductive bias for. Result: arithmetic fails for numbers it hasn't memorised.
  2. "How many r's in 'strawberry'?" "strawberry" tokenises (in GPT-4's tokenizer) as a single token: "strawberry". The model doesn't see individual characters; it sees one opaque ID. Asking for character-level structure requires the model to have memorised every word's spelling — which it hasn't, especially for less common words.
  3. Code indentation. Python uses 4 spaces; the tokenizer might or might not have " " (4 spaces) as a single token. If not, the model emits 4 separate space tokens for indentation, doubling token count for code. tiktoken's o200k handles this; many older tokenizers don't.
The interview soundbite
Tokenization is a discretisation of language. Any task that requires character-level structure or compositional sub-word manipulation will fail at base-model level. Fine-tuning + chain-of-thought + tool use (call a calculator) are the typical mitigations. "The model can't count letters" is rarely a learning failure — it's a tokenizer constraint.

Multilingual fairness — tokens-per-word as the metric

A tokenizer trained on English gives English 1–2 tokens per word. For Hindi or Chinese, the same tokenizer gives 4–6× more tokens per equivalent meaning unit. This means:

The fix is a multilingual-balanced training corpus or a larger vocab. GPT-4's tokenizer increased from 100k → 200k tokens partly to add Chinese, Japanese, Korean multi-character tokens. Llama 3's 128k vocab is similarly motivated.

RL credit assignment at the subword level

From the RL track: in RLHF or RLVR, you assign a reward to a generated sequence, then compute policy gradients per-token. The reward / advantage gets distributed over the tokens that produced the output.

If "thank you" tokenises as one token ["thankyou"] in one model and three tokens ["thank", " you"] in another, the credit distribution is different. You can change the model's preferences by changing the tokenization, with no other change.

This is a real pitfall in RL post-training. The mitigations are:

Interactive · tokenization in action

Tokenize a string
Paste any text. The widget shows GPT-2-style greedy BPE tokenization using a small example vocab. The goal is to feel how merges work and where breakpoints land.
characters
tokens (gpt-2-ish)
bytes / token

The interview probes

  1. Why is BPE/SentencePiece preferred over characters? Sequence length. Attention is O(N²); 5× more tokens is 25× more compute. Also: subwords carry morphological information ("walking" = "walk" + "ing") that characters do not.
  2. Why is BPE preferred over whole words? No OOV. Any word can be split into subwords already in the vocab. Compositionality: "ungoogleable" = "un" + "google" + "able" — the model can compose without having seen the exact word.
  3. Why does GPT-4 do arithmetic better than GPT-2? Mostly: training data and CoT, but also tokenization. GPT-4's tokenizer represents 3-digit numbers as single tokens up to some range, which is enough memorisation surface for short arithmetic. GPT-2 splits arbitrarily and has no consistent structure.
  4. If you double the vocabulary size, what changes? (a) Embedding table doubles in size — significant for small models. (b) Output softmax doubles → slightly higher per-token compute. (c) Tokens per sentence decreases (more rare words become 1 token). (d) Per-token learning signal becomes weaker (each token seen less often). Trade-off: vocab size ~32–128k is empirically the sweet spot.
  5. What's a "tokenizer-free" model? Operates on bytes or characters directly. ByT5 (T5 on bytes) and CharFormer are examples. They're more robust to noise (typos, unicode) and remove the language-fairness problem, but cost 4–6× more compute per "word". Niche.

Where things go subtly wrong

BugSymptomDiagnosis
Tokenizer mismatch between train and inference Model gives garbage on simple prompts. Loaded a model with the wrong tokenizer. Always pair the tokenizer with the model checkpoint.
BOS / EOS handling Sequences don't terminate; or model can't be prompted past the first sentence. BOS/EOS are special tokens with reserved IDs. Forgetting to add them or adding the wrong ones at fine-tune vs pretrain causes silent corruption.
Leading-space tokenization "the" and " the" are scored very differently. Most BPE tokenizers treat the leading space as part of the token (the Ġ trick). When constructing prompts, include the leading space if you want the next word to be the "mid-sentence" variant.
UTF-8 split mid-character Generation produces an invalid UTF-8 byte sequence. Byte-level BPE can stop mid-character. Decoders should buffer until a valid UTF-8 codepoint forms. Most libraries handle this; check on streaming output.
Padding tokens leaking into loss Loss curves look healthy but model is bad at long sequences. The padding token's loss should be masked. If not, the model "learns" to predict padding wherever the data ends, which interferes with EOS.

Interview prompts you should be ready for

  1. "Why does the model fail at 'how many r's in strawberry'?" ("strawberry" is a single token in most modern tokenizers; the model never saw it as a sequence of characters. To answer reliably, the model would have to memorise the spelling of every word — which it hasn't for the long tail. Fixes: tool use (delegate to a string-length function), CoT (have the model spell out the word), or character-level fine-tuning.)
  2. "Walk me through BPE training in 60 seconds." (1. Initialise vocab = all bytes/characters. 2. Tokenise corpus. 3. Count adjacent token pairs. 4. Merge most frequent pair → new token. 5. Repeat until target vocab size. Inference: apply the same merges greedily left-to-right.)
  3. "Why does the vocab size matter for small models?" (Embedding params = V × d. For d=2048 and V=128k, that's 262M — a quarter of a 1B model. Reducing V to 32k saves 200M params that can go into the body. Beyond ~30B model size, vocab size is a rounding error.)
  4. "How would you tokenize for a code-focused model?" (Make sure indentation tokens (4 spaces, 8 spaces) and common code constructs ('def', 'function', '() {{ }}') are single tokens. Train on a corpus with code well-represented so frequent code patterns get merged.)
  5. "Two LLMs have the same architecture and same training data but different tokenizers. Same total compute. Which is better?" (The one with shorter sequences for the same content — typically the larger vocab — because attention is O(N²). But: per-token learning signal is weaker with rare tokens. The sweet spot is V≈100k–200k for modern English+multilingual LLMs.)
  6. "You're building an LLM for a low-resource language. Tokenizer concerns?" (English-trained tokenizers will give 4–6× tokens per word in low-resource languages. Train a balanced or oversample-the-low-resource-language tokenizer. Or use byte-level fallback. Or fine-tune the tokenizer (rare; usually retrain).)
Takeaway
Tokenization is a discretisation step that bounds every downstream metric. BPE/SentencePiece is the modern default. The vocab size is a parameter-vs-resolution trade. Most "weird" LLM failures (math, character counting, multilingual gap) are tokenizer artifacts, not learning failures. The senior signal is being able to identify when a failure is the tokenizer's fault and propose the right mitigation (tool use, CoT, fine-tune, retrain tokenizer).