Pretraining — one loss, all the capability

Why minimizing next-token surprise is enough to grow grammar, world knowledge, arithmetic, and a faint shimmer of reasoning — and why it still produces a model that won't follow your instructions.

The objective, in one line

Open 00_pretrain.py and the entire training pipeline reduces to a single expectation:

L_pretrain(θ) = − 𝔼_{x ∼ corpus} [ Σ_{t=1}^{T} log p_θ(x_t | x_{<t}) ]

Read this aloud: pick a random window of text x = (x₁, …, x_T) from the corpus, predict each token from the ones that came before it, sum the per-position negative log-likelihoods, average over windows. There is no label, no task, no “prompt” vs “response” — there is only a stream of tokens and the model's job of guessing what comes next. The corpus literally is the supervision signal.

The expectation has two layers worth pulling apart. The outer expectation over x ∼ corpus is what makes this a statistical learning problem rather than a memorization problem: we want the model to generalize to text it has not seen. The inner sum over positions t is what makes pretraining so computationally efficient — every position in every window is a full training example, scored independently, and (thanks to the causal mask we'll get to) every one of those T per-window losses comes out of a single forward pass. A 4-billion-token corpus with T = 64 windows therefore offers around 4×10⁹ labeled examples for free.

Notation glossary, kept compact

x_<t — all tokens strictly before position t. The context the model conditions on.
p_θ(x_t | x_<t) — the model's predicted probability of the actual next token, evaluated against the truth.
θ — every learnable weight in the transformer (token embedding, attention W_q,k,v,o, MLP, layer norms; the head is weight-tied to the embedding).
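Negative log-likelihood, cross-entropy, and perplexity are the same quantity in different guises. Cross-entropy is the mean per-token NLL, measured in nats (natural log) or bits (log base 2); perplexity is its exponential, PPL = exp(mean per-token NLL in nats). Note that L_pretrain above sums over the T positions of a window, so divide by T before exponentiating.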
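To make the unit conversions concrete, here is a minimal sketch in plain Python. The four probabilities are made-up values standing in for what a model assigned to the true next tokens; nothing here comes from 00_pretrain.py.

```python
import math

# Hypothetical probabilities the model gave to the *actual* next token
# at four consecutive positions (illustrative values only).
probs = [0.5, 0.25, 0.125, 0.125]

nll_nats = [-math.log(p) for p in probs]     # per-token NLL in nats
mean_nats = sum(nll_nats) / len(nll_nats)    # cross-entropy estimate, nats per token
mean_bits = mean_nats / math.log(2)          # the same quantity in bits per token
perplexity = math.exp(mean_nats)             # PPL = exp(mean NLL in nats)

print(f"{mean_nats:.4f} nats/token, {mean_bits:.4f} bits/token, PPL {perplexity:.4f}")
```

A perplexity of k can be read as "the model is as uncertain as if it were choosing uniformly among k tokens at each step".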

Why this loss and not a dozen plausible alternatives

Self-supervised learning has had its share of objectives. Word2vec used negative sampling on co-occurrence. BERT used masked language modeling (MLM): select 15% of tokens, replace most of them with [MASK], and predict the originals from both left and right context. SimCLR-style contrastive losses learn representations by pulling augmented views together. So why does nearly every frontier LLM use plain old left-to-right next-token prediction?

It isn't the only self-supervised objective — MLM, contrastive losses, and co-occurrence statistics are all label-free. What's special about next-token prediction is the combination of three properties; each is worth a paragraph.

(a) Zero labels. Any contiguous text is a training example. The objective is “next token,” and the next token is always sitting right there in the corpus. There is no annotator, no quality filter beyond corpus selection, no class imbalance — the supervision is structural. This matters because data is the bottleneck in modern pretraining, and any objective that requires hand-labeled targets caps your maximum corpus size at the size of your labeling budget. Next-token prediction has no such cap; you can throw everything at it.

(b) One pass yields T loss terms. The model emits a logit at every position. With the causal mask (covered below), position t never reads positions > t, so no logit can peek at its own target and the loss can be evaluated at all T positions in parallel. MLM, by contrast, only trains on the roughly 15% of positions selected for masking in each pass; the remaining positions carry no loss because the model can trivially see them. A causal LM therefore gets roughly 6 to 7× as many loss terms out of each forward pass (1 / 0.15 ≈ 6.7). For generative models this efficiency advantage compounds over a full training run.

(c) Every downstream task is a special case. “Answer this question” is “continue this text where the next text is an answer.” “Translate this sentence” is “continue this text where the next text is the same sentence in French.” “Write this code” is “continue this text where the next text is well-formed Python.” If p_θ(x_t | x_<t) is accurate for every context, then conditional generation works for free. The objective is universal in a way most are not.

Trade-off vs MLM

Causal LMs train every position but only see left context. Masked LMs train fewer positions per pass but see bidirectional context — which is strictly more informative for understanding a token's meaning given its surroundings. So for representation tasks (classification, retrieval, sentence similarity) BERT-style MLM models often beat causal models at equal scale; for generation tasks causal wins because the architecture matches the inference setting (you can't see the future when sampling). Once we crossed into “generation is the unified interface,” causal LMs ate everything.

Compression as the engine of capability

Here is the part of pretraining that still feels improbable. We minimize a single scalar — the negative log-likelihood of next tokens — and out of that simple optimization comes grammar, syntax, world knowledge, basic arithmetic, the rough shape of logical inference, code structure, dialogue conventions, and the surface features of a hundred natural languages. Why does “guess the next token” produce so much?

The shortest answer is: there is no other way to lower the loss. Imagine a model trying to predict the token immediately after the substring "The capital of France is ". A model with no knowledge of geography spreads probability mass roughly uniformly across plausible continuations and pays a high NLL on whichever token actually follows. A model that has compressed the fact France's capital is Paris into its weights places most of its mass on "Paris" and pays nearly zero loss. The compression is unavoidable: any feature of the world that systematically shows up in text is a feature whose representation lowers loss when learned. Hutter and Solomonoff make this precise — optimal compression of a stream is equivalent to optimal prediction of the stream — but the operational intuition is the one that matters: capability emerges as a side-effect of compressing the distribution.
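The size of that gap is easy to put numbers on. The sketch below assumes, purely for illustration, that the uninformed model spreads its mass uniformly over 200 plausible continuations and that the informed model puts 0.9 on "Paris"; both figures are hypothetical.

```python
import math

# Illustrative assumptions, not measurements:
n_plausible = 200      # continuations the uninformed model considers equally likely
p_paris_informed = 0.9 # mass the informed model places on "Paris"

uninformed_nll = -math.log(1.0 / n_plausible)  # loss paid when "Paris" is the true token
informed_nll = -math.log(p_paris_informed)

print(f"uninformed: {uninformed_nll:.3f} nats")
print(f"informed:   {informed_nll:.3f} nats")
print(f"saved:      {uninformed_nll - informed_nll:.3f} nats on this one token")
```

Every fact that recurs in the corpus offers a saving of this kind, which is why gradient descent keeps finding them.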

Compression here is doing real work. The model has, say, 10⁸ parameters and is being asked to predict the next token across 10¹¹ training tokens. It cannot memorize; it must represent regularities. Those regularities are the things we colloquially call “knowledge” and “skills.” Arithmetic emerges because addition problems show up in text and being able to compute their answers correctly lowers NLL. Code structure emerges because variables that get used later were defined earlier, and tracking those references lowers NLL. Basic theory of mind emerges because dialogue is full of attributions to other minds, and modeling them is the lowest-loss strategy for predicting how a character will respond.

This is why the data mix matters so much. Pretraining bakes in capabilities that are statistically supported by the corpus — if you want better math, include more math.

What pretraining actually produces — a completer, not an assistant

Take a freshly pretrained model and prompt it with "What is 2+2?". What do you get?

Honest answer: probably a continuation like "What is 3+3? What is 4+4? What is 5+5?". Or "What is 2+2? was the question my teacher asked us on the first day of class." Or "What is 2+2?\n\nThis lesson, we will explore basic addition...". All three are faithful: each is a plausible continuation of a string that looks, in the corpus, like the opening line of a worksheet, a memoir, or a textbook. The pretrained model has no notion of “the user wants the answer”; it knows only what text typically follows other text.

This is the gap that lesson 3 (SFT) closes, and the reason every modern LLM goes through post-training. It is worth pausing on, because it explains an enormous fraction of why pretraining alone is not enough.

The diagram captures the conceptual setting. The pretrained model defines a distribution p_θ(continuation | prompt). Each branch is a continuation; the model's job has always been to predict the most likely one. The desired branch — “4” — exists in the tree, but unless the prompt looks like “Q: What is 2+2?\nA:” (with that exact corpus-shape signal), the model has no reason to prefer it over more probable branches. The gap is structural: likelihood under the corpus is not the same as task-completion. SFT bends the distribution; pretraining cannot.

Why this is structural, not a bug

Pretraining maximizes likelihood under p(text). The corpus contains far more text that “keeps the prompt going” than text that “answers the prompt directly,” because most text on the internet is not in a Q/A format. The model is doing exactly what it was asked to do; the loss simply does not encode “the user wants you to be helpful.”

Random windows, not sequential passes

Look at CharDataset.get_batch:

def get_batch(self, split, B, T, device="cpu"):
    d = self.train if split == "train" else self.val
    ix = torch.randint(0, len(d) - T - 1, (B,))        # B random start positions
    x = torch.stack([d[i    : i + T    ] for i in ix]) # (B, T)
    y = torch.stack([d[i + 1: i + T + 1] for i in ix]) # (B, T) shifted by one
    return x.to(device), y.to(device)

The dataset samples B uniformly random start positions per step and returns the contiguous window of length T starting at each. We are not iterating the corpus end-to-end; we are sampling with replacement.

Why? Two reasons, both about gradient quality.

Decorrelation. SGD's convergence theory assumes the gradient at each step is an unbiased estimate of the full-batch gradient with bounded variance. Sequential passes break that assumption: consecutive windows come from the same stretch of the corpus (and share tokens outright when the stride is smaller than T), so their gradients are highly autocorrelated. The optimizer over-fits to recent locations in the corpus and the loss curve becomes wavy and slow. Random sampling restores approximate independence; the gradient at each step looks like a fresh draw, the same way it does for IID datasets like ImageNet. AdamW's running moment estimates also behave better when successive gradients are roughly IID.

Position-offset coverage. With T = 64, a token at corpus position 12345 will appear in windows starting at 12282, 12283, …, 12345 (64 start positions in all), and so at every relative position 0 through T-1 across different random draws. A sequential pass with non-overlapping windows only ever sees each token at one position offset, because the token appears in exactly one window. Random sampling forces the model to generalize across positions, which is exactly the kind of generalization the position-embedding mechanism is supposed to deliver. This is a feature, not a bug.
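The coverage claim can be checked by enumeration. This is a small sketch assuming T = 64 and ignoring corpus boundaries; window_starts_containing is a helper invented for the illustration.

```python
T = 64        # window length assumed for this example
pos = 12345   # corpus position of the token we are tracking

def window_starts_containing(pos, T):
    """All start indices s such that the window [s, s + T) contains pos."""
    return list(range(pos - T + 1, pos + 1))

starts = window_starts_containing(pos, T)
offsets = sorted(pos - s for s in starts)  # where the token sits inside each window

print("first start:", starts[0], "last start:", starts[-1], "count:", len(starts))
print("offsets covered:", offsets[0], "through", offsets[-1])
```

Each of the T possible starts places the token at a different relative offset, so over many random draws the token is seen at every position in the window.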

Trade-off

Random sampling reuses tokens. Over a training run with random windows, each token in the corpus is seen many times at many different offsets, blending memorization and generalization. For very small corpora this risks overfitting (you'll memorize the corpus before you generalize). For very large corpora it's the right choice. The frontier-LLM compromise is “document-shuffled epochs”: shuffle the order of documents but iterate within each, getting most of the decorrelation benefit while limiting reuse.
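A document-shuffled epoch can be sketched in a few lines. This is a simplified illustration, not how any particular lab does it: the function name is invented, documents are plain strings, and windows are non-overlapping with stride T (real pipelines typically also pack or concatenate documents).

```python
import random

def document_shuffled_windows(docs, T, seed=0):
    """Yield (x, y) windows: shuffle document order, then walk each document in order."""
    rng = random.Random(seed)
    order = list(range(len(docs)))
    rng.shuffle(order)                      # decorrelate at the document level
    for d in order:
        doc = docs[d]
        # Non-overlapping stride-T windows; each needs T + 1 tokens for the shifted target.
        for start in range(0, len(doc) - T, T):
            yield doc[start:start + T], doc[start + 1:start + T + 1]

docs = ["abcdefghij", "0123456789"]         # two tiny stand-in documents
windows = list(document_shuffled_windows(docs, T=4))
for x, y in windows:
    print(repr(x), "->", repr(y))
```

Each token is used as an input at most once per epoch, which is the "limiting reuse" half of the compromise; the shuffle supplies the decorrelation half.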

Shifted targets — the “next” in next-token

The dataset returns (x, y) where y[t] = x[t+1]. The model emits a logit at every position t based on x[0..t], and that logit is scored against y[t], which is the token that actually follows. Concretely:

"To be, or "

x: the input the model sees (10 characters; the quotes mark the trailing space)

"o be, or n"

y: the target at each position is the token that came one step later (the corpus continues "...or not...")
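The same shift can be reproduced on a plain string. The sketch below uses a short stand-in text and T = 10; it mirrors the slicing in get_batch but works on characters rather than tensor ids.

```python
text = "To be, or not to be"   # stand-in corpus fragment
T = 10                         # window length for this illustration

x = text[0:T]                  # what the model sees
y = text[1:T + 1]              # the same window shifted left by one

for t, (inp, tgt) in enumerate(zip(x, y)):
    # At position t the model has read x[0..t] and is scored on y[t].
    print(f"t={t}: context ends with {inp!r}, target is {tgt!r}")
```

Printing with repr makes the spaces visible, which is where off-by-one mistakes in the shift usually hide.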

In code, the loss is computed in one shot with

_, loss = model(x, y)                             # cross_entropy is computed inside MiniGPT.forward
# inside forward(), with logits: (B, T, V) and targets: (B, T):
#   loss = F.cross_entropy(logits.view(B*T, V),    # flatten batch+time
#                          targets.view(B*T))     # mean over B*T positions

That's B·T independent next-token predictions every step. With B = 32, T = 64, every gradient update is reading 2,048 supervised examples. The shift trick is how the streaming corpus is converted into supervised pairs for free.
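To see that the flattened call really is a mean over B·T separate predictions, the sketch below compares it against a hand-rolled version. It assumes PyTorch is installed; the random logits and the vocabulary size V = 65 are placeholders, not values taken from 00_pretrain.py.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 32, 64, 65                     # V = 65 is an assumed vocabulary size
logits = torch.randn(B, T, V)            # stand-in for the model's output
targets = torch.randint(0, V, (B, T))    # stand-in for y

# The one-shot form used in the text: flatten batch and time, mean over B*T.
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))

# Manual equivalent: log-softmax, pick the true token's log-prob, negate, average.
log_probs = F.log_softmax(logits, dim=-1)                          # (B, T, V)
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # (B, T)
manual = -picked.mean()

print(picked.numel(), "predictions;", loss.item(), "vs", manual.item())
```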

The causal mask is what makes all of this parallel

If we didn't mask, position t's logit could attend to position t+1 and cheat — the model would learn to copy the answer from the future. With the causal mask in place, position t's representation depends only on positions ≤ t, so its logit's prediction of token t+1 is “honest” — the model didn't see t+1 when producing it.

The architectural payoff is enormous. RNNs achieve the same causality by being serial in time: produce hidden state h₁, then use it to produce h₂, then h₃, ... — O(T) sequential operations per example. Transformers do all T positions in parallel by computing attention with a T × T mask. One forward pass; T losses; full GPU utilization. This is the single biggest reason transformers train faster than RNNs per unit of hardware.
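The "honesty" property is testable: if the mask works, changing a later token must leave every earlier output untouched. Below is a minimal single-head sketch assuming PyTorch; causal_attention is written for this illustration and is not the attention module from 00_pretrain.py.

```python
import torch

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a lower-triangular mask."""
    T, d = q.shape
    scores = q @ k.T / d ** 0.5                              # (T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))    # True where j <= i
    scores = scores.masked_fill(~mask, float("-inf"))        # block attention to the future
    return torch.softmax(scores, dim=-1) @ v                 # (T, d)

torch.manual_seed(0)
T, d = 6, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = causal_attention(q, k, v)

# Perturb only the final position's key and value, then recompute.
k2, v2 = k.clone(), v.clone()
k2[-1] += 1.0
v2[-1] += 1.0
out2 = causal_attention(q, k2, v2)

print("earlier positions unchanged:", torch.allclose(out[:-1], out2[:-1]))
print("last position changed:", not torch.allclose(out[-1], out2[-1]))
```

All T rows are produced by one matrix product, which is the parallelism the paragraph above describes.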

Trade-off

Causal attention costs O(T²) memory and FLOPs in sequence length. RNNs are O(T). For very long sequences (32k+ tokens) this quadratic term becomes painful and motivates a whole sub-field — flash-attention, sliding-window attention, linear attention, state-space models. At the toy scale of 00_pretrain.py (T = 64) the quadratic cost is invisible.
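A quick back-of-the-envelope calculation shows why the quadratic term is invisible at T = 64 and painful at 32k. The sketch assumes the T × T score matrix is materialized in 2-byte precision for one head and one sequence; both the precision and the helper name are assumptions for illustration.

```python
def attn_score_bytes(T, bytes_per_entry=2):
    """Bytes for one materialized T x T attention score matrix (one head, one sequence)."""
    return T * T * bytes_per_entry

for T in (64, 2048, 32768):
    size = attn_score_bytes(T)
    print(f"T={T:>6}: {size:>13,} bytes ({size / 1024**2:,.2f} MiB)")
```

Multiply by heads, layers, and batch size and the long-context case dominates memory, which is the pressure that methods avoiding the full matrix are responding to.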

The training loop, in 12 lines

for step in range(steps):
    x, y = dataset.get_batch("train", B, T, device)   # (B, T), (B, T)
    _, loss = model(x, y)                             # F.cross_entropy inside
    opt.zero_grad()
    loss.backward()                                   # accumulate ∇θ L
    torch.nn.utils.clip_grad_norm_(params, 1.0)       # bound worst-case step
    opt.step()                                        # AdamW(β₁=0.9, β₂=0.95)
    if step % 200 == 0:
        with torch.no_grad():
            xv, yv = dataset.get_batch("val", ...)
            _, val_loss = model(xv, yv)
        print(step, loss, val_loss)

Twelve lines, and they would not change in shape for a 70-billion-parameter model. What changes for real models is the per-step scale — gradient accumulation across many GPUs, ZeRO sharding of optimizer state, mixed-precision arithmetic, careful learning-rate schedules — but the inner contract (forward, loss, backward, clip, step) is the same loop.

Optimizer choices, justified

The pretraining file picks AdamW with specific hyperparameters. Each one is a deliberate response to a known failure mode.

lr = 3e-4. A robust default for transformer pretraining at this scale. Too high and Adam's adaptive normalization can mask divergence until it explodes; too low and you waste compute. Modern recipes use warmup + cosine schedule; the toy file uses constant lr for clarity.

β₁ = 0.9, β₂ = 0.95. Adam's β₂ controls how quickly the running estimate of the squared gradient adapts. The default 0.999 averages over roughly 1/(1 − β₂) = 1000 steps, which is slow at the start of training, when gradient magnitudes are changing fast. β₂ = 0.95 averages over roughly 20 steps, which tracks the rapidly changing scale of early-training gradients without losing too much smoothing later. This is one of the small details popularized by GPT-3-era training recipes, and it has stuck in many transformer codebases since.
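The "forgets in ~N steps" figures come from the effective horizon of an exponential moving average, 1 / (1 − β). A small sketch in plain Python, with helper names invented for the illustration:

```python
def ema_horizon(beta):
    """Effective averaging window of an exponential moving average."""
    return 1.0 / (1.0 - beta)

def old_weight(beta, k):
    """Relative weight still given to a squared gradient from k steps ago."""
    return beta ** k

for beta in (0.999, 0.95):
    print(f"beta2={beta}: horizon ~{ema_horizon(beta):.0f} steps, "
          f"weight of a 20-step-old gradient = {old_weight(beta, 20):.3f}")
```

With β₂ = 0.999 a gradient from 20 steps ago still counts almost fully; with β₂ = 0.95 it has already decayed to roughly a third.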

weight_decay = 0.1, decoupled. The “W” in AdamW. Plain Adam with L2 regularization fights itself: the regularizer's gradient gets divided by Adam's adaptive scaling, so parameters with a large gradient history get less penalty than they should. Decoupled weight decay (Loshchilov & Hutter 2017) instead subtracts η·λ·θ directly from θ, outside the adaptive update, so the regularization strength is independent of gradient scale. The 0.1 value is high relative to vision-model defaults (~1e-4); transformers tolerate, and arguably benefit from, much stronger regularization, especially on the embedding matrix.
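The difference shows up even on a single scalar parameter at the very first step. The function below is a hand-written sketch of one Adam step for illustration (not PyTorch's implementation); θ = 10 and a gradient of 0.01 are made-up values.

```python
def adam_first_step(theta, grad, lr=3e-4, wd=0.1, decoupled=True,
                    b1=0.9, b2=0.95, eps=1e-8):
    """One Adam step from zero-initialised moments, for a single scalar parameter."""
    g = grad if decoupled else grad + wd * theta  # L2 folds the penalty into the gradient
    m_hat = ((1 - b1) * g) / (1 - b1)             # bias-corrected first moment at step 1
    v_hat = ((1 - b2) * g * g) / (1 - b2)         # bias-corrected second moment at step 1
    new_theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    if decoupled:
        new_theta -= lr * wd * theta              # AdamW: shrink theta outside the adaptive update
    return new_theta

theta, grad = 10.0, 0.01
l2_with_wd = adam_first_step(theta, grad, wd=0.1, decoupled=False)
l2_no_wd = adam_first_step(theta, grad, wd=0.0, decoupled=False)
adamw = adam_first_step(theta, grad, wd=0.1, decoupled=True)

print("Adam + L2, wd=0.1:", l2_with_wd)
print("Adam + L2, wd=0.0:", l2_no_wd)
print("AdamW,     wd=0.1:", adamw)
```

Under L2 the penalty is normalized away (the step is about lr with or without it), while the decoupled version shrinks θ by exactly lr · wd · θ regardless of the gradient.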

grad-norm clip = 1.0. This is the one that catches the rare bad batch. A typical step sees gradients with norm well under 1.0; an occasional batch (say one with a rare punctuation pattern, or a degenerate window) produces a gradient with norm 50. Without clipping, that single step can move the weights far enough that recovery takes hundreds of steps. With clipping, the worst-case step is bounded: if ∥g∥₂ > 1.0, rescale g ← g · (1.0 / ∥g∥₂), so the clipped norm equals min(∥g∥₂, 1.0). The cost is the rare beneficial “big move” you suppress; the benefit is that no single batch can blow up training. It's cheap insurance.

Why grad-clipping in particular matters here

Rare tokens have rarely updated embeddings, and their gradients accumulate sharply on the few batches that touch them. Their unclipped gradient norm can be 100× the mean. Clipping treats those steps as “direction is fine, magnitude isn't”: keep the direction, cap the step. It's the single highest-leverage line of stability code in the loop.
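Here is what "keep the direction, cap the step" looks like with the same utility the training loop calls. A minimal sketch assuming PyTorch; the gradient values are invented so that the norm is exactly 50 in one case and 0.5 in the other.

```python
import torch

# A "bad batch" gradient with L2 norm 50 (a 30-40-50 triangle, chosen for easy arithmetic).
w_bad = torch.nn.Parameter(torch.zeros(4))
w_bad.grad = torch.tensor([30.0, 40.0, 0.0, 0.0])
norm_before = torch.nn.utils.clip_grad_norm_([w_bad], max_norm=1.0)  # returns the pre-clip norm
norm_after = w_bad.grad.norm()

# A typical gradient with norm 0.5: clipping leaves it alone.
w_ok = torch.nn.Parameter(torch.zeros(4))
w_ok.grad = torch.tensor([0.3, 0.4, 0.0, 0.0])
torch.nn.utils.clip_grad_norm_([w_ok], max_norm=1.0)

print("bad batch:", norm_before.item(), "->", norm_after.item())
print("direction kept:", (w_bad.grad[0] / w_bad.grad[1]).item())   # still 30/40
print("typical batch untouched:", w_ok.grad.tolist())
```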

Interactive · build a tiny char-level next-token predictor

To put hands on the objective, here is a toy bigram/trigram model (strictly speaking its probabilities are hardcoded rather than trained) built on the same Shakespeare passage used in 00_pretrain.py. Type a prompt or click a preset, see the predicted distribution over the next character, and click “Greedy continue” to watch it complete the text one character at a time. Move the temperature slider to see how a flatter distribution changes which continuations get sampled. This isn't a transformer, but the loss it stands in for and the inference loop it runs are identical in shape to what 00_pretrain.py does, and to what the same recipe does at billion-parameter scale.

Char-level next-token predictor

A bigram + trigram smooth model built from a short Shakespeare passage. Watch the top-5 next-char predictions update as you type. Temperature reshapes the distribution at sampling time — high temperature = more diverse, low = more deterministic.

[Interactive widget: temperature slider (default 1.00), a prompt + completion pane (model output appears in blue), the top-5 predictions for the next character, and readouts for chars generated, entropy of the last distribution, perplexity, and temperature.]

What this micro-experiment shows you

Run the widget for a minute. You will see three things that scale up unchanged to a real LLM:

The distribution over next characters is peaked but not deterministic. The model has views; it does not have certainty. Lowering temperature concentrates that mass; raising it flattens it. This is exactly the same knob the production sampler exposes.
The generated text continues the style of the prompt. Start with "To be" and you will drift into Shakespeare-flavored fragments, because that is the only thing the model knows. Same as the pretrained transformer in 00_pretrain.py when given the same starting string.
The model has no concept of “answer the question.” If you type "What is", the most likely continuation is whatever sequence of characters most often follows that prefix in the corpus — which is unlikely to be a useful answer. The same is true of the full transformer. The same is true of GPT-3 base. This is the universal property of pretrained LMs.
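The widget's behaviour can be approximated offline in plain Python. The sketch below is a bigram-only simplification (the widget also uses trigrams and smoothing), built on a one-line stand-in passage rather than the text from 00_pretrain.py; all function names are invented for the illustration.

```python
import math
from collections import Counter, defaultdict

corpus = "to be, or not to be, that is the question"   # stand-in passage

# Bigram counts: counts[a][b] = number of times character b follows character a.
counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def next_char_dist(prev, temperature=1.0):
    """Next-char distribution after `prev`; p^(1/temp) renormalised equals softmax(log p / temp)."""
    weights = {ch: n ** (1.0 / temperature) for ch, n in counts[prev].items()}
    z = sum(weights.values())
    return {ch: w / z for ch, w in weights.items()} if z > 0 else {}

def entropy(dist):
    """Shannon entropy of a distribution, in nats."""
    return -sum(p * math.log(p) for p in dist.values())

def greedy_continue(prompt, n):
    """Append the single most likely next character, n times."""
    out = prompt
    for _ in range(n):
        dist = next_char_dist(out[-1])
        if not dist:          # character never seen with a successor
            break
        out += max(dist, key=dist.get)
    return out

for temp in (0.5, 1.0, 2.0):
    print(f"temperature {temp}: entropy after 't' = {entropy(next_char_dist('t', temp)):.3f} nats")
print(repr(greedy_continue("q", 2)))
```

Lower temperature sharpens the distribution and raises it flattens it, and the continuation only ever reproduces patterns from the passage: the first two observations above, in miniature.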

Scaling laws, the briefest possible sketch

The toy in 00_pretrain.py uses d = 128, L = 4, h = 4 (roughly 0.8M non-embedding parameters by the usual 12·L·d² estimate, assuming a standard 4× MLP) trained on a few thousand characters of Shakespeare. The same loop, the same model class, and the same loss, scaled by a factor of roughly 10⁶× in parameters and 10⁹× in tokens, is the recipe behind frontier models of the GPT-4 class. Nothing structural changes; the mechanism is identical.

The empirical scaling laws (Kaplan, Henighan, Hoffmann/Chinchilla) tell you how to spend a compute budget. The Chinchilla rule of thumb is ~20 training tokens per parameter: a 7B-parameter model wants ~140B training tokens; a 70B model wants ~1.4T. The exact constant has moved over time (many recent models are deliberately trained on far more tokens per parameter, since a smaller model is cheaper to serve), but the basic geometry is robust: parameters and tokens should scale together, and compute is wasted if either gets too far ahead.
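The rule of thumb is simple enough to write as a two-line calculator. The sketch below reproduces the figures above; the compute estimate uses the common C ≈ 6·N·D approximation, which is an addition for illustration and not something this lesson derives.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb training tokens for a model with n_params parameters."""
    return n_params * tokens_per_param

def approx_train_flops(n_params, n_tokens):
    """Rough training compute via the common C ~ 6 * N * D approximation (an assumption here)."""
    return 6 * n_params * n_tokens

for n_params in (7_000_000_000, 70_000_000_000):
    n_tokens = chinchilla_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B params -> {n_tokens / 1e9:,.0f}B tokens, "
          f"~{approx_train_flops(n_params, n_tokens):.2e} FLOPs")
```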

The takeaway for this lesson: none of what we've described is exotic at the frontier. Frontier pretraining is the same loss, the same shift-by-one trick, the same AdamW + grad-clip, with many orders of magnitude more of everything. The model you build in 00_pretrain.py is the same model, in the same way that a single-cell organism is the same kind of thing as a whale.

What pretraining cannot do

Now the limitation, stated as cleanly as possible. The pretraining objective prefers continuations that are more probable under the corpus. It has no other preference. If the corpus contains a lot of partially-completed worksheets and very few direct “Q: ... A: ...” pairs, then completing a worksheet beats answering the question — because that is what the loss rewards. If the corpus contains assertive, confident prose and very little uncertainty, the model will be assertive and confident even when it should not be — because that is what the loss rewards. Pretraining is faithful to its data; it is not faithful to what the user wants.

You cannot fix this with a better corpus alone. You can shift the balance — and in fact “instruction-tuning-style data” in the pretraining mix already helps — but the objective itself does not have a notion of “preferred response to instruction.” It only has “likely continuation.” To get task-completion behavior, you need an objective that targets task-completion behavior. That is what comes next.

The takeaway

Pretraining produces a model that is excellent at p(text) and indifferent to p(useful response | instruction). Every capability we love in a chatbot (instruction following, calibrated uncertainty, refusals, tool use) has to be put there by post-training, on top of a base that already knows grammar, world facts, and how to keep a thought going.

Bridge to lesson 3 — SFT

The fix is astonishingly small. The loss stays the same: cross-entropy on next tokens. The architecture stays the same: same transformer, same weights, same forward pass. The optimizer stays the same: AdamW + grad-clip. What changes is just two things:

The data. Instead of raw corpus windows, we feed (prompt, response) pairs in a chat template like < user_prompt > assistant_response #.
The loss mask. We set the target at every prompt position to IGNORE_INDEX = -100, which F.cross_entropy skips. The loss is therefore computed only at response positions.

That's it. Two lines of code and a different data shape, and the same model learns to dispatch from instruction-shaped input to response-shaped output. All the knowledge — grammar, world facts, arithmetic — is already in the weights from pretraining. SFT does not add knowledge; it adds a convention about how to surface it.
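The loss-mask half of that change can be previewed in a few lines. This is a simplified sketch assuming PyTorch: random logits stand in for the model, and the first prompt_len targets are treated as prompt positions; lesson 3's actual template and bookkeeping may differ.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100                      # the default ignore_index of F.cross_entropy
torch.manual_seed(0)

V = 16                                   # assumed toy vocabulary size
prompt_len, response_len = 5, 3          # assumed split of one training sequence
T = prompt_len + response_len
logits = torch.randn(T, V)               # stand-in for model output at each position
targets = torch.randint(0, V, (T,))      # stand-in for the shifted targets

# Mask the prompt: those positions contribute nothing to the loss.
masked_targets = targets.clone()
masked_targets[:prompt_len] = IGNORE_INDEX

sft_loss = F.cross_entropy(logits, masked_targets, ignore_index=IGNORE_INDEX)

# Equivalent: plain cross-entropy over the response positions only.
response_only = F.cross_entropy(logits[prompt_len:], targets[prompt_len:])

print(sft_loss.item(), response_only.item())
```

The mean is taken over the unmasked positions only, so the masked loss matches the response-only loss.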

Lesson 3 walks through that one-mask-one-template trick in detail, explains why masking the prompt is non-negotiable (without it, you are training the model to generate user questions, which is the opposite of the goal), and ends with the question SFT itself cannot answer: what do you do when there is no single correct response, only a preference between two responses? That is the gap DPO closes in lesson 4.