Case study — long-context document AI (100K–1M tokens)
Every prior case had one binding wall. This one is different: two walls bind at once, and both are driven by the same knob — sequence length. As context grows, prefill compute goes quadratic (the seq² regime of lesson 06 §6) and a single request's KV outgrows one GPU (lesson 02's 320 KB/token). The unusual consequence: you reach for training-style parallelism (lesson 07) to serve one request, because that one request is so big it needs the whole node.
Run the loop. Before reading on: which wall binds — memory, bandwidth, compute, or latency? Here the honest answer is "two of them, simultaneously, and they're the same variable." That co-binding is the whole lesson.
Stage 0/1 · The seq² compute wall (lesson 06 §6)
Prefill FLOPs are not simply 2N·seq at this scale. Lesson 06 §6 split them: the MLP/projection term scales with 2N·seq (linear in context), but the attention score+value term scales with seq²·d_model, which is quadratic. The two cross when seq is roughly 2–3·d_model, which for a 70B (d_model ≈ 8K) lands near ~16–24K tokens.
Below the crossover the 2N term dominates and our napkin math holds. Above it the seq² term takes over and prefill time grows quadratically with document length. Even ignoring the quadratic part, the linear part is already brutal. At ~40% MFU on one H100 (≈989 TFLOP/s dense BF16 peak), the 2N-rate prefill throughput is
0.40 × 989e12 FLOP/s ÷ (2 × 70e9 FLOP/token) ≈ 2,830 tokens/s
So a 128K-token document costs 128,000 / 2,830 ≈ 45 s on the linear term alone, already over the 20 s budget. At 128K we are also well past the ~20K crossover, so the seq² term adds materially on top. A 1M-token document needs ~353 s on the linear term alone, and the seq² term makes the true cost grow faster than linearly.
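The linear-term arithmetic above can be reproduced in a few lines. This is a minimal sketch, not a benchmark: the 989 TFLOP/s figure is an assumed dense BF16 peak for one H100, and the function names are illustrative. It deliberately ignores the seq² attention term, so it gives a floor on prefill time.

```python
# Assumptions (not measurements): one H100 at an assumed 989 TFLOP/s dense
# BF16 peak, 40% MFU, and a 70B-parameter model costing 2N FLOPs per token.
H100_PEAK_FLOPS = 989e12
MFU = 0.40
N_PARAMS = 70e9


def linear_prefill_tokens_per_s(peak_flops=H100_PEAK_FLOPS, mfu=MFU, n_params=N_PARAMS):
    """Prefill throughput if only the 2N FLOPs/token term mattered."""
    return mfu * peak_flops / (2 * n_params)


def linear_prefill_seconds(context_tokens, **kwargs):
    """Lower bound on prefill time: linear term only, no seq^2 attention cost."""
    return context_tokens / linear_prefill_tokens_per_s(**kwargs)


for tokens in (128_000, 1_000_000):
    print(f"{tokens:>9} tokens: >= {linear_prefill_seconds(tokens):.0f} s")
```

Under these assumptions the rate comes out near 2,830 tokens/s and the 128K document near 45 s, matching the figures in the text.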
Stage 2 · The KV capacity wall (lessons 02, 04)
The second wall arrives from the same knob. Lesson 02's formula gives kv_bytes/token = 2·L·H_kv·d·dtype = 320 KB/token for a 70B. Multiply by context:
128K tokens × 320 KB/token ≈ 41 GB
1M tokens × 320 KB/token = 320 GB
An H100 holds 80 GB. The 140 GB of weights already span two GPUs and leave only ~20 GB free across them, so a 41 GB KV blob does not fit alongside, and a 320 GB KV would fill four H100s' entire memory by itself, with no room left for weights. A single request's KV no longer fits on one GPU. This breaks the assumption underlying every prior case: that one request lives on one replica and parallelism exists to serve many requests.
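A short sketch of the capacity arithmetic follows. The layer count, KV-head count, and head dimension (80, 8, 128) are assumed values for a Llama-style 70B with grouped-query attention; they are not stated in this lesson, and they give 327,680 bytes, which the text rounds to 320 KB. The `min_gpus` helper is hypothetical and counts only weights plus one request's KV, so it is a lower bound that ignores activations and fragmentation.

```python
import math


def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """2 (K and V) x layers x KV heads x head dim x bytes per element.
    Defaults are assumed Llama-style 70B values, not taken from the lesson."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def request_kv_gb(context_tokens, bytes_per_token=320_000):
    """KV footprint of one request, using the lesson's rounded 320 KB/token."""
    return context_tokens * bytes_per_token / 1e9


def min_gpus(context_tokens, weights_gb=140.0, gpu_gb=80.0):
    """Lower bound on GPUs needed for weights plus one request's KV."""
    return math.ceil((weights_gb + request_kv_gb(context_tokens)) / gpu_gb)


for tokens in (128_000, 1_000_000):
    print(tokens, round(request_kv_gb(tokens), 2), "GB KV,", min_gpus(tokens), "GPUs minimum")
```

With these inputs the 128K request needs at least three 80 GB GPUs and the 1M request at least six, which is why a single request ends up spread across the node.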
Stage 3 · Sequence / context parallelism (lesson 07)
Lesson 07 listed sequence/context parallelism as the long-context training lever: shard the sequence dimension across GPUs so each holds a slice of the tokens' activations. The same axis solves serving here. Each GPU holds part of the KV and computes attention over its slice; a ring/context-parallel exchange passes the K/V slices around so every query token can attend to the whole sequence. Mechanism: system_ml 08.
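To make the ring exchange concrete, here is a single-process NumPy simulation. It is a sketch under simplifying assumptions: one attention head, no causal mask, and the "devices" are just array slices in a loop, so nothing is actually communicated. What it does show is the core idea: each device keeps its own query slice, sees one K/V slice per ring step, and merges partial results with a running softmax so the final output equals attention over the whole sequence. The function names are illustrative.

```python
import numpy as np


def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the whole sequence."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def ring_attention(q, k, v, n_devices):
    """Simulated context parallelism with a running (online) softmax."""
    qs = np.array_split(q, n_devices)
    ks = np.array_split(k, n_devices)
    vs = np.array_split(v, n_devices)
    scale = 1.0 / np.sqrt(q.shape[-1])
    outputs = []
    for dev in range(n_devices):
        q_loc = qs[dev]
        row_max = np.full((q_loc.shape[0], 1), -np.inf)  # running max per query row
        denom = np.zeros((q_loc.shape[0], 1))            # running softmax denominator
        acc = np.zeros((q_loc.shape[0], v.shape[-1]))    # running unnormalized output
        for step in range(n_devices):
            src = (dev + step) % n_devices               # K/V slice arriving this step
            s = q_loc @ ks[src].T * scale
            new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
            p = np.exp(s - new_max)
            correction = np.exp(row_max - new_max)       # rescale earlier partial sums
            denom = denom * correction + p.sum(axis=-1, keepdims=True)
            acc = acc * correction + p @ vs[src]
            row_max = new_max
        outputs.append(acc / denom)
    return np.concatenate(outputs)


rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 16, 8))
print(np.allclose(ring_attention(q, k, v, n_devices=4), full_attention(q, k, v)))
```

Each device only ever holds one K/V slice at a time, which is what lets the per-GPU KV footprint shrink with the number of devices; a real implementation would add the causal mask and overlap the slice transfer with compute.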
This is parallelism applied to a single request — unusual. Normally parallelism serves many requests (lesson 05); here one request is so big it needs the whole node to itself. Two companions from lesson 06 compose with it:
- Chunked prefill. A 45 s monolithic prefill would freeze everything; split it into chunks so TTFT growth is bounded and any co-resident decode isn't starved (lesson 06 §4). It does not reduce total work — the seq² FLOPs are still there — but it bounds the spike.
- KV quantization (fp8). Store K/V at fp8 instead of fp16 and lesson 02's KV bytes halve: the 41 GB request drops to ~20 GB, the 1M request to ~160 GB — fewer GPUs to shard across. It targets the capacity wall directly (lesson 06 §1), at a small long-context accuracy risk you gate behind evals (lesson 10).
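The two companions above reduce to simple arithmetic, sketched here. The 8,192-token chunk size is a hypothetical choice, and the per-chunk time uses only the linear 2,830 tokens/s rate, so later chunks (which attend over more cached KV) would in practice take longer than this estimate.

```python
import math


def chunk_plan(prompt_tokens, chunk_tokens, tokens_per_s):
    """Return (number of chunks, seconds per full chunk, total seconds).
    Total work is unchanged; only the longest uninterrupted stall shrinks."""
    n_chunks = math.ceil(prompt_tokens / chunk_tokens)
    per_chunk_s = chunk_tokens / tokens_per_s
    total_s = prompt_tokens / tokens_per_s
    return n_chunks, per_chunk_s, total_s


def kv_gb(context_tokens, dtype_bytes, fp16_bytes_per_token=320_000):
    """Request KV size when K/V are stored at dtype_bytes per element."""
    return context_tokens * fp16_bytes_per_token * (dtype_bytes / 2) / 1e9


n, per_chunk, total = chunk_plan(128_000, chunk_tokens=8_192, tokens_per_s=2_830)
print(f"{n} chunks, ~{per_chunk:.1f} s each, ~{total:.0f} s total")
print(f"fp16 KV: {kv_gb(128_000, 2):.1f} GB, fp8 KV: {kv_gb(128_000, 1):.1f} GB")
```

The plan makes the trade explicit: the total stays near 45 s, but a co-resident decode waits for at most one chunk rather than the whole prefill, and fp8 halves the bytes that context parallelism has to spread across GPUs.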
Stage 4 · Amortize the giant prefill (lesson 06 prefix caching)
Here is the single biggest design win when the access pattern allows it. Often the same document is queried many times: a lawyer asks 20 questions of one contract, an engineer asks 50 of one codebase. Naively, each query re-prefills the whole 128K document: 20 × 45 s = 900 s of prefill, of which everything after the first 45 s is redundant. Instead, prefill the document's KV once and reuse it across every query (lesson 06 §3 prefix caching; the same hot-doc reuse as RAG in lesson 15).
At 20 queries the 45 s prefill amortizes to ~2.3 s/query — under budget, and the marginal query is just decode over a question and a 1K-token answer. The wall moves: it is no longer per-query prefill compute, it is the one-time prefill plus the KV storage you must keep resident (41 GB, or 20 GB at fp8) for as long as the document is hot.
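The amortization can be written as a two-line cost model. This is a sketch with hypothetical function names; `marginal_s` stands for the per-query decode cost, which the text treats as small and which defaults to zero here.

```python
import math


def per_query_seconds(prefill_s, n_queries, marginal_s=0.0, reuse_kv=True):
    """Average cost per query, with or without reusing the document's KV."""
    if n_queries < 1:
        raise ValueError("n_queries must be >= 1")
    if reuse_kv:
        return prefill_s / n_queries + marginal_s  # one prefill shared by all queries
    return prefill_s + marginal_s                  # every query pays the full prefill


def min_queries_for_budget(prefill_s, budget_s, marginal_s=0.0):
    """Fewest queries per document at which the amortized cost fits the budget."""
    headroom = budget_s - marginal_s
    if headroom <= 0:
        raise ValueError("marginal cost alone exceeds the budget")
    return math.ceil(prefill_s / headroom)


print(per_query_seconds(45, 20))                  # amortized over 20 queries
print(per_query_seconds(45, 20, reuse_kv=False))  # naive re-prefill
print(min_queries_for_budget(45, 20))             # queries needed to get under 20 s
```

With the lesson's numbers the amortized cost is 2.25 s/query, and reuse already brings the average under the 20 s budget at three queries per document. Note that this is an average: the first query still waits for the full one-time prefill.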
Interactive · the seq² wall & single-GPU KV ceiling
Drag context length up and watch both walls close together: prefill time goes super-linear (the seq² factor), and the KV bar blows past one GPU into a context-parallel GPU count. Then turn the amortization knob — queries per document — and watch the per-query cost collapse as the one-time prefill spreads over many cheap queries.
What this case teaches
- Two walls, one knob. Unlike every prior case, long context binds on both compute and capacity at once — and both are driven by sequence length. Relieving one forces you to confront the other.
- The seq² compute regime is real (lesson 06 §6). Past the ~16–24K crossover for a 70B, prefill stops being 2N·seq and goes quadratic; TTFT scales worse than linearly. Long context is its own cost regime.
- Parallelism for a single request. When one request's KV (41 GB at 128K, 320 GB at 1M) exceeds one GPU, you shard its sequence across the node with context/sequence parallelism (lesson 07) — training's long-context lever, repurposed for serving one giant request.
- Prefill amortization is the biggest win when reuse exists. If a document is queried many times, prefill it once and reuse the KV (lesson 06 prefix caching); the wall moves from per-query prefill to one-time prefill + KV residency. Compose with chunked prefill (bound the spike) and fp8 KV (halve the bytes).