
Case study — long-context document AI (100K–1M tokens)

Every prior case had one binding wall. This one is different: two walls bind at once, and both are driven by the same knob — sequence length. As context grows, prefill compute goes quadratic (the seq² regime of lesson 06 §6) and a single request's KV outgrows one GPU (lesson 02's 320 KB/token). The unusual consequence: you reach for training-style parallelism (lesson 07) to serve one request, because that one request is so big it needs the whole node.

The brief
Summarize and answer questions over very long documents: legal contracts, whole codebases, books.
Context: 100K–1M tokens.
Output: ~1,000 tokens (a summary or answer).
Traffic: low QPS. Each request is enormous, so a handful of concurrent requests already saturates a node.
SLO: TTFT may be seconds but is bounded, say < 20 s; cost per request matters because each one is expensive.
Model: a 70B dense (lesson 02 numbers: 320 KB/token KV, 140 GB weights).

Run the loop. Before reading on: which wall binds — memory, bandwidth, compute, or latency? Here the honest answer is "two of them, simultaneously, and they're the same variable." That co-binding is the whole lesson.

Stage 0/1 · The seq² compute wall (lesson 06 §6)

Prefill FLOPs are not simply 2N·seq at this scale. Lesson 06 §6 split them: the MLP/projection term scales with 2N·seq (linear in context), but the attention score+value term scales with seq²·d_model, which is quadratic. They cross when seq ≈ 12·d_model, which for a 70B (d_model ≈ 8K) lands near ~100K tokens (12 · 8192 ≈ 98K).

prefill FLOPs(seq) ≈ 2N·seq  (MLP, linear)  +  ≈ seq²·d_model  (attention, quadratic)
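As a rough check on where the two terms cross, here is a minimal Python sketch. The constants are assumptions for illustration, not values taken from lesson 06: 80 layers, d_model = 8192, and a causal-attention cost of about 2·layers·seq²·d_model FLOPs (the score and value products, halved by the causal mask).

```python
# Sketch: linear vs quadratic prefill FLOPs for a 70B dense model.
# ASSUMPTIONS (not from the lesson): N_LAYERS = 80, D_MODEL = 8192, and
# causal attention costing ~2 * layers * seq^2 * d_model FLOPs.
N_PARAMS = 70e9
N_LAYERS = 80
D_MODEL = 8192


def linear_flops(seq: int) -> float:
    """Weight-matmul term (MLP + projections): 2 FLOPs per parameter per token."""
    return 2 * N_PARAMS * seq


def attention_flops(seq: int) -> float:
    """Score + value term: quadratic in sequence length."""
    return 2 * N_LAYERS * seq**2 * D_MODEL


def crossover_seq() -> float:
    """Sequence length at which the two terms are equal."""
    return 2 * N_PARAMS / (2 * N_LAYERS * D_MODEL)


if __name__ == "__main__":
    print(f"crossover ~ {crossover_seq():,.0f} tokens")
    for seq in (16_000, 128_000, 1_000_000):
        ratio = attention_flops(seq) / linear_flops(seq)
        print(f"seq={seq:>9,}  attention/linear = {ratio:.2f}")
```

With these assumed constants the crossover comes out a little above 100K tokens, in line with the 12·d_model rule of thumb for d_model = 8192. At 1M tokens the attention term is roughly nine times the linear one.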

Below the crossover the 2N term dominates and our napkin math holds. Above it the seq² term takes over and prefill time grows quadratically with document length. Even ignoring the quadratic part, the linear part is already brutal. At ~40% MFU on one H100 the 2N-rate prefill throughput is

990e12 · 0.4 / (2 · 70e9) ≈ 2,830 tokens/s

So a 128K-token document costs 128000 / 2830 ≈ 45 s on the linear term alone, already over the 20 s budget. And at 128K we are past the crossover (12·d_model ≈ 98K tokens for d_model = 8192), so the seq² term is now comparable to the linear one and adds materially on top. A 1M-token document is worse than linearly worse.
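The linear-term arithmetic above can be reproduced with a few lines of Python. This is a sketch using only the figures quoted in the text (990e12 peak FLOP/s, 40% MFU, 70e9 parameters); the function names are illustrative.

```python
# Sketch: linear-term (2N-rate) prefill throughput and time on one GPU.
PEAK_FLOPS = 990e12  # H100 peak figure used in the text
MFU = 0.4            # ~40% model FLOPs utilization, as in the text
N_PARAMS = 70e9      # 70B dense model


def prefill_tokens_per_s(n_params=N_PARAMS, peak=PEAK_FLOPS, mfu=MFU):
    """Tokens/s if every token costs 2 FLOPs per parameter."""
    return peak * mfu / (2 * n_params)


def linear_prefill_seconds(seq_len, **kwargs):
    """Prefill time from the linear term only; ignores the seq^2 attention cost."""
    return seq_len / prefill_tokens_per_s(**kwargs)


if __name__ == "__main__":
    print(f"{prefill_tokens_per_s():,.0f} tokens/s")
    for seq in (128_000, 1_000_000):
        print(f"{seq:>9,} tokens: {linear_prefill_seconds(seq):.0f} s")
```

Because this ignores the quadratic term, the times it prints are lower bounds on single-GPU prefill, not estimates of it.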

Binding constraint #1 — prefill is firmly compute-bound and super-linear
This is the opposite of the code-assistant case (lesson 13), where prefill was tiny and the problem was redundancy. Here the prefill is the dominant cost and TTFT scales worse than linearly with document length. You are deep on the right of the roofline ridge (lesson 02): more FLOPs is the only thing that helps, and there aren't enough on one GPU. This is a compute wall, and the compute tools (parallelize it, chunk it) are what apply.

Stage 2 · The KV capacity wall (lessons 02, 04)

The second wall arrives from the same knob. Lesson 02's kv_bytes/token = 2·L·H_kv·d·dtype = 320 KB/token for a 70B. Multiply by context:

128K request: 128000 · 320KB ≈ 41 GB of KV  |  1M request: 1e6 · 320KB ≈ 320 GB of KV
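A small sketch of the kv_bytes/token formula follows. The layer count, KV-head count, and head dimension are assumptions (a common grouped-query layout for a 70B model); the lesson itself only states the resulting 320 KB/token.

```python
# Sketch: KV-cache bytes per token and per request.
# ASSUMPTION: these shape values are a typical GQA layout for a 70B model;
# the lesson only gives the resulting ~320 KB/token.
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128


def kv_bytes_per_token(dtype_bytes=2):
    """2 (one K and one V) * layers * KV heads * head dim * bytes per element."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * dtype_bytes


def kv_gb(seq_len, dtype_bytes=2):
    """KV-cache size of one request in decimal GB."""
    return seq_len * kv_bytes_per_token(dtype_bytes) / 1e9


if __name__ == "__main__":
    print(kv_bytes_per_token(), "bytes/token at fp16")
    for seq in (128_000, 1_000_000):
        print(f"{seq:>9,} tokens: {kv_gb(seq):.0f} GB fp16, {kv_gb(seq, 1):.0f} GB fp8")
```

Under these assumed shapes `kv_bytes_per_token()` is 327,680 bytes, that is 320 KiB, so the totals land a couple of percent above the rounded 41 GB and 320 GB figures.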

An H100 holds 80 GB. The 140 GB of weights already span two GPUs (160 GB), leaving about 20 GB, so a 41 GB KV blob does not fit alongside them; a 320 GB KV by itself fills the combined memory of four H100s before any weights are loaded. A single request's KV no longer fits on one GPU. This breaks the assumption underlying every prior case: that one request lives on one replica and parallelism exists to serve many requests.
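To make the capacity argument concrete, the sketch below counts the fewest 80 GB GPUs whose pooled memory holds the weights plus one request's KV. It assumes memory pools perfectly across GPUs and ignores activations and runtime overhead, so real deployments need more headroom than this reports.

```python
import math

HBM_GB = 80.0          # H100 capacity from the text
WEIGHTS_GB = 140.0     # 70B parameters at 2 bytes each
KV_KB_PER_TOKEN = 320  # lesson 02 figure


def request_kv_gb(seq_len):
    """KV-cache size of one request, treating 1 KB as 1000 bytes like the text."""
    return seq_len * KV_KB_PER_TOKEN * 1e3 / 1e9


def min_gpus(seq_len, hbm_gb=HBM_GB, weights_gb=WEIGHTS_GB):
    """Smallest GPU count whose pooled HBM holds weights + one request's KV.

    ASSUMPTION: perfect pooling, no activation or fragmentation overhead.
    """
    return math.ceil((weights_gb + request_kv_gb(seq_len)) / hbm_gb)


if __name__ == "__main__":
    for seq in (0, 128_000, 1_000_000):
        print(f"{seq:>9,} tokens -> at least {min_gpus(seq)} GPUs")
```

Even in this optimistic accounting a single 128K request pushes the footprint from two GPUs to three, and a 1M request to six.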

Binding constraint #2 — one request's KV exceeds one GPU
Note what just happened: stage 1 (compute) and stage 2 (capacity) are both functions of seq, and both go past breaking point in the same context range. You cannot relieve one without confronting the other. The capacity wall forces you to shard one request's sequence across GPUs — and conveniently, that same sharding spreads the seq² compute too.

Stage 3 · Sequence / context parallelism (lesson 07)

Lesson 07 listed sequence/context parallelism as the long-context training lever: shard the sequence dimension across GPUs so each holds a slice of the tokens' activations. The same axis solves serving here. Each GPU holds part of the KV and computes attention over its slice; a ring/context-parallel exchange passes the K/V slices around so every query token can attend to the whole sequence. Mechanism: system_ml 08.

one 320 GB KV (1M tokens) sharded across a node:
  GPU0: tokens 0–250K      (≈80 GB KV slice)
  GPU1: tokens 250K–500K   (≈80 GB KV slice)
  GPU2: tokens 500K–750K   (≈80 GB KV slice)
  GPU3: tokens 750K–1M     (≈80 GB KV slice)
  └── ring exchange of K/V so each query attends the full sequence ──┘
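The ring exchange can be illustrated with a single-process simulation. This is a toy sketch, not a multi-GPU implementation: it assumes one attention head, non-causal attention, and a sequence length that divides evenly across devices, and it uses an online-softmax accumulator so each simulated device only ever holds one K/V block at a time. The function names are hypothetical.

```python
import math
import random


def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v computed in one place."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def ring_attention(q, k, v, n_dev):
    """Simulate n_dev devices, each owning one contiguous slice of q, k and v."""
    d, dv = len(q[0]), len(v[0])
    size = len(q) // n_dev  # ASSUMPTION: length divides evenly

    def shard(x, i):
        return x[i * size:(i + 1) * size]

    qs = [shard(q, i) for i in range(n_dev)]
    # K/V block each device currently holds; it starts with its own slice.
    held = [(shard(k, i), shard(v, i)) for i in range(n_dev)]
    # Per-query online-softmax state: running max, denominator, weighted sum.
    m = [[-math.inf] * size for _ in range(n_dev)]
    den = [[0.0] * size for _ in range(n_dev)]
    acc = [[[0.0] * dv for _ in range(size)] for _ in range(n_dev)]

    for _ in range(n_dev):
        for dev in range(n_dev):
            kb, vb = held[dev]
            for r, qi in enumerate(qs[dev]):
                scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                          for kj in kb]
                new_m = max(m[dev][r], max(scores))
                scale = math.exp(m[dev][r] - new_m)  # rescale earlier partials
                w = [math.exp(s - new_m) for s in scores]
                den[dev][r] = den[dev][r] * scale + sum(w)
                acc[dev][r] = [a * scale + sum(wj * vj[c] for wj, vj in zip(w, vb))
                               for c, a in enumerate(acc[dev][r])]
                m[dev][r] = new_m
        # Ring step: every device passes its block to the next neighbour.
        held = [held[(dev - 1) % n_dev] for dev in range(n_dev)]

    return [[a / den[dev][r] for a in acc[dev][r]]
            for dev in range(n_dev) for r in range(size)]


if __name__ == "__main__":
    random.seed(0)
    make = lambda n, dim: [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    q, k, v = make(8, 4), make(8, 4), make(8, 4)
    ref, out = full_attention(q, k, v), ring_attention(q, k, v, n_dev=4)
    print(max(abs(a - b) for x, y in zip(ref, out) for a, b in zip(x, y)))
```

After n_dev ring steps every query has seen every K/V block, so the sharded result matches full attention up to floating-point error. A real system would overlap the block transfer with the attention compute and apply a causal mask.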

This is parallelism applied to a single request — unusual. Normally parallelism serves many requests (lesson 05); here one request is so big it needs the whole node to itself. Two companions from lesson 06 compose with it:

Stage 4 · Amortize the giant prefill (lesson 06 prefix caching)

Here is the single biggest design win when the access pattern allows it. Often the same document is queried many times: a lawyer asks 20 questions of one contract, an engineer asks 50 of one codebase. Naively, each query re-prefills the whole 128K document: 20 × 45 s = 900 s of prefill, of which 19 × 45 s is pure repetition. Instead, prefill the document's KV once and reuse it across every query (lesson 06 §3 prefix caching; the same hot-doc reuse as RAG in lesson 15).

amortized prefill / query ≈ (one-time 45 s prefill) / queries_per_doc  +  cheap per-query decode

At 20 queries the 45 s prefill amortizes to ~2.3 s/query — under budget, and the marginal query is just decode over a question and a 1K-token answer. The wall moves: it is no longer per-query prefill compute, it is the one-time prefill plus the KV storage you must keep resident (41 GB, or 20 GB at fp8) for as long as the document is hot.
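The amortization arithmetic is simple enough to sketch directly. The helper names below are illustrative, and the 45 s figure is the linear-term prefill estimate from earlier in the lesson.

```python
import math


def amortized_prefill_s(prefill_s, queries_per_doc):
    """One-time prefill cost averaged over every query against the same document."""
    if queries_per_doc < 1:
        raise ValueError("queries_per_doc must be >= 1")
    return prefill_s / queries_per_doc


def min_queries_for_budget(prefill_s, budget_s):
    """Fewest queries per document for the amortized prefill to fit the budget."""
    return math.ceil(prefill_s / budget_s)


if __name__ == "__main__":
    for n in (1, 5, 20, 50):
        print(f"{n:>2} queries/doc -> {amortized_prefill_s(45.0, n):.2f} s/query")
    print("queries needed for a 20 s average:", min_queries_for_budget(45.0, 20.0))
```

Note that this is an average: the first query against a cold document still waits for the full prefill unless the document is prefilled ahead of time.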

Binding constraint, final — the wall moves with the access pattern
With reuse, both stage-1 and stage-2 walls collapse from per-query to one-time costs. The system's job becomes KV residency and eviction policy: keep hot documents' KV pinned across context-parallel shards, evict cold ones. We never bought a magically faster prefill; when the document is reused we simply stop repeating it — and when it isn't, we shard it across the node to survive both walls at once.
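One way to picture the residency problem is a least-recently-used tracker over documents' KV sizes. This is a minimal sketch under assumptions: LRU is chosen here only as a simple stand-in for "evict cold ones", the `KVResidency` class is hypothetical, and it tracks sizes only, not the sharded tensors themselves.

```python
from collections import OrderedDict


class KVResidency:
    """Hypothetical LRU tracker for which documents' KV stays resident."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.used_gb = 0.0
        self._docs = OrderedDict()  # doc_id -> KV size in GB, coldest first

    def touch(self, doc_id, kv_gb):
        """Record a query against doc_id.

        Returns True on a KV hit (no prefill needed) and False when the
        document has to be prefilled and admitted.
        """
        if doc_id in self._docs:
            self._docs.move_to_end(doc_id)  # mark as most recently used
            return True
        if kv_gb > self.capacity_gb:
            raise ValueError("document KV larger than total capacity")
        while self.used_gb + kv_gb > self.capacity_gb:
            _, freed = self._docs.popitem(last=False)  # evict the coldest doc
            self.used_gb -= freed
        self._docs[doc_id] = kv_gb
        self.used_gb += kv_gb
        return False

    def resident(self):
        """Document ids currently resident, coldest first."""
        return list(self._docs)


if __name__ == "__main__":
    cache = KVResidency(capacity_gb=100)
    for doc in ("contract", "codebase", "contract", "book"):
        print(doc, "hit" if cache.touch(doc, 41) else "prefill")
    print(cache.resident())
```

In the usage above, the third 41 GB document does not fit in 100 GB, so the least recently queried one is evicted and would need a fresh prefill if asked about again.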

Interactive · the seq² wall & single-GPU KV ceiling

Drag context length up and watch both walls close together: prefill time goes super-linear (the seq² factor), and the KV bar blows past one GPU into a context-parallel GPU count. Then turn the amortization knob — queries per document — and watch the per-query cost collapse as the one-time prefill spreads over many cheap queries.


Assumptions: linear prefill at 2N-rate = 990e12·0.4/(2N) tok/s; the seq² term is modeled simply as multiplying prefill time by (1 + ctx/crossover) with crossover ≈ 12·8192 ≈ 100K (a stand-in for the quadratic attention cost — it understates 1M but captures the regime). KV/token ≈ 320 KB scaled by N/70 and by dtype/2 (fp8=1, fp16=2). Usable HBM per GPU ≈ 80 − 2N − 4. $3/GPU-hr; decode ≈ 1K tokens at the 2N decode bound. Order-of-magnitude per the series' ±30% contract.
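For readers without the interactive widget, the stated assumptions can be approximated in a few lines. This is a sketch, not the widget's actual code. Two readings here are my assumptions: the GPU count pools memory (weights of 2N GB plus KV, with 4 GB reserved per 80 GB GPU), and the dollar figure counts prefill GPU-seconds only, assuming ideal scaling and leaving decode out.

```python
import math

PEAK_FLOPS, MFU = 990e12, 0.4
CROSSOVER = 12 * 8192            # ~98K tokens, the stand-in crossover
GPU_GB, RESERVE_GB = 80.0, 4.0   # ASSUMPTION: 4 GB reserved per GPU
DOLLARS_PER_GPU_HR = 3.0


def widget_model(ctx, queries_per_doc=1, n_params_b=70.0, dtype_bytes=2):
    """Approximate the widget readouts for a context of ctx tokens."""
    tok_per_s = PEAK_FLOPS * MFU / (2 * n_params_b * 1e9)
    # Linear prefill time, inflated by the (1 + ctx/crossover) stand-in factor.
    prefill_s = ctx / tok_per_s * (1 + ctx / CROSSOVER)
    # 320 KB/token at 70B fp16, scaled by model size and KV dtype.
    kv_gb = ctx * 320e3 * (n_params_b / 70.0) * (dtype_bytes / 2) / 1e9
    weights_gb = 2 * n_params_b
    # ASSUMPTION: weights + KV pooled over GPUs with RESERVE_GB held back each.
    gpus = math.ceil((weights_gb + kv_gb) / (GPU_GB - RESERVE_GB))
    # ASSUMPTION: prefill_s is single-GPU-equivalent time, so GPU-seconds are
    # unchanged by ideal parallelism; decode cost is not included.
    prefill_dollars = prefill_s * DOLLARS_PER_GPU_HR / 3600
    return {
        "prefill_s": prefill_s,
        "kv_gb": kv_gb,
        "gpus": gpus,
        "amortized_prefill_dollars": prefill_dollars / queries_per_doc,
    }


if __name__ == "__main__":
    for ctx in (128_000, 1_000_000):
        print(ctx, widget_model(ctx, queries_per_doc=20))
```

Raising `queries_per_doc` leaves the first three readouts unchanged and divides only the dollar figure, which is the amortization effect the widget is meant to show.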

Readouts: prefill time (seq²) · KV size · GPUs (weights+KV) · amortized $/query

What this case teaches