Case study — a RAG knowledge assistant
The code assistant (lesson 13) had a near-static input and tiny output; the consumer chatbot (lesson 14) had cheap input and a concurrency-via-output wall. This one inverts both: the input is huge and changing — 6–8K tokens of freshly retrieved context per query — so the binding wall is the prefill cost of long context. Worse, there is now a second subsystem (retrieval) living inside the same latency budget, and the documents go stale during the day — a cache-coherence wrinkle no earlier case had.
Run the loop. Before reading on: where does the 1.5 s go — retrieval or prefill? Is a bigger model affordable here? And what breaks the moment a document is edited?
Stage 0 · Requirements → the number that matters (lesson 03)
The new structure: TTFT now has two consumers, not one.
Retrieval = embed the query + ANN search + rerank ≈ 50–150 ms. That alone eats up to 10% of the budget before the LLM sees a token. Prefill is the rest — and it is large, because the input is ~8K tokens (7K retrieved + 1.5K system, minus overlap). Output is only 400 tokens, so by Little's Law decode concurrency is modest (W ≈ 1.5 + 400·0.03 ≈ 13.5 s, L = 500·13.5 ≈ 6,750 — a normal fleet). This is a TTFT-via-long-prefill problem, with a retrieval tax on top.
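The Little's Law estimate above can be checked in a few lines. This is a minimal sketch using the lesson's own figures (500 req/s, 1.5 s TTFT, 400 output tokens, 30 ms per output token); the function and parameter names are illustrative, not from any library.

```python
def decode_concurrency(arrival_rate_rps, ttft_s, output_tokens, tpot_s):
    """Little's Law: L = lambda * W.

    W is the time one request spends in the system: time to first token
    plus the time to decode every output token.
    """
    residence_s = ttft_s + output_tokens * tpot_s
    in_flight = arrival_rate_rps * residence_s
    return residence_s, in_flight


residence_s, in_flight = decode_concurrency(
    arrival_rate_rps=500, ttft_s=1.5, output_tokens=400, tpot_s=0.03
)
print(f"W = {residence_s:.1f} s, L = {in_flight:,.0f} requests in flight")
```

Because output is short, W is dominated by a fixed 12 s of decode, so concurrency scales linearly with arrival rate and stays modest.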
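Do the prefill arithmetic on a 70B (compute-bound, lesson 01–02, 40% MFU, one H100-equivalent of compute). By the 2N rule, prefill throughput ≈ 990e12·0.4/(2·70e9) ≈ 2,830 tok/s, so:
prefill(8000) ≈ 8000 / 2830 ≈ 2.8 s
That is nearly twice the 1.5 s budget before retrieval is even counted, so a cold 8K prefill on the 70B cannot meet the target.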
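The same estimate as a reusable function, for trying other model sizes or MFU values. This is a rough sketch: it assumes the 2N rule (about 2 FLOPs per parameter per token, attention FLOPs ignored) and the 990e12 FLOP/s peak the lesson uses for one H100-equivalent. The function names are illustrative.

```python
H100_PEAK_FLOPS = 990e12  # peak figure used in this lesson's arithmetic
MFU = 0.40                # assumed model FLOPs utilization


def prefill_tok_per_s(params, peak_flops=H100_PEAK_FLOPS, mfu=MFU):
    """Compute-bound prefill throughput under the 2N rule."""
    return peak_flops * mfu / (2 * params)


def prefill_seconds(tokens, params):
    """Time to prefill `tokens` input tokens on a model with `params` parameters."""
    return tokens / prefill_tok_per_s(params)


rate_70b = prefill_tok_per_s(70e9)
print(f"70B prefill rate: {rate_70b:,.0f} tok/s")
print(f"8K cold prefill:  {prefill_seconds(8000, 70e9):.2f} s")
```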
Stage 1 · Two reusable layers — cache both (lesson 06, SGLang premise)
RAG looks per-request unique, but it has the two-layer prefix structure the SGLang workload lesson opens on:
- Layer 1 — the system + tools prefix (~1.5K tokens) is byte-identical on every single request. Prefill it once; it lives permanently in the prefix cache at ~100% hit rate. Cost amortizes to zero.
- Layer 2 — the retrieved documents. A heavy knowledge service serves the same hot documents to many different queries; the SGLang lesson measures 60–80% byte reuse on this layer. So precompute and cache the KV of hot document chunks — ideally offline, as a background job over the corpus — keyed by chunk. On a hit, you skip prefilling those chunks entirely and attend to their cached KV via RadixAttention (lesson 06, SGLang 04).
The payoff: if system (1.5K) is always cached and 70% of the 7K retrieved tokens hit the doc-KV cache, the uncached prefill drops to 7000·0.3 ≈ 2,100 tokens → on the 70B, 2100/2830 ≈ 0.74 s. Add 100 ms retrieval → TTFT ≈ 0.85 s. Now under budget — but only because the cache hits.
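To see how sensitive the budget is to the hit rate, the payoff arithmetic can be written as a function of it. A minimal sketch, using the lesson's simplifications: the system prefix is always cached, the query tokens are negligible, retrieval costs a flat 100 ms, and the 70B prefills at about 2,830 tok/s. All names are illustrative.

```python
def ttft_seconds(hit_rate, retrieved_tokens=7000, tok_per_s=2830, retrieval_s=0.100):
    """TTFT = retrieval time + prefill time for the chunks that missed the doc-KV cache."""
    uncached_tokens = retrieved_tokens * (1 - hit_rate)
    return retrieval_s + uncached_tokens / tok_per_s


def min_hit_rate(budget_s=1.5, retrieved_tokens=7000, tok_per_s=2830, retrieval_s=0.100):
    """Lowest doc-KV hit rate at which TTFT still fits inside the budget."""
    max_uncached_tokens = (budget_s - retrieval_s) * tok_per_s
    return max(0.0, 1 - max_uncached_tokens / retrieved_tokens)


for hit_rate in (0.0, 0.5, 0.7, 0.8):
    print(f"hit rate {hit_rate:.0%}: TTFT = {ttft_seconds(hit_rate):.2f} s")
print(f"break-even hit rate: {min_hit_rate():.0%}")
```

Under these assumptions the break-even point sits a little above 40%, so the 60–80% reuse range leaves margin, but a cold cache (0%) misses the budget by more than a second.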
Stage 2 · What doesn't cache — chunked prefill + a smaller model (lessons 04, 06)
The per-query unique tokens (the query, cold/cache-missed chunks) still prefill cold. Two levers, each justified by the wall:
- Chunked prefill (lesson 06, vLLM 10). An 8K cold prefill monopolizes compute for hundreds of milliseconds on a small model and for seconds on the 70B (2.8 s by the arithmetic above). Run it in one shot and it stalls every other request's decode (lesson 04's prefill/decode contention), so decode TPOT spikes fleet-wide. Slice the prefill into chunks interleaved with decode steps so one heavy RAG prefill doesn't freeze the batch.
- Go smaller. Prefill cost ∝ N (the 2N rule). A 13B prefills at 990e12·0.4/(2·13e9) ≈ 15,200 tok/s — ~5.4× faster than the 70B. The same 2,100 uncached tokens now cost ~0.14 s. And RAG quality leans on retrieval (did you fetch the right chunk?) far more than on raw model size — the model is mostly reading and summarizing grounded text. Dropping from 70B to 13B is often a near-free latency win here.
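The chunked-prefill lever can be illustrated with a toy scheduler. This is a deliberately simplified model and not vLLM's actual scheduler: each running decode request consumes one token per step, and whatever remains of a per-step token budget goes to the next slice of the pending prefill. `schedule_steps` and `token_budget` are hypothetical names.

```python
def schedule_steps(prefill_tokens, decode_requests, token_budget):
    """Return a list of (prefill_chunk, decode_tokens) pairs, one per engine step."""
    if token_budget <= decode_requests:
        raise ValueError("token_budget must leave room for prefill tokens")
    steps = []
    remaining = prefill_tokens
    while remaining > 0:
        # Decodes are served first; the prefill only gets the leftover budget.
        chunk = min(remaining, token_budget - decode_requests)
        steps.append((chunk, decode_requests))
        remaining -= chunk
    return steps


# One-shot: the budget is large enough to swallow the whole 8K prefill in one step.
one_shot = schedule_steps(8000, decode_requests=64, token_budget=8000 + 64)
# Chunked: cap each step at 512 tokens so no single step grows large.
chunked = schedule_steps(8000, decode_requests=64, token_budget=512)

print("one-shot largest step:", max(c + d for c, d in one_shot), "tokens")
print("chunked largest step: ", max(c + d for c, d in chunked), "tokens")
print("chunked step count:   ", len(chunked))
```

The trade is visible in the step counts: chunking spreads the same 8,000 prefill tokens over many small steps, so the prefilling request waits longer for its first token while every decode in the batch keeps a bounded step time.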
Stage 3 · Freshness / cache invalidation — the new wrinkle
The doc-KV cache from stage 1 is what makes the budget feasible. But documents are edited through the day. The instant a document changes, its cached KV is stale — attending to it serves an answer grounded in the old text. This is the classic cache-coherence problem, now playing out over KV tensors.
The fix is ordinary cache discipline made explicit:
- Key the cache by `(doc_id, version)`, not by `doc_id` alone. An edit bumps the version; the old KV is now unreachable and gets evicted by LRU, and the new version is recomputed on first use.
- Invalidate on the write path. The document store emits a change event → evict the affected chunks' KV → re-embed and re-index them for retrieval too (the ANN index is also a cache that goes stale). A chunk is only safely cached between edits.
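Both rules fit in a small LRU cache. The sketch below is illustrative only: `DocKVCache`, `compute_kv`, and `on_document_changed` are hypothetical names, the "KV" is a placeholder string rather than real tensors, and the key adds a chunk index as an assumption about how chunks are addressed within a document.

```python
from collections import OrderedDict


class DocKVCache:
    """LRU cache of per-chunk KV, keyed by (doc_id, version, chunk_idx)."""

    def __init__(self, capacity, compute_kv):
        self.capacity = capacity
        self.compute_kv = compute_kv  # hypothetical: runs prefill for one chunk
        self._entries = OrderedDict()

    def get(self, doc_id, version, chunk_idx, text):
        key = (doc_id, version, chunk_idx)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        kv = self.compute_kv(text)  # miss: recompute on first use
        self._entries[key] = kv
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return kv

    def on_document_changed(self, doc_id):
        """Write-path hook: drop every cached chunk of the edited document."""
        stale = [key for key in self._entries if key[0] == doc_id]
        for key in stale:
            del self._entries[key]
        return len(stale)


calls = []

def fake_prefill(text):
    calls.append(text)
    return f"KV<{text}>"

cache = DocKVCache(capacity=2, compute_kv=fake_prefill)
cache.get("doc-1", 1, 0, "old text")  # miss: computed
cache.get("doc-1", 1, 0, "old text")  # hit: served from cache
cache.get("doc-1", 2, 0, "new text")  # version bump: miss, recomputed
print("prefill calls:", len(calls))
```

Versioned keys alone make stale KV unreachable; the explicit `on_document_changed` hook additionally frees the memory right away instead of waiting for LRU. This sketch covers only the KV side. The change event would also need to trigger re-embedding and re-indexing.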
Stage 4 · The retrieval subsystem, co-designed (lesson 05)
Retrieval is not free latency you inherit: it is a system you size, on its own, inside the shared 1.5 s budget. It has three pieces with different bottlenecks (see the embeddings/ANN material in search · embeddings & ANN):
| Component | Bottleneck | Sizing & latency |
|---|---|---|
| Embedding service (encode the query) | throughput-bound encoder, short input | tiny model, batch the 500 req/s; ~5–15 ms |
| ANN index (vector search) | memory-resident vectors (HNSW/IVF) | RAM-sized to the corpus; ~10–40 ms |
| Reranker (score top-k) | small cross-encoder, compute-bound | rerank ~50 candidates; ~20–80 ms |
These scale independently of the LLM fleet — different hardware, different replica counts (the ANN index may be CPU/RAM-bound, not GPU-bound). They share only the latency budget, so the design rule is: keep the whole retrieval path under ~150 ms so prefill gets the lion's share of the 1.5 s. If retrieval creeps up, it directly steals headroom from prefill — the two subsystems are coupled only through that one number.
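The coupling through the shared budget can be made concrete by summing the table's latency ranges. A small sketch, assuming the three stages run sequentially (no overlap between embedding, search, and rerank); the names are illustrative.

```python
# (best case, worst case) latency in ms, taken from the table above
RETRIEVAL_STAGES_MS = {
    "embed": (5, 15),
    "ann": (10, 40),
    "rerank": (20, 80),
}


def retrieval_range_ms(stages=RETRIEVAL_STAGES_MS):
    """Best- and worst-case latency of the whole retrieval path."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def prefill_headroom_ms(retrieval_ms, budget_ms=1500):
    """Whatever retrieval does not spend is what prefill gets."""
    return budget_ms - retrieval_ms


best, worst = retrieval_range_ms()
print(f"retrieval path: {best}-{worst} ms (cap: 150 ms)")
print(f"prefill headroom at worst case: {prefill_headroom_ms(worst)} ms")
```

The reranker accounts for more than half of the worst case, so it is the first place to look if the retrieval path drifts toward the 150 ms cap.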
Interactive · RAG TTFT budget
Split the 1.5 s budget between retrieval and prefill, then watch the doc-cache hit rate decide whether you are prefill-bound (cache hotter, go smaller) or retrieval-bound (the LLM is fine, optimize the index).
What this case teaches
- Long retrieved context makes prefill the wall. Unlike the code assistant (near-static input) or the chatbot (cheap input, output-bound), here 8K tokens of fresh context per query dominate TTFT — and a 70B prefills it at only ~2,830 tok/s, blowing the budget cold.
- The two-layer prefix structure is the lever. The system prefix caches at ~100%; hot documents cache at 60–80%. Precomputing doc-KV converts most of the prefill into a cache lookup — the only reason 1.5 s is reachable.
- A second subsystem lives in the latency budget. Retrieval (embed + ANN + rerank) is co-designed and scaled independently of the LLM fleet, but every ms it spends is a ms prefill can't have.
- Freshness is a cache-coherence problem over KV. Key by `(doc_id, version)` and invalidate on edit, and accept the real tension between staleness and the hit rate the budget depends on. This wrinkle simply doesn't exist in the static-corpus cases.
- Smaller is often free. Prefill ∝ N and RAG quality rides on retrieval, so a 13B can be ~5× faster with little quality loss: the opposite instinct from "use the biggest model."