Case study — a RAG knowledge assistant

The code assistant (lesson 13) had a near-static input and tiny output; the consumer chatbot (lesson 14) had cheap input and a concurrency-via-output wall. This one inverts both: the input is huge and changing — 6–8K tokens of freshly retrieved context per query — so the binding wall is the prefill cost of long context. Worse, there is now a second subsystem (retrieval) living inside the same latency budget, and the documents go stale during the day — a cache-coherence wrinkle no earlier case had.

The brief

An enterprise knowledge assistant. Each query: retrieve top-10 chunks ≈ 6–8K tokens of context, prepended to a ~1.5K-token system + tools prefix. Output ~400 tokens. SLO: p95 TTFT < 1.5 s — including retrieval. Load ~500 req/s. Documents are edited through the day, so freshness matters: a stale answer is a wrong answer.

Run the loop. Before reading on: where does the 1.5 s go — retrieval or prefill? Is a bigger model affordable here? And what breaks the moment a document is edited?

Stage 0 · Requirements → the number that matters (lesson 03)

The new structure: TTFT now has two consumers, not one.

TTFT = retrieval_latency + prefill_latency + (first decode step)

Retrieval = embed the query + ANN search + rerank ≈ 50–150 ms. That alone eats up to 10% of the budget before the LLM sees a token. Prefill is the rest, and it is large, because the input is ~8K tokens (7K retrieved + 1.5K system, minus overlap). Output is only 400 tokens, so by Little's Law decode concurrency is modest: assuming ~30 ms per output token, W ≈ 1.5 + 400·0.03 ≈ 13.5 s and L = 500·13.5 ≈ 6,750 requests in flight, a normal fleet. This is a TTFT-via-long-prefill problem, with a retrieval tax on top.
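The concurrency estimate above can be reproduced in a few lines. This is a minimal sketch: the function and parameter names are illustrative, and the 30 ms per-token decode time is the value implied by the 400·0.03 term rather than a measured figure.

```python
# Little's Law sketch: L = lambda * W.
# Names are illustrative; tpot_s = 0.03 is the per-token decode time
# implied by the 400 * 0.03 term in the text.

def in_flight_requests(arrival_rate_rps, ttft_s, output_tokens, tpot_s):
    """Return (W, L): seconds each request spends in the system,
    and the mean number of requests in flight."""
    residence_s = ttft_s + output_tokens * tpot_s      # W
    return residence_s, arrival_rate_rps * residence_s  # L = lambda * W

W, L = in_flight_requests(arrival_rate_rps=500, ttft_s=1.5,
                          output_tokens=400, tpot_s=0.03)
print(f"W = {W:.1f} s, L = {L:,.0f} requests in flight")
```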

Do the prefill arithmetic on a 70B (compute-bound, lesson 01–02, 40% MFU, one H100-equivalent of compute):

prefill_tps = 990e12 · 0.4 / (2 · 70e9) ≈ 2,830 tokens/s
prefill(8000) ≈ 8000 / 2830 ≈ 2.8 s
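The same arithmetic as a small sketch, using only the numbers stated above (990e12 peak FLOP/s, 40% MFU, about 2 FLOPs per parameter per token). The function names are illustrative.

```python
# Compute-bound prefill estimate, following the formula in the text.
PEAK_FLOPS = 990e12   # one H100-equivalent of compute, as in the text
MFU = 0.4             # assumed model FLOPs utilization, as in the text

def prefill_tps(params):
    """Tokens per second when the forward pass costs ~2 FLOPs per parameter per token."""
    return PEAK_FLOPS * MFU / (2 * params)

def prefill_seconds(tokens, params):
    return tokens / prefill_tps(params)

tps_70b = prefill_tps(70e9)
cold_8k = prefill_seconds(8000, 70e9)
print(f"70B: {tps_70b:,.0f} tok/s, cold 8K prefill = {cold_8k:.1f} s")
```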

Binding constraint, stage 0

Prefill of the long retrieved context dominates and blows the budget. At 2.8 s on a 70B, prefill alone is nearly 2× the entire 1.5 s SLO — before adding the 50–150 ms of retrieval. The naive "big model, prepend everything, prefill it cold" design fails on arithmetic, exactly as the code assistant did — but here the redundancy is in the documents, not the user's file.

Stage 1 · Two reusable layers — cache both (lesson 06, SGLang premise)

RAG looks per-request unique, but it has the two-layer prefix structure the SGLang workload lesson opens on:

Layer 1 — the system + tools prefix (~1.5K tokens) is byte-identical on every single request. Prefill it once; it lives permanently in the prefix cache at ~100% hit rate. Cost amortizes to zero.
Layer 2 — the retrieved documents. A heavy knowledge service serves the same hot documents to many different queries; the SGLang lesson measures 60–80% byte reuse on this layer. So precompute and cache the KV of hot document chunks — ideally offline, as a background job over the corpus — keyed by chunk. On a hit, you skip prefilling those chunks entirely and attend to their cached KV via RadixAttention (lesson 06, SGLang 04).

The payoff: if system (1.5K) is always cached and 70% of the 7K retrieved tokens hit the doc-KV cache, the uncached prefill drops to 7000·0.3 ≈ 2,100 tokens → on the 70B, 2100/2830 ≈ 0.74 s. Add 100 ms retrieval → TTFT ≈ 0.85 s. Now under budget — but only because the cache hits.
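Because the result only holds when the cache hits, it can help to invert the payoff calculation and ask what doc-cache hit rate the SLO requires. The sketch below does that under the same simplifications as the paragraph above: the system prefix is always cached, and the first decode step is ignored. The function names are illustrative.

```python
# Invert the stage-1 payoff: what hit rate keeps TTFT under the SLO?
PEAK_FLOPS, MFU = 990e12, 0.4

def ttft_s(retrieved_tokens, hit_rate, retrieval_s, params):
    """TTFT when only the cache-missed retrieved tokens are prefilled."""
    tps = PEAK_FLOPS * MFU / (2 * params)
    return retrieval_s + retrieved_tokens * (1 - hit_rate) / tps

def min_hit_rate(slo_s, retrieved_tokens, retrieval_s, params):
    """Smallest doc-cache hit rate for which ttft_s() stays within slo_s."""
    tps = PEAK_FLOPS * MFU / (2 * params)
    affordable_tokens = (slo_s - retrieval_s) * tps
    return max(0.0, 1 - affordable_tokens / retrieved_tokens)

at_70_pct = ttft_s(7000, 0.70, 0.1, 70e9)
needed = min_hit_rate(1.5, 7000, 0.1, 70e9)
print(f"TTFT at 70% hits: {at_70_pct:.2f} s; hit rate needed for 1.5 s: {needed:.0%}")
```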

Stage 2 · What doesn't cache — chunked prefill + a smaller model (lessons 04, 06)

The per-query unique tokens (the query, cold/cache-missed chunks) still prefill cold. Two levers, each justified by the wall:

Chunked prefill (lesson 06, vLLM 10). An 8K cold prefill is a compute monopoly lasting hundreds of milliseconds to seconds (2.8 s on the 70B above); run it in one shot and it stalls every other request's decode (lesson 04's prefill/decode contention), so decode TPOT spikes for every request sharing that batch. Slice the prefill into chunks interleaved with decode steps so one heavy RAG prefill doesn't freeze the batch.
Go smaller. Prefill cost ∝ N (the 2N rule). A 13B prefills at 990e12·0.4/(2·13e9) ≈ 15,200 tok/s — ~5.4× faster than the 70B. The same 2,100 uncached tokens now cost ~0.14 s. And RAG quality leans on retrieval (did you fetch the right chunk?) far more than on raw model size — the model is mostly reading and summarizing grounded text. Dropping from 70B to 13B is often a near-free latency win here.
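Both levers can be put side by side in a rough sketch. It assumes a chunk's cost is linear in its token count, and ignores attention's growth with context length as well as the decode work that shares each step. The 512-token chunk size is a hypothetical value, not one from the lessons.

```python
# How long can a single prefill step block decode, with and without chunking?
PEAK_FLOPS, MFU = 990e12, 0.4

def prefill_tps(params):
    return PEAK_FLOPS * MFU / (2 * params)

def chunk_sizes(prompt_tokens, chunk_tokens):
    """Split a prompt into prefill chunks; the last chunk may be shorter."""
    full, rest = divmod(prompt_tokens, chunk_tokens)
    return [chunk_tokens] * full + ([rest] if rest else [])

def longest_stall_s(prompt_tokens, chunk_tokens, params):
    """Longest uninterrupted prefill step that decode has to wait behind."""
    return max(chunk_sizes(prompt_tokens, chunk_tokens)) / prefill_tps(params)

for name, params in [("70B", 70e9), ("13B", 13e9)]:
    one_shot = longest_stall_s(8000, 8000, params)
    chunked = longest_stall_s(8000, 512, params)   # 512 is a hypothetical chunk size
    print(f"{name}: one-shot stall {one_shot:.2f} s, chunked stall {chunked * 1000:.0f} ms")
```

Chunking does not reduce total prefill work; it only bounds how long any single step can hold up decode. Shrinking the model reduces both.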

Stage 3 · Freshness / cache invalidation — the new wrinkle

The doc-KV cache from stage 1 is what makes the budget feasible. But documents are edited through the day. The instant a document changes, its cached KV is stale — attending to it serves an answer grounded in the old text. This is the classic cache-coherence problem, now playing out over KV tensors.

The fix is ordinary cache discipline made explicit:

Key the cache by (doc_id, version), not by doc_id alone. An edit bumps the version; the old KV becomes unreachable and is eventually evicted by LRU, while the new version is recomputed on first use.
Invalidate on the write path. The document store emits a change event → evict the affected chunks' KV → re-embed and re-index them for retrieval too (the ANN index is also a cache that goes stale). A chunk is only safely cached between edits.
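The two rules above can be sketched as a small in-memory cache. This is an illustration, not any serving engine's API. The class and method names are hypothetical, the chunk index in the key is an added assumption (stage 1 caches per chunk), and a string stands in for real KV tensors. Re-embedding and re-indexing are not modeled.

```python
from collections import OrderedDict

class DocKVCache:
    """Hypothetical LRU cache keyed by (doc_id, version, chunk_idx)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()   # key -> cached KV (stand-in object)
        self._latest = {}               # doc_id -> current version

    def on_document_changed(self, doc_id, new_version):
        """Write-path hook: record the new version and evict stale chunks eagerly."""
        self._latest[doc_id] = new_version
        stale = [k for k in self._entries if k[0] == doc_id and k[1] != new_version]
        for key in stale:
            del self._entries[key]

    def get_or_compute(self, doc_id, chunk_idx, compute_kv):
        """Return (kv, was_hit). Raises KeyError for a doc never registered."""
        version = self._latest[doc_id]
        key = (doc_id, version, chunk_idx)
        if key in self._entries:
            self._entries.move_to_end(key)          # mark as most recently used
            return self._entries[key], True
        kv = compute_kv(doc_id, version, chunk_idx)  # cold prefill of this chunk
        self._entries[key] = kv
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)        # evict least recently used
        return kv, False

def fake_prefill(doc_id, version, chunk_idx):
    return f"kv({doc_id}@v{version}#{chunk_idx})"

cache = DocKVCache(capacity=2)
cache.on_document_changed("policy", 1)                       # register version 1
_, hit1 = cache.get_or_compute("policy", 0, fake_prefill)    # miss: first use
_, hit2 = cache.get_or_compute("policy", 0, fake_prefill)    # hit
cache.on_document_changed("policy", 2)                       # edit bumps the version
kv3, hit3 = cache.get_or_compute("policy", 0, fake_prefill)  # miss: recomputed from v2
print(hit1, hit2, hit3, kv3)
```

Because the version is part of the key, a lookup can never return KV computed from an older version, even if the eager eviction step were skipped. The eviction only frees memory sooner.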

Binding constraint, stage 3

Correctness vs. the hit rate that makes the latency budget feasible. A long cache TTL maximizes the doc-KV hit rate (and keeps TTFT under 1.5 s) but widens the window in which you serve stale, wrong answers. A short TTL stays close to fresh but collapses the hit rate, pushing prefill, and with it TTFT, back over budget. There is no free setting: you tune the staleness window against the freshness SLA the product actually needs (a legal-docs corpus and a status-page corpus want very different windows). Frequently edited documents should simply not be cached.

Stage 4 · The retrieval subsystem, co-designed (lesson 05)

Retrieval is not free latency you inherit; it is a system you size on its own, inside the shared 1.5 s budget. It has three pieces with different bottlenecks (see the embeddings/ANN material in "search · embeddings & ANN"):

Component	Bottleneck	Sizing & latency
Embedding service (encode the query)	throughput-bound encoder, short input	tiny model, batch the 500 req/s; ~5–15 ms
ANN index (vector search)	memory-resident vectors (HNSW/IVF)	RAM-sized to the corpus; ~10–40 ms
Reranker (score top-k)	small cross-encoder, compute-bound	rerank ~50 candidates; ~20–80 ms

These scale independently of the LLM fleet — different hardware, different replica counts (the ANN index may be CPU/RAM-bound, not GPU-bound). They share only the latency budget, so the design rule is: keep the whole retrieval path under ~150 ms so prefill gets the lion's share of the 1.5 s. If retrieval creeps up, it directly steals headroom from prefill — the two subsystems are coupled only through that one number.
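A quick budget check over the table's latency ranges, as a sketch. It assumes the three stages run sequentially and that their worst cases simply add, which is pessimistic for tail percentiles. The dictionary and function names are illustrative.

```python
# Sum the retrieval path and see what is left of the TTFT budget for prefill.
SLO_MS = 1500
RETRIEVAL_CAP_MS = 150

# (best, worst) latency in ms per component, taken from the table above
RETRIEVAL_MS = {
    "embedding service": (5, 15),
    "ANN index": (10, 40),
    "reranker": (20, 80),
}

def retrieval_range_ms(components):
    """Return (best, worst) end-to-end latency if the stages run one after another."""
    best = sum(lo for lo, _ in components.values())
    worst = sum(hi for _, hi in components.values())
    return best, worst

best_ms, worst_ms = retrieval_range_ms(RETRIEVAL_MS)
within_cap = worst_ms <= RETRIEVAL_CAP_MS
prefill_headroom_ms = SLO_MS - worst_ms
print(f"retrieval {best_ms}-{worst_ms} ms, within cap: {within_cap}, "
      f"prefill headroom: {prefill_headroom_ms} ms")
```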

Binding constraint, final

Prefill latency, gated by the doc-cache hit rate — with cache coherence as the tax you pay to keep that hit rate honest. Retrieval is a fixed ~150 ms slice; the variable that decides whether you clear 1.5 s is how much of the retrieved context you can serve from cached KV. So the system's real job, as in the code assistant (13), is maximizing hit rate — here complicated by documents that change underneath the cache.

Interactive · RAG TTFT budget

Split the 1.5 s budget between retrieval and prefill, then watch the doc-cache hit rate decide whether you are prefill-bound (cache hotter, go smaller) or retrieval-bound (the LLM is fine, optimize the index).

prefill_tps = 990e12·0.4/(2·N·1e9). System prefix always hits the cache (prefill-once); of the retrieved tokens, only the hit fraction is cached. Uncached = total − cached; prefill_time = uncached / prefill_tps; TTFT = retrieval + prefill_time + 3 ms. Order-of-magnitude per the series' ±30% contract; single H100-equivalent of compute.

Inputs (defaults): model params N = 70 B · retrieved context = 7,000 tok · system prefix = 1,500 tok · doc-cache hit rate = 70% · retrieval latency = 100 ms.

Outputs: uncached tokens, prefill time, TTFT, and a verdict against the 1.5 s SLO.
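For readers without the interactive, the calculator's stated formulas can be reproduced as a short sketch. The function name, parameter names, and the returned dictionary are illustrative, and the verdict is reduced to a simple within-SLO flag rather than the widget's own wording.

```python
# Offline version of the TTFT budget calculator, per the formulas above.
def rag_ttft_budget(params_b=70, retrieved_tok=7000, system_tok=1500,
                    hit_rate_pct=70, retrieval_ms=100, slo_s=1.5):
    prefill_tps = 990e12 * 0.4 / (2 * params_b * 1e9)
    total = retrieved_tok + system_tok
    # The system prefix always hits; only the hit fraction of retrieved tokens is cached.
    cached = system_tok + retrieved_tok * hit_rate_pct / 100
    uncached = total - cached
    prefill_s = uncached / prefill_tps
    ttft_s = retrieval_ms / 1000 + prefill_s + 0.003   # + 3 ms first decode step
    return {
        "uncached_tokens": uncached,
        "prefill_s": prefill_s,
        "ttft_s": ttft_s,
        "within_slo": ttft_s <= slo_s,
    }

default = rag_ttft_budget()                  # the default inputs listed above
cold_70b = rag_ttft_budget(hit_rate_pct=0)   # no document chunks cached
warm_13b = rag_ttft_budget(params_b=13)      # smaller model, same hit rate
for label, r in [("70B, 70% hits", default), ("70B, cold docs", cold_70b),
                 ("13B, 70% hits", warm_13b)]:
    print(f"{label}: TTFT = {r['ttft_s']:.2f} s, within SLO: {r['within_slo']}")
```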

What this case teaches

Long retrieved context makes prefill the wall. Unlike the code assistant (near-static input) or the chatbot (cheap input, output-bound), here 8K tokens of fresh context per query dominate TTFT — and a 70B prefills it at only ~2,830 tok/s, blowing the budget cold.
The two-layer prefix structure is the lever. The system prefix caches at ~100%; hot documents cache at 60–80%. Precomputing doc-KV converts most of the prefill into a cache lookup — the only reason 1.5 s is reachable.
A second subsystem lives in the latency budget. Retrieval (embed + ANN + rerank) is co-designed and scaled independently of the LLM fleet, but every ms it spends is a ms prefill can't have.
Freshness is a cache-coherence problem over KV. Key by (doc_id, version) and invalidate on edit — and accept the real tension between staleness and the hit rate the budget depends on. This wrinkle simply doesn't exist in the static-corpus cases.
Smaller is often free. Prefill ∝ N and RAG quality rides on retrieval, so a 13B can be ~5× faster with little quality loss — the opposite instinct from "use the biggest model."