ML system design · lesson 15 / 20

Case study — a RAG knowledge assistant

The code assistant (lesson 13) had a near-static input and tiny output; the consumer chatbot (lesson 14) had cheap input and a concurrency-via-output wall. This one inverts both: the input is huge and changing — 6–8K tokens of freshly retrieved context per query — so the binding wall is the prefill cost of long context. Worse, there is now a second subsystem (retrieval) living inside the same latency budget, and the documents go stale during the day — a cache-coherence wrinkle no earlier case had.

The brief
An enterprise knowledge assistant. Each query: retrieve top-10 chunks ≈ 6–8K tokens of context, prepended to a ~1.5K-token system + tools prefix. Output ~400 tokens. SLO: p95 TTFT < 1.5 s — including retrieval. Load ~500 req/s. Documents are edited through the day, so freshness matters: a stale answer is a wrong answer.

Run the loop. Before reading on: where does the 1.5 s go — retrieval or prefill? Is a bigger model affordable here? And what breaks the moment a document is edited?

Stage 0 · Requirements → the number that matters (lesson 03)

The new structure: TTFT now has two consumers, not one.

TTFT = retrieval_latency + prefill_latency + (first decode step)

Retrieval = embed the query + ANN search + rerank ≈ 50–150 ms. That alone eats up to 10% of the budget before the LLM sees a token. Prefill is the rest — and it is large, because the input is ~8K tokens (7K retrieved + 1.5K system, minus overlap). Output is only 400 tokens, so by Little's Law decode concurrency is modest (W ≈ 1.5 + 400·0.03 ≈ 13.5 s, L = 500·13.5 ≈ 6,750 — a normal fleet). This is a TTFT-via-long-prefill problem, with a retrieval tax on top.
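The Little's Law estimate above can be reproduced in a few lines. This is a minimal sketch using only the numbers quoted in the paragraph (500 req/s, 1.5 s TTFT, 400 output tokens, 0.03 s per decoded token); the function name is illustrative, not part of the lesson.

```python
def decode_concurrency(req_per_s, ttft_s, output_tokens, sec_per_token):
    """Little's Law: L = arrival_rate * W, with W = TTFT + decode time."""
    residence_s = ttft_s + output_tokens * sec_per_token  # W, seconds in system
    in_flight = req_per_s * residence_s                   # L, concurrent requests
    return residence_s, in_flight


w, l = decode_concurrency(req_per_s=500, ttft_s=1.5,
                          output_tokens=400, sec_per_token=0.03)
print(f"W = {w:.1f} s, L = {l:,.0f} requests in flight")
```

Because decode time dominates W, halving the output length would roughly halve the concurrency, which is why this case treats decode as a non-issue and looks at prefill instead.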

Do the prefill arithmetic on a 70B (compute-bound, lesson 01–02, 40% MFU, one H100-equivalent of compute):

prefill_tps = 990e12 · 0.4 / (2 · 70e9) ≈ 2,830 tokens/s
prefill(8000) ≈ 8000 / 2830 ≈ 2.8 s

Binding constraint, stage 0
Prefill of the long retrieved context dominates and blows the budget. At 2.8 s on a 70B, prefill alone is nearly 2× the entire 1.5 s SLO — before adding the 50–150 ms of retrieval. The naive "big model, prepend everything, prefill it cold" design fails on arithmetic, exactly as the code assistant did — but here the redundancy is in the documents, not the user's file.
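The stage-0 arithmetic, written out as a small sketch. It assumes the lesson's own constants (990e12 peak FLOP/s, 40% MFU, roughly 2 FLOPs per parameter per token) and a single H100-equivalent of compute; the helper names are hypothetical.

```python
PEAK_FLOPS = 990e12   # one H100-equivalent, as used in the lesson
MFU = 0.4             # model FLOPs utilization assumed by the lesson


def prefill_tokens_per_s(params, peak_flops=PEAK_FLOPS, mfu=MFU):
    # A forward pass costs about 2 FLOPs per parameter per token.
    return peak_flops * mfu / (2 * params)


def prefill_seconds(tokens, params):
    return tokens / prefill_tokens_per_s(params)


tps = prefill_tokens_per_s(70e9)
cold = prefill_seconds(8000, 70e9)
print(f"prefill rate: {tps:,.0f} tokens/s")
print(f"cold prefill of 8,000 tokens: {cold:.2f} s ({cold / 1.5:.1f}x the 1.5 s SLO)")
```

Swapping in a different parameter count shows how directly prefill time scales with model size under this compute-bound approximation.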

Stage 1 · Two reusable layers — cache both (lesson 06, SGLang premise)

RAG looks per-request unique, but it has the two-layer prefix structure the SGLang workload lesson opens on:

Prompt layout, in order:
system + tools · ~1.5K tok · blue: identical across ALL requests → ~100% cache hit, prefill once.
retrieved top-10 chunks · ~6–8K tok · orange: shared across users hitting the same hot docs → 60–80% reusable.
query · ~80 tok · green: truly unique.

The payoff: if system (1.5K) is always cached and 70% of the 7K retrieved tokens hit the doc-KV cache, the uncached prefill drops to 7000·0.3 ≈ 2,100 tokens → on the 70B, 2100/2830 ≈ 0.74 s. Add 100 ms retrieval → TTFT ≈ 0.85 s. Now under budget — but only because the cache hits.
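Since the budget only clears when the cache hits, it is worth asking how low the hit rate can go. The sketch below inverts the same arithmetic to find the break-even doc-cache hit rate. It assumes the 70B prefill rate from stage 0, 7,000 retrieved tokens, 100 ms of retrieval, and the 3 ms first decode step used in the lesson's budget formula; like the paragraph above, it ignores the ~80 query tokens.

```python
PREFILL_TPS_70B = 990e12 * 0.4 / (2 * 70e9)  # about 2,830 tokens/s (stage 0)


def ttft_s(hit_rate, retrieved_tokens=7000, retrieval_s=0.10, first_decode_s=0.003):
    """TTFT when the system prefix is fully cached and hit_rate of the docs is cached."""
    uncached = retrieved_tokens * (1.0 - hit_rate)
    return retrieval_s + uncached / PREFILL_TPS_70B + first_decode_s


def min_hit_rate(slo_s=1.5, retrieved_tokens=7000, retrieval_s=0.10, first_decode_s=0.003):
    """Smallest doc-cache hit rate that still meets the TTFT SLO."""
    prefill_budget_s = slo_s - retrieval_s - first_decode_s
    max_uncached = prefill_budget_s * PREFILL_TPS_70B
    return max(0.0, 1.0 - max_uncached / retrieved_tokens)


print(f"TTFT at 70% hit rate: {ttft_s(0.7):.2f} s")
print(f"break-even hit rate:  {min_hit_rate():.0%}")
```

Under these assumptions the break-even sits in the low-to-mid 40% range, so the 60–80% reuse figure leaves some margin, but not a large one.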

Stage 2 · What doesn't cache — chunked prefill + a smaller model (lessons 04, 06)

The per-query unique tokens (the query, cold/cache-missed chunks) still prefill cold. Two levers, each justified by the wall: chunked prefill and a smaller model.

Stage 3 · Freshness / cache invalidation — the new wrinkle

The doc-KV cache from stage 1 is what makes the budget feasible. But documents are edited through the day. The instant a document changes, its cached KV is stale — attending to it serves an answer grounded in the old text. This is the classic cache-coherence problem, now playing out over KV tensors.

The fix is ordinary cache discipline made explicit:

Binding constraint, stage 3
Correctness vs. the hit rate that makes the latency budget feasible. A long cache TTL maximizes the doc-KV hit rate (and keeps TTFT under 1.5 s) but widens the window where you serve stale, wrong answers. A short TTL is always fresh but collapses the hit rate, pushing prefill — and TTFT — back over budget. There is no free setting: you tune the staleness window against the freshness SLA the product actually needs (a legal-docs corpus and a status-page corpus want very different windows). Frequently-edited docs simply should not be cached.
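One common way to make that discipline concrete is to key each cached KV block by chunk id plus a document version, and to bound the staleness window with a TTL. The sketch below is an illustration under assumptions, not the lesson's prescribed design: `DocKVCache`, its methods, and the idea that the document store supplies a version string (a revision id or content hash) are all hypothetical, and the `kv` payload stands in for real KV tensors.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    kv: Any            # placeholder for the cached KV tensors
    stored_at: float   # clock reading when the entry was written


class DocKVCache:
    """Hypothetical doc-KV cache keyed by (chunk_id, version) with a TTL."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries: Dict[Tuple[str, str], Entry] = {}

    def get(self, chunk_id: str, version: str) -> Optional[Any]:
        # A new version never matches an old key, so edited docs miss by construction.
        entry = self._entries.get((chunk_id, version))
        if entry is None:
            return None
        # The TTL bounds staleness when the version signal itself arrives late.
        if self.clock() - entry.stored_at > self.ttl_s:
            del self._entries[(chunk_id, version)]
            return None
        return entry.kv

    def put(self, chunk_id: str, version: str, kv: Any) -> None:
        self._entries[(chunk_id, version)] = Entry(kv, self.clock())

    def invalidate(self, chunk_id: str) -> None:
        # Explicit invalidation on an edit event drops every cached version.
        for key in [k for k in self._entries if k[0] == chunk_id]:
            del self._entries[key]
```

With version keys, correctness no longer depends on the TTL alone; the TTL becomes a backstop, and the hit rate falls only for documents that actually changed.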

Stage 4 · The retrieval subsystem, co-designed (lesson 05)

Retrieval is not free latency you inherit; it is a system you size, on its own, inside the shared 1.5 s budget. It has three pieces with different bottlenecks (see the embeddings/ANN material in the lesson "search · embeddings & ANN"):

| Component | Bottleneck | Sizing & latency |
| --- | --- | --- |
| Embedding service (encode the query) | throughput-bound encoder, short input | tiny model, batch the 500 req/s; ~5–15 ms |
| ANN index (vector search) | memory-resident vectors (HNSW/IVF) | RAM-sized to the corpus; ~10–40 ms |
| Reranker (score top-k) | small cross-encoder, compute-bound | rerank ~50 candidates; ~20–80 ms |

These scale independently of the LLM fleet — different hardware, different replica counts (the ANN index may be CPU/RAM-bound, not GPU-bound). They share only the latency budget, so the design rule is: keep the whole retrieval path under ~150 ms so prefill gets the lion's share of the 1.5 s. If retrieval creeps up, it directly steals headroom from prefill — the two subsystems are coupled only through that one number.
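A quick budget check for the retrieval path, using the latency ranges quoted for the three components. This is a sketch that assumes the stages run strictly in series and that the ranges are representative; a real system would track measured p95s rather than summing ranges.

```python
SLO_MS = 1500
RETRIEVAL_BUDGET_MS = 150

# (low, high) latency in ms per stage, taken from the component table.
stages = {
    "embed query": (5, 15),
    "ANN search": (10, 40),
    "rerank": (20, 80),
}

best = sum(lo for lo, _ in stages.values())
worst = sum(hi for _, hi in stages.values())

print(f"retrieval path: {best}-{worst} ms (budget {RETRIEVAL_BUDGET_MS} ms)")
print(f"headroom at worst case: {RETRIEVAL_BUDGET_MS - worst} ms")
print(f"left for prefill at worst case: {SLO_MS - worst} ms")
```

The reranker carries the largest share of the worst case, so it is the first place to look if the retrieval path starts to exceed its slice.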

Binding constraint, final
Prefill latency, gated by the doc-cache hit rate — with cache coherence as the tax you pay to keep that hit rate honest. Retrieval is a fixed ~150 ms slice; the variable that decides whether you clear 1.5 s is how much of the retrieved context you can serve from cached KV. So the system's real job, as in the code assistant (13), is maximizing hit rate — here complicated by documents that change underneath the cache.

Interactive · RAG TTFT budget

Split the 1.5 s budget between retrieval and prefill, then watch the doc-cache hit rate decide whether you are prefill-bound (cache hotter, go smaller) or retrieval-bound (the LLM is fine, optimize the index).

prefill_tps = 990e12·0.4/(2·N·1e9). System prefix always hits the cache (prefill-once); of the retrieved tokens, only the hit fraction is cached. Uncached = total − cached; prefill_time = uncached / prefill_tps; TTFT = retrieval + prefill_time + 3 ms. Order-of-magnitude per the series' ±30% contract; single H100-equivalent of compute.
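For readers without the interactive page, the formula above can be sketched as a plain function. The parameter names are illustrative, and the verdict here is only a pass/fail against the SLO; how the original widget words its verdict is not reproduced.

```python
def rag_ttft(model_b, system_tokens, retrieved_tokens, hit_rate,
             retrieval_s, slo_s=1.5, first_decode_s=0.003):
    """TTFT budget for one H100-equivalent; model_b is parameters in billions."""
    prefill_tps = 990e12 * 0.4 / (2 * model_b * 1e9)
    total = system_tokens + retrieved_tokens
    # System prefix always hits; only hit_rate of the retrieved tokens is cached.
    cached = system_tokens + retrieved_tokens * hit_rate
    uncached = total - cached
    prefill_s = uncached / prefill_tps
    ttft = retrieval_s + prefill_s + first_decode_s
    return {
        "uncached_tokens": uncached,
        "prefill_s": prefill_s,
        "ttft_s": ttft,
        "meets_slo": ttft <= slo_s,
    }


hot = rag_ttft(model_b=70, system_tokens=1500, retrieved_tokens=7000,
               hit_rate=0.7, retrieval_s=0.10)
cold = rag_ttft(model_b=70, system_tokens=1500, retrieved_tokens=7000,
                hit_rate=0.0, retrieval_s=0.10)
print(hot)
print(cold)
```

Varying `model_b` and `hit_rate` together shows the trade the lesson describes: a colder cache can be offset by a smaller model, and a hotter cache buys room for a larger one.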

Outputs: uncached tokens, prefill time, TTFT, verdict vs 1.5 s.

What this case teaches