The workload that demands a new framework
Modern LLM workloads are not "one prompt, one completion." They are programs — agent loops, tree searches, multi-sample RAG. Once you measure them, somewhere between 50% and 95% of the FLOPs spent are redundant prefill on a shared prefix. That redundancy is what SGLang is built to delete.
The model you have in your head is too small
Most serving frameworks were designed against the chat-completion benchmark: a user sends one prompt, the model emits one completion, the connection closes. That picture is sized for a chatbot and tells you nothing about the workloads people now actually run in production. The four patterns below dominate today's bills:
Quantify the waste, one workload at a time
Throwing terms around like "tree of thought" is unhelpful without numbers. Let's anchor each pattern to the bytes and FLOPs it actually spends.
Agent loop
A coding agent has a system prompt of ~2,500 tokens (tool definitions, formatting rules, examples) and runs ~30 turns per task. By turn 30 the conversation history is ~12K tokens. If the engine treats each turn as an independent request, it prefills the entire history from scratch on every turn.
The "useful" prefill (the part that wasn't already in the previous turn's KV) totals about 13,000 tokens; about 95% of the prefill is recompute. By the 2N rule of thumb (≈ 2 FLOPs per parameter per token), Llama-3-70B spends ~140 TFLOPs per 1,000 tokens, so that's ~33 PFLOPs of redundant compute per conversation: tens of GPU-seconds wasted on every conversation.
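The agent-loop arithmetic can be reproduced in a few lines. This is a minimal sketch, not a measurement: it assumes the context grows linearly from the system prompt to the final history length, and the function name `agent_loop_prefill` is hypothetical. Under that assumption the redundant fraction lands near the ~95% quoted above, while the absolute FLOP count comes out somewhat lower than the ~33 PFLOPs figure, because the total depends on how fast the history actually grows.

```python
def agent_loop_prefill(system_tokens: int, final_tokens: int, turns: int):
    """Prefill token counts for a stateless engine.

    Assumption (not from any benchmark): the context grows linearly from
    `system_tokens` on turn 1 to `final_tokens` on the last turn.
    """
    # Arithmetic series: every turn re-prefills the whole context so far.
    total = turns * (system_tokens + final_tokens) // 2
    # Floor for a prefix-sharing engine: each token is prefilled exactly once.
    useful = final_tokens
    redundant = total - useful
    return total, useful, redundant


total, useful, redundant = agent_loop_prefill(2_500, 12_000, 30)

# 2N rule of thumb: ~2 FLOPs per parameter per token, N = 70e9 parameters.
flops_per_token = 2 * 70e9
redundant_pflops = redundant * flops_per_token / 1e15

print(f"total prefill tokens:     {total:,}")
print(f"useful prefill tokens:    {useful:,}")
print(f"redundant fraction:       {redundant / total:.1%}")
print(f"redundant compute:        {redundant_pflops:.1f} PFLOPs")
```

Swapping in a measured per-turn context length for the linear assumption is the obvious next step if you have traces from a real agent.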
Tree of thought
Tree-of-Thought (Yao et al., 2023) and its descendants explore k branches per node and search for the best path. A typical configuration: 4 branches per node × 3 levels of depth × 8 samples per branch = 96 samples when one path is kept per level, all sharing the same first ~1.5K tokens of problem statement. Without prefix sharing the engine prefills the statement 96 times.
Self-consistency
Self-consistency (Wang et al., 2022) takes n=32 samples from one prompt and majority-votes the answer. This is the cleanest case: 1 unique prefix, 32 decode tracks. A naive engine still pays 32× the prefill cost. Even vLLM's automatic prefix cache only helps if you submit the 32 samples close enough in time that none have been evicted.
RAG
Retrieval-augmented generation looks per-request unique (the query varies), but examined more carefully it has at least two reusable layers: (1) the system+tools prefix, identical across all requests, and (2) the retrieved documents, which are shared across users hitting the same document. A heavy RAG service typically retrieves the same top-10 documents for many different queries — the second layer alone reuses 60–80% of its bytes.
Map the redundancy onto bytes
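The bytes-per-token KV math you already know (vLLM lesson 01) determines whether the redundancy is recoverable. For Llama-3-70B at GQA-8 (8 KV heads, head dimension 128, 80 layers, 2 bytes per element at fp16):
KV bytes per token = 2 (K and V) · n_kv_heads · head_dim · n_layers · bytes_per_element = 2 · 8 · 128 · 80 · 2 = 320 KB / token at fp16 → ~160 KB / token at fp8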
So 2,500 tokens of shared system prompt at fp16 occupies ≈ 800 MB of KV. If 100 active users all share that prompt, the choice is between holding 100 copies (80 GB, the entire HBM of an 80 GB H100) or 1 copy (800 MB) plus an indirection per attention call. The trade is not subtle.
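The KV sizing above is easy to check by hand or in code. The sketch below recomputes it from the model shape; the helper name `kv_bytes_per_token` is hypothetical, and the shape values are the ones used in the formula above. Note that the exact byte counts come out slightly different from the rounded "800 MB / 80 GB" figures, depending on whether you count in decimal or binary units.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """KV cache bytes for one token: K and V, for every layer and KV head."""
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_elem


# Llama-3-70B shape as used in the text: 80 layers, 8 KV heads, head_dim 128.
per_token_fp16 = kv_bytes_per_token(80, 8, 128, bytes_per_elem=2)
per_token_fp8 = kv_bytes_per_token(80, 8, 128, bytes_per_elem=1)

prompt_tokens = 2_500
users = 100
one_copy = per_token_fp16 * prompt_tokens
per_user_copies = one_copy * users

print(f"fp16: {per_token_fp16 / 1024:.0f} KiB/token, fp8: {per_token_fp8 / 1024:.0f} KiB/token")
print(f"one shared copy:   {one_copy / 2**20:.0f} MiB")
print(f"{users} private copies: {per_user_copies / 2**30:.1f} GiB")
```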
The compute side, briefly
KV bytes are only half the story. The other half: by the 2N-FLOPs rule, a prefill of 2,500 tokens on Llama-3-70B costs ≈ 2 × 70 × 10⁹ × 2,500 ≈ 350 TFLOPs end-to-end (across all layers). On an H100 sustaining ~700 TF/s of fp16 dense throughput that's about 0.5 s of GPU time for each redundant prefill. Multiply by users and turns and you're spending whole GPU-hours per day on KV the engine already computed minutes ago.
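As a quick check on the compute estimate, the sketch below applies the 2N rule directly. The function name `prefill_flops` and the daily count of redundant prefills are hypothetical, chosen only to show how the per-prefill cost scales into GPU-hours; the 700 TF/s sustained throughput is the figure assumed in the text.

```python
def prefill_flops(n_params: float, n_tokens: int) -> float:
    """2N rule of thumb: ~2 FLOPs per parameter per prefilled token."""
    return 2 * n_params * n_tokens


flops = prefill_flops(70e9, 2_500)          # one 2,500-token prefill on a 70B model
sustained_flops_per_s = 700e12              # assumed sustained fp16 dense throughput
seconds_per_prefill = flops / sustained_flops_per_s

# Hypothetical load: 10,000 redundant prefills of the same system prompt per day.
redundant_prefills_per_day = 10_000
gpu_hours_per_day = redundant_prefills_per_day * seconds_per_prefill / 3600

print(f"{flops / 1e12:.0f} TFLOPs per prefill")
print(f"{seconds_per_prefill:.2f} s of GPU time per prefill")
print(f"{gpu_hours_per_day:.2f} GPU-hours per day at {redundant_prefills_per_day:,} redundant prefills")
```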
Interactive · cost of a stateless engine
Set the workload shape. The widget shows the total prefill cost a stateless engine pays versus the floor a prefix-sharing engine pays. The gap is what SGLang is trying to close.
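The same comparison can be sketched offline. The code below is a simplified stand-in for that calculation, not the widget's actual logic: the `Workload` class and both cost functions are hypothetical, each request is modeled as one shared prefix plus some unique prompt tokens, and the self-consistency prompt length and the 200 unique tokens per user are made-up illustration values.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    shared_prefix_tokens: int  # tokens identical across all requests
    unique_tokens: int         # extra prompt tokens unique to each request
    requests: int              # number of requests sharing the prefix


def stateless_prefill(w: Workload) -> int:
    """Every request prefills the shared prefix and its own tokens."""
    return w.requests * (w.shared_prefix_tokens + w.unique_tokens)


def shared_prefill(w: Workload) -> int:
    """Floor: the shared prefix is prefilled once, unique tokens once each."""
    return w.shared_prefix_tokens + w.requests * w.unique_tokens


workloads = [
    Workload("tree of thought", 1_500, 0, 96),
    Workload("self-consistency", 1_000, 0, 32),         # prompt length is hypothetical
    Workload("shared system prompt", 2_500, 200, 100),  # 200 unique tokens is hypothetical
]

for w in workloads:
    naive, floor = stateless_prefill(w), shared_prefill(w)
    print(f"{w.name:22s} stateless={naive:>8,}  floor={floor:>7,}  ratio={naive / floor:.1f}x")
```

Setting `unique_tokens` to 0 for the sampling workloads ignores the per-branch text that a real tree search accumulates, so the ratios shown are upper bounds for that shape.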
What the framework needs to do
From the four patterns above, a small list of asks falls out:
1. Reuse KV across calls, not just within one batch. If two requests arrive 30 seconds apart with the same system prompt, the framework should serve them both off the same KV blocks.
2. Reuse KV across partial prefixes. Two RAG calls that share 1,800 tokens of system + tools but diverge on document choice should share those 1,800 tokens' KV, rather than be treated as different prefixes because the suffix differs.
3. Schedule with the cache in mind. If reordering arrivals lets more requests hit the cache, do it (within fairness limits). FCFS leaves throughput on the floor.
4. Express program structure to the runtime. If the client knows it's about to fork 32 branches off the same prefix, the runtime should not have to infer that; the client should be able to say it.
5. Constrain outputs at sample-time, not as a post-hoc parse. Most agent outputs are JSON or function calls. Masking illegal tokens during sampling is far cheaper than re-rolling on a parse failure.
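To make the first three asks concrete, here is a toy sketch of prefix matching and cache-aware ordering over token IDs. It is an uncompressed trie written for illustration only, not SGLang's RadixAttention implementation; the class `PrefixTrie`, its methods, and the token IDs are all hypothetical, and a real engine would store KV block references at the nodes and evict under memory pressure.

```python
class PrefixTrie:
    """Toy token-level trie: one dict node per cached token."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, tokens: list[int]) -> None:
        """Record that KV for this token sequence is now cached."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_len(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV could be reused."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            matched += 1
        return matched


# Two RAG-style requests share 1,800 tokens of system + tools, then diverge.
system_and_tools = list(range(1_800))
request_a = system_and_tools + [9_001, 9_002]
request_b = system_and_tools + [7_001, 7_002, 7_003]
request_c = [5_000, 5_001]  # unrelated prompt, no cached prefix

cache = PrefixTrie()
cache.insert(request_a)  # request A ran earlier, so its KV is cached

reused = cache.match_len(request_b)
print(f"request B reuses {reused} of {len(request_b)} prompt tokens")

# Cache-aware ordering: serve the request with the longest cached prefix first.
pending = [request_c, request_b]
ordered = sorted(pending, key=cache.match_len, reverse=True)
print("first served has", cache.match_len(ordered[0]), "cached tokens")
```

Sorting purely by match length ignores the fairness limits mentioned in the list; a real scheduler would bound how long a low-match request can be pushed back.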
Items 1–3 motivate RadixAttention and the cache-aware scheduler (lessons 04–05). Item 4 motivates the frontend DSL (lesson 02). Item 5 motivates the structured decoding stack (lessons 06–07). The rest of the series builds those mechanisms one at a time.
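For the structured-output ask in particular, the core idea of masking at sample time fits in a few lines. This is a minimal sketch with a five-token toy vocabulary and greedy selection; the function `masked_argmax`, the vocabulary, and the logits are hypothetical, and it does not reflect how any particular engine builds its allowed-token set from a grammar.

```python
import math


def masked_argmax(logits: list[float], allowed_ids: set[int]) -> int:
    """Greedy pick restricted to the token IDs the grammar allows next."""
    if not allowed_ids:
        raise ValueError("grammar allows no token at this position")
    # Illegal tokens get -inf so they can never be selected.
    masked = [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)


vocab = ["{", "}", '"name"', "hello", ":"]
logits = [0.1, 0.3, 0.2, 2.5, 0.0]  # the model prefers "hello"

# At the start of a JSON object, only "{" is legal.
allowed_at_start = {0}

unconstrained = max(range(len(logits)), key=logits.__getitem__)
constrained = masked_argmax(logits, allowed_at_start)

print("unconstrained pick:", vocab[unconstrained])
print("constrained pick:  ", vocab[constrained])
```

The unconstrained pick would fail a JSON parse and force a retry, while the constrained pick costs only one pass over the logits.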