Cache-aware scheduling — order is the optimization
A radix tree is a passive index. The scheduler is what turns it into a throughput win. By admitting requests in the order that maximizes prefix overlap, SGLang can convert the tree's capability for sharing into actual hit rate — often 30–60 percentage points above first-come-first-served.
The shape of the win
Suppose 32 requests arrive simultaneously and 24 of them share a 2,000-token system prompt while 8 are unrelated. The radix tree can in principle hold all 32 prefixes; the question is whether the scheduler ever lets it. Three scheduling policies, three outcomes:
| Policy | Admission order | What the tree sees | Effective hit rate |
|---|---|---|---|
| FCFS | arrival order | random walk over prompts | ~50% (depends on luck) |
| Shortest job first | by predicted output length | still random over prefixes | ~50% |
| Longest-prefix-match (LPM) | by depth of cache hit | shared-prefix requests first | >95% |
The reason LPM dominates is mechanical: the first request through a fresh tree pays the full prefill cost; the rest piggyback. Letting the cache-hitting requests run first means everyone benefits, including the ones that would have missed.
The scheduling problem, formally
At each step the engine has:
- A waiting queue of admitted-but-not-running requests, each with a length and a cached prefix length (from a lookup in the radix tree).
- A running set currently in the continuous-batch loop.
- A KV budget: free blocks in the pool.
The decision: which waiting requests to admit into the running set this step. Constraints: the running set's KV must fit in the budget; some fairness floor (no request waits forever); some prefill-throughput cap (so prefill chunks don't starve decodes).
The objective: maximize aggregate useful tokens generated per GPU-second, where "useful" means tokens that aren't redundant prefill. This collapses, roughly, to:
The cache term is the entire knob the scheduler has — every other term is workload-given. Maximize it by picking, from the waiting queue, the requests whose cached_prefix_len is deepest.
The LPM-first algorithm
def schedule_one_step(waiting, running, kv_budget, tree):
# 1. for each waiting request, look up its deepest cache hit
for r in waiting:
r.cached_len, r.node = tree.lookup(r.tokens)
# 2. sort by cached_len descending (LPM-first)
waiting.sort(key=lambda r: r.cached_len, reverse=True)
# 3. greedy-admit while KV budget holds; aged requests bypass the budget
admitted = []
for r in waiting:
new_blocks = blocks_needed(r) - r.node.cached_blocks # what we have to allocate
if r.wait_age > FAIRNESS_T: # always admit aged requests
admitted.append(r); kv_budget -= new_blocks
elif new_blocks <= kv_budget: # otherwise gate on budget
admitted.append(r); kv_budget -= new_blocks
return admitted
Three pieces beyond a simple sort:
- "Cached blocks" is subtracted from the budget. An LPM-first request that hits 2,000 tokens of cache costs the pool only the new suffix blocks — the cached blocks already live in the tree and have their refcounts bumped, not freshly allocated.
- Greedy admission has a fairness escape. Any request waiting more than
FAIRNESS_Tjumps the queue. Without this, a hot prefix can starve cold requests forever. - Eviction reverses the order. When the pool is tight, evict leaves whose subtrees have lowest aggregate hit benefit (cold prefixes). The LPM-first admit and LPM-aware evict cooperate.
The fairness ↔ hit-rate trade-off
SGLang ships with FAIRNESS_T tuned for typical online workloads (a few hundred milliseconds in default config). For offline batch workloads — where there is no SLA — you can raise it freely and push hit rates above 95%. The right answer depends on whether you are pricing latency or throughput.
Why FCFS leaves so much on the floor — a concrete trace
Set up: 6 requests, 2 system prompts (A: 2,000 tokens · B: 1,500 tokens), arrival order alternates A,B,A,B,A,B. KV budget allows 4 concurrent.
How the scheduler talks to the tree
Three queries on the tree from the scheduler, two updates:
| Call | Purpose | Cost |
|---|---|---|
tree.lookup(tokens) | For ranking: how deep does this request hit? | O(T) compares |
tree.evictable_size() | For admission: how many blocks could the pool reclaim if needed? | O(1) cached field |
tree.protect(node) | For correctness: pin a node during a step so eviction won't free it. | O(1) |
tree.insert(tokens, blocks) | After prefill: add the new suffix to the tree. | O(suffix length) |
tree.release(leaf) | On request completion: decrement refcounts. | O(depth) |
The scheduler is essentially a producer for these calls, plus the LPM-sort. SGLang's implementation lives in python/sglang/srt/managers/scheduler.py and tree_cache.py — about a thousand lines total. Worth reading after this lesson.
Continuous batching, briefly
All of this happens inside an iteration-level batched loop (Orca / vLLM lesson 04). At each decode step:
- Compute decode for currently-running requests (one token each).
- Pick up to
max_prefill_tokensworth of waiting requests, prefill them in the same forward pass (chunked prefill, vLLM lesson 10). - Drop finished requests from the running set; release their KV via the tree.
- Repeat.
Cache-aware scheduling only changes step 2 — which waiting requests are picked up next. The continuous-batch machinery is unchanged.
What can go wrong
- Hot-prefix starvation. Without FAIRNESS_T, a 10-user shared prompt starves single-user prompts. The fairness bump is not optional.
- Tree blow-up on adversarial inputs. If every incoming prompt has a unique prefix, the tree degenerates to one leaf per request. Throughput equals "vLLM without APC" — no worse, no better.
- Eviction thrash. Tight KV budget + churn over many small prefixes can cause LPM benefits to flap. Cure: increase pool size (gpu-memory-utilization), or accept lower hit rate.
- p99 wait under load shock. A burst of arrivals with no shared prefix forces the scheduler to admit serially. SGLang doesn't beat physics; it beats waste.
Interactive · LPM-first vs FCFS
Simulate a stream of requests with adjustable shared-prefix rate. Compare hit rate and average wait time under FCFS vs LPM-first with bounded reordering.