Cache-aware scheduling — order is the optimization

A radix tree is a passive index. The scheduler is what turns it into a throughput win. By admitting requests in the order that maximizes prefix overlap, SGLang can convert the tree's capability for sharing into actual hit rate — often 30–60 percentage points above first-come-first-served.

The shape of the win

Suppose 32 requests arrive simultaneously and 24 of them share a 2,000-token system prompt while 8 are unrelated. The radix tree can in principle hold all 32 prefixes; the question is whether the scheduler ever lets it. Three scheduling policies, three outcomes:

Policy	Admission order	What the tree sees	Effective hit rate
FCFS	arrival order	random walk over prompts	~50% (depends on luck)
Shortest job first	by predicted output length	still random over prefixes	~50%
Longest-prefix-match (LPM)	by depth of cache hit	shared-prefix requests first	>95%

The reason LPM dominates is mechanical: the first request through a fresh tree pays the full prefill cost; the rest piggyback. Letting the cache-hitting requests run first means everyone benefits, including the ones that would have missed.

The scheduling problem, formally

At each step the engine has:

A waiting queue of admitted-but-not-running requests, each with a length and a cached prefix length (from a lookup in the radix tree).
A running set currently in the continuous-batch loop.
A KV budget: free blocks in the pool.

The decision: which waiting requests to admit into the running set this step. Constraints: the running set's KV must fit in the budget; some fairness floor (no request waits forever); some prefill-throughput cap (so prefill chunks don't starve decodes).

The objective: maximize aggregate useful tokens generated per GPU-second, where "useful" means tokens that aren't redundant prefill. This collapses, roughly, to:

maximize ∑_r∈running (input_len(r) − cached_prefix_len(r)) + decode_tokens(r)

The cache term is the entire knob the scheduler has — every other term is workload-given. Maximize it by picking, from the waiting queue, the requests whose cached_prefix_len is deepest.

The LPM-first algorithm

def schedule_one_step(waiting, running, kv_budget, tree):
    # 1. for each waiting request, look up its deepest cache hit
    for r in waiting:
        r.cached_len, r.node = tree.lookup(r.tokens)

    # 2. sort by cached_len descending (LPM-first)
    waiting.sort(key=lambda r: r.cached_len, reverse=True)

    # 3. greedy-admit while KV budget holds; aged requests bypass the budget
    admitted = []
    for r in waiting:
        new_blocks = blocks_needed(r) - r.node.cached_blocks  # what we have to allocate
        if r.wait_age > FAIRNESS_T:                # always admit aged requests
            admitted.append(r); kv_budget -= new_blocks
        elif new_blocks <= kv_budget:              # otherwise gate on budget
            admitted.append(r); kv_budget -= new_blocks

    return admitted

Three pieces beyond a simple sort:

"Cached blocks" is subtracted from the budget. An LPM-first request that hits 2,000 tokens of cache costs the pool only the new suffix blocks — the cached blocks already live in the tree and have their refcounts bumped, not freshly allocated.
Greedy admission has a fairness escape. Any request waiting more than FAIRNESS_T jumps the queue. Without this, a hot prefix can starve cold requests forever.
Eviction reverses the order. When the pool is tight, evict leaves whose subtrees have lowest aggregate hit benefit (cold prefixes). The LPM-first admit and LPM-aware evict cooperate.

The fairness ↔ hit-rate trade-off

SGLang ships with FAIRNESS_T tuned for typical online workloads (a few hundred milliseconds in default config). For offline batch workloads — where there is no SLA — you can raise it freely and push hit rates above 95%. The right answer depends on whether you are pricing latency or throughput.

Why FCFS leaves so much on the floor — a concrete trace

Set up: 6 requests, 2 system prompts (A: 2,000 tokens · B: 1,500 tokens), arrival order alternates A,B,A,B,A,B. KV budget allows 4 concurrent.

How the scheduler talks to the tree

Three queries on the tree from the scheduler, two updates:

Call	Purpose	Cost
`tree.lookup(tokens)`	For ranking: how deep does this request hit?	O(T) compares
`tree.evictable_size()`	For admission: how many blocks could the pool reclaim if needed?	O(1) cached field
`tree.protect(node)`	For correctness: pin a node during a step so eviction won't free it.	O(1)
`tree.insert(tokens, blocks)`	After prefill: add the new suffix to the tree.	O(suffix length)
`tree.release(leaf)`	On request completion: decrement refcounts.	O(depth)

The scheduler is essentially a producer for these calls, plus the LPM-sort. SGLang's implementation lives in python/sglang/srt/managers/scheduler.py and tree_cache.py — about a thousand lines total. Worth reading after this lesson.

Continuous batching, briefly

All of this happens inside an iteration-level batched loop (Orca / vLLM lesson 04). At each decode step:

Compute decode for currently-running requests (one token each).
Pick up to max_prefill_tokens worth of waiting requests, prefill them in the same forward pass (chunked prefill, vLLM lesson 10).
Drop finished requests from the running set; release their KV via the tree.
Repeat.

Cache-aware scheduling only changes step 2 — which waiting requests are picked up next. The continuous-batch machinery is unchanged.

What can go wrong

Failure modes worth knowing

Hot-prefix starvation. Without FAIRNESS_T, a 10-user shared prompt starves single-user prompts. The fairness bump is not optional.
Tree blow-up on adversarial inputs. If every incoming prompt has a unique prefix, the tree degenerates to one leaf per request. Throughput equals "vLLM without APC" — no worse, no better.
Eviction thrash. Tight KV budget + churn over many small prefixes can cause LPM benefits to flap. Cure: increase pool size (gpu-memory-utilization), or accept lower hit rate.
p99 wait under load shock. A burst of arrivals with no shared prefix forces the scheduler to admit serially. SGLang doesn't beat physics; it beats waste.

Interactive · LPM-first vs FCFS

Simulate a stream of requests with adjustable shared-prefix rate. Compare hit rate and average wait time under FCFS vs LPM-first with bounded reordering.