Case study — an agentic tool-use platform
Every prior case bound on a property of a single model call: the code assistant (13) on prefill latency, the chatbot (14) on concurrency, the RAG assistant (15) on retrieval + long-context prefill. Here the wall moves up a level. A request is no longer one call — it is a sequence of them, and the binding constraint is end-to-end task latency built from sequential fan-out and tail amplification (lesson 03), compounded by statefulness: a task is pinned to its KV. This is a reliability-and-scheduling problem wearing a serving problem's clothes — not a throughput one.
Run the loop. Before reading on: what binds — TTFT, TPOT, memory, throughput? None of them directly. The wall is a property of the composition of calls, and it forces a stateful design.
Stage 0 · Sequential fan-out and tail amplification (lesson 03)
The calls are sequential, not parallel — call k needs call k−1's output (the model must see the tool result before it reasons again). So task latency is a sum, not a max:
A sum of n calls inherits the worst of them. Lesson 03's fan-out result applies directly: the chance the task hits at least one p99-slow call is 1 − 0.99ⁿ. At n = 20 that is 1 − 0.99²⁰ ≈ 18%. Nearly one task in five trips a p99 call — so your call-level p99 IS your task-level typical case. Designing to call-level median is self-deception; the task sees the tail every time.
Stage 1 · Statefulness & KV residency (lessons 05, 13)
Now the redemption. Each call's context is the previous call's context plus the new tool output — i.e. call k's prompt is call k−1's prompt as a prefix. This is the exact prefix-caching premise from the code assistant (13) and SGLang (SGLang 04), but along the time axis of one task. With the history KV resident, call k re-prefills only the new tool-output + reasoning tokens (a few hundred), not the whole 30K history:
That is a ~100× collapse in the prefill component of every call after the first. But it is conditional on two things, both from lesson 05:
- Session-sticky / cache-aware routing. Every call of a task must land on the replica holding that task's KV. Round-robin would give a cold 30K prefill on most calls and resurrect the ~2.7 s penalty — multiplied by n. Route by task-id, not load. Link SGLang 05.
- KV residency for the task's lifetime. A 30K-token history at 70B GQA-8 (~320 KB/token, lesson 02/13) is 30000 · 320 KB ≈ 9.6 GB of KV per active task — and it must stay pinned for the whole 30–60 s the task runs, not just one call. That number, not FLOPs, sets how many tasks fit a replica.
Stage 2 · Tool latency doesn't overlap — so multiplex across tasks (lesson 04)
While a sandboxed tool runs (100 ms – several s), generation for that task is blocked on a sequential dependency. If a replica served one task, the GPU would sit idle for that entire tool window — pure waste. The fix is the same continuous batching from lesson 04, but applied across tasks: the replica decodes other tasks while this one waits on its tool. A replica is a multiplexer over many tasks at different stages (some thinking, most waiting on tools).
This is why high task-concurrency per replica is good for utilization even though each task is mostly idle: with tool windows averaging, say, 2 s against ~1 s of active generation per call, a task spends ~⅔ of its life waiting — so you need ~3× the concurrent tasks just to keep the GPU busy. The 16-task residency cap from stage 1 and this multiplexing requirement are in direct tension: residency limits how many tasks you can keep warm, and you need many warm tasks to stay utilized. That tension is the heart of sizing this fleet.
| Lever | Effect on task latency | Effect on utilization / cost |
|---|---|---|
| Prefix-cache the history (sticky routing) | ~100× per-call prefill cut | frees compute, but pins 9.6 GB KV/task |
| Multiplex tasks across tool waits | none (per-task) | recovers the idle GPU during tool exec |
| Cut number of calls n | linear and tail reduction | fewer KV-pinned seconds per task |
Stage 3 · Failure & statefulness is the hard part (lesson 11)
Stateless serving (cases 13–15) lets you treat replicas as cattle: drain, kill, reroute, autoscale freely. Statefulness breaks all of that.
- Failover blast radius. If a sticky replica dies mid-task, every task pinned there loses its KV and its progress. Recovery means either re-prefilling the entire 30K history (~2.7 s of compute per task, times all pinned tasks) or restoring from a checkpoint of task state (the message/tool-call transcript) and re-prefilling from there.
- Autoscaling can't drain instantly. You cannot scale a replica down while it holds live 30–60 s tasks; you must stop routing new tasks to it and wait for the in-flight ones to finish — a drain that takes as long as the longest task. Scale-down is slow by construction.
- Checkpoint the transcript, not the KV. KV is huge (9.6 GB) and replica-local; the transcript (system prompt + messages + tool outputs) is small and portable. Persist the transcript per task so any replica can rehydrate by re-prefilling — trading compute for recoverability.
Stage 4 · Levers to cut calls and tighten the tail (lessons 06, 13)
Stage 0 said the only levers are fewer calls and lower call p99. Concretely:
- Speculative decoding for the short tool-call / JSON outputs that dominate agent turns. Agent calls are small-batch and latency-critical — exactly where spec decoding pays (lesson 06): the bandwidth-bound decode of a 30-token tool call gets a draft model's free ride.
- Constrained / grammar decoding so every tool call is syntactically valid the first time. An invalid call is a wasted round-trip — it adds a call to n, the most expensive thing you can do. Fewer retries directly shrinks the fan-out.
- Tiered models (like 13): a smaller/faster model for routine steps (formatting, simple tool dispatch), the 70B only for the hard reasoning steps. Most of n calls are routine; serving them on a 7B cuts both their latency and their KV footprint.
Interactive · task latency & the tail
Build a task out of calls and watch the tail. Turn prefix caching off and see re-prefill of the growing history wreck both latency and cost; raise the call count and watch 1 − 0.99ⁿ push the p95 estimate past 60 s even when the median looks fine.
What this case teaches
- The wall moved up a level — to composition. A task isn't one call; it's a sequential sum, so it inherits per-call p99 via lesson 03's 1 − 0.99ⁿ. The only levers on task latency are fewer calls and a tighter call-level tail. Validate at call p99, not median.
- Statefulness is the design. Prefix-caching the agent history is a ~100× win conditional on sticky routing + KV residency (~9.6 GB/task), which caps task concurrency and turns failover, drain, and autoscale into the hard problems (lessons 05, 11) — unlike the stateless cases 13–15.
- Multiplex tasks across tool waits for utilization. Sequential tool latency idles a task's GPU; continuous batching across many tasks (lesson 04) recovers it — so high per-replica task concurrency is good even though each task mostly waits, in tension with the residency cap.
- Contrast. Same ordinary 70B as the capstone; because the workload is a sequence of stateful calls with non-overlapping tool waits, the binding wall is end-to-end tail + statefulness — a scheduling/reliability problem, not the throughput or TTFT walls of every prior case.