Case study — an agentic tool-use platform

Every prior case bound on a property of a single model call: the code assistant (13) on prefill latency, the chatbot (14) on concurrency, the RAG assistant (15) on retrieval + long-context prefill. Here the wall moves up a level. A request is no longer one call — it is a sequence of them, and the binding constraint is end-to-end task latency built from sequential fan-out and tail amplification (lesson 03), compounded by statefulness: a task is pinned to its KV. This is a reliability-and-scheduling problem wearing a serving problem's clothes — not a throughput one.

The brief

A platform running autonomous agents. A task = 10–30 sequential model calls (think → call tool → observe → think …), each call's context = system prompt + growing history + tool outputs (history reaches 20–40K tokens by the end). Tools run in sandboxes (exec, web, code) with their own latency, 100 ms – several seconds. No human watches mid-task, so nothing streams to a person and token-level SLOs are irrelevant. The SLO is task p95 E2E < 60 s, with tasks/hour as the throughput metric. Model: an ordinary 70B. Design the platform.

Run the loop. Before reading on: what binds — TTFT, TPOT, memory, throughput? None of them directly. The wall is a property of the composition of calls, and it forces a stateful design.

Stage 0 · Sequential fan-out and tail amplification (lesson 03)

The calls are sequential, not parallel — call k needs call k−1's output (the model must see the tool result before it reasons again). So task latency is a sum, not a max:

T_task = Σ (call_latency_k + tool_latency_k), k = 1 … n

A sum of n calls is exposed to the tail of every one of them. Lesson 03's fan-out result applies directly: assuming independent calls, the chance the task hits at least one p99-slow call is 1 − 0.99ⁿ. At n = 20 that is 1 − 0.99²⁰ ≈ 18%. Nearly one task in five trips a p99 call, and 18% is well above the 5% of tasks that define a p95 SLO, so the task p95 is set by the call-level p99, not the call-level median. Designing to the call-level median is self-deception.
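The 1 − 0.99ⁿ figure is easy to tabulate. The sketch below assumes calls are independent draws from the same latency distribution; the function name `p_hits_tail` is illustrative, not from the lesson.

```python
def p_hits_tail(n: int, q: float = 0.99) -> float:
    """Probability that at least one of n independent calls lands beyond its q-quantile."""
    return 1.0 - q ** n


# Tabulate across the brief's range of 10-30 calls per task.
for n in (10, 20, 30):
    print(f"n={n:2d}  P(at least one p99-slow call) = {p_hits_tail(n):.1%}")
```

At n = 20 this reproduces the ≈ 18% above, and the probability keeps climbing as tasks get longer.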

Binding constraint, stage 0

Sequential fan-out + tail amplification. Because latencies add and the tail compounds, the only two levers on task latency are (a) cut the number of calls n and (b) tighten per-call p99 (not median). Everything else in this design serves one of those two. A median-fast / p99-slow call distribution is the failure mode — you must design and validate at call p99.
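To see why validation has to happen at call p99, here is a deliberately crude sketch. It assumes a two-point call-latency model (each call is either "typical" at the median or "slow" at the p99 value, independently with probability 1%) and computes the exact task quantile from the binomial distribution. The model, the function name, and the example numbers (800 ms median, 5 s p99, 1.5 s tool) are illustrative assumptions; other estimators give different absolute numbers.

```python
from math import comb


def task_quantile_s(n: int, median_s: float, p99_s: float, tool_s: float,
                    q: float = 0.95, p_slow: float = 0.01) -> float:
    """Task-latency q-quantile under a two-point call model (an assumption):
    each of n sequential calls costs median_s, or p99_s with probability p_slow."""
    base = n * (median_s + tool_s)          # every call pays median + tool
    excess = p99_s - median_s               # extra cost of one slow call
    cumulative = 0.0
    for k in range(n + 1):                  # k = number of slow calls in the task
        cumulative += comb(n, k) * p_slow ** k * (1 - p_slow) ** (n - k)
        if cumulative >= q:
            return base + k * excess
    return base + n * excess


print(task_quantile_s(5, 0.8, 5.0, 1.5))    # short task: p95 has no slow call
print(task_quantile_s(20, 0.8, 5.0, 1.5))   # 20 calls: p95 includes one slow call
```

Under these assumptions the p95 of a 5-call task contains no slow call, while the p95 of a 20-call task always contains one. That jump is the tail amplification the constraint describes.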

Stage 1 · Statefulness & KV residency (lessons 05, 13)

Now the redemption. Each call's context is the previous call's context plus the new tool output — i.e. call k's prompt is call k−1's prompt as a prefix. This is the exact prefix-caching premise from the code assistant (13) and SGLang (SGLang 04), but along the time axis of one task. With the history KV resident, call k re-prefills only the new tool-output + reasoning tokens (a few hundred), not the whole 30K history:

prefill_warm ≈ new_tokens / 11,300 tok/s ≈ 300/11,300 ≈ 27 ms vs. prefill_cold ≈ 30000/11,300 ≈ 2.7 s
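The same comparison as a quick numeric check, using only the figures from the line above (11,300 tok/s prefill throughput, 300 new tokens, 30K history). The helper name is illustrative.

```python
PREFILL_TOK_PER_S = 11_300  # prefill throughput used in the text


def prefill_seconds(tokens: int, tok_per_s: float = PREFILL_TOK_PER_S) -> float:
    """Time to prefill `tokens` at a fixed throughput (ignores batching effects)."""
    return tokens / tok_per_s


warm = prefill_seconds(300)      # only the new tool output + reasoning tokens
cold = prefill_seconds(30_000)   # the whole history, no resident KV
print(f"warm {warm * 1e3:.0f} ms, cold {cold:.2f} s, ratio {cold / warm:.0f}x")
```

The ratio is simply 30,000 / 300, independent of the throughput figure.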

That is a ~100× collapse in the prefill component of every call after the first. But it is conditional on two things, both from lesson 05:

Session-sticky / cache-aware routing. Every call of a task must land on the replica holding that task's KV. Round-robin would give a cold 30K prefill on most calls and resurrect the ~2.7 s penalty, multiplied by n. Route by task-id, not load (see SGLang 05).
KV residency for the task's lifetime. A 30K-token history at 70B GQA-8 (~320 KB/token, lesson 02/13) is 30000 · 320 KB ≈ 9.6 GB of KV per active task — and it must stay pinned for the whole 30–60 s the task runs, not just one call. That number, not FLOPs, sets how many tasks fit a replica.
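One way to route by task-id without a lookup table is rendezvous (highest-random-weight) hashing. This is a minimal sketch of that technique, not the routing any particular server implements; the replica names and `pick_replica` are hypothetical, and a real router would also have to respect the per-replica residency limit.

```python
import hashlib


def pick_replica(task_id: str, replicas: list[str]) -> str:
    """Rendezvous hashing: the same task_id always maps to the same replica,
    and removing some other replica does not move the task."""
    def score(replica: str) -> int:
        digest = hashlib.sha256(f"{task_id}|{replica}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(replicas, key=score)


replicas = ["replica-a", "replica-b", "replica-c"]  # hypothetical fleet
home = pick_replica("task-42", replicas)
print(home)

# Every later call of the task lands on the same replica, so its KV stays warm.
assert all(pick_replica("task-42", replicas) == home for _ in range(20))
```

The useful property for this platform is stability: a task only moves if its own replica leaves the fleet, which is exactly the failover case discussed later.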

Binding constraint, stage 1

KV residency per long-lived task. At ~9.6 GB/task, a TP-4 replica with ~160 GB of KV room (capstone, lesson 12) holds only ~16 concurrent tasks before eviction would force ruinous re-prefills. Residency, not compute, caps task concurrency — and routing must be sticky to use it at all.
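The residency arithmetic, spelled out with the text's own figures (~320 KB of KV per token, ~160 GB of KV room per replica, decimal units). Integer bytes are used throughout to avoid floating-point surprises in the floor division; the function names are illustrative.

```python
KV_BYTES_PER_TOKEN = 320_000  # ~320 KB/token, the text's figure for 70B GQA-8


def kv_gb_per_task(history_tokens: int) -> float:
    """KV footprint of one task's history, in decimal GB."""
    return history_tokens * KV_BYTES_PER_TOKEN / 1e9


def tasks_per_replica(kv_room_gb: int, history_tokens: int) -> int:
    """How many task histories fit in the replica's KV room at once."""
    return (kv_room_gb * 10**9) // (history_tokens * KV_BYTES_PER_TOKEN)


for tokens in (20_000, 30_000, 40_000):
    print(f"{tokens:6d} tokens: {kv_gb_per_task(tokens):5.1f} GB/task, "
          f"{tasks_per_replica(160, tokens)} tasks per 160 GB replica")
```

The cap moves with history length: the 20–40K range in the brief changes how many tasks fit on the same replica by roughly a factor of two.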

Stage 2 · Tool latency doesn't overlap — so multiplex across tasks (lesson 04)

While a sandboxed tool runs (100 ms – several s), generation for that task is blocked on a sequential dependency. If a replica served one task, the GPU would sit idle for that entire tool window — pure waste. The fix is the same continuous batching from lesson 04, but applied across tasks: the replica decodes other tasks while this one waits on its tool. A replica is a multiplexer over many tasks at different stages (some thinking, most waiting on tools).

This is why high task-concurrency per replica is good for utilization even though each task is mostly idle: with tool windows averaging, say, 2 s against ~1 s of active generation per call, a task spends ~⅔ of its life waiting — so you need ~3× the concurrent tasks just to keep the GPU busy. The 16-task residency cap from stage 1 and this multiplexing requirement are in direct tension: residency limits how many tasks you can keep warm, and you need many warm tasks to stay utilized. That tension is the heart of sizing this fleet.
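The duty-cycle reasoning above, as a small calculation. It uses the paragraph's example values (2 s tool window, ~1 s of generation per call) and the 16-task residency cap from stage 1; the helper names are illustrative, and the model assumes tasks are at independent, uniformly spread points in their think/wait cycle.

```python
def waiting_fraction(gen_s: float, tool_s: float) -> float:
    """Share of a task's lifetime spent blocked on tools."""
    return tool_s / (gen_s + tool_s)


def tasks_per_active_slot(gen_s: float, tool_s: float) -> float:
    """Concurrent tasks needed so that, on average, one is generating."""
    return (gen_s + tool_s) / gen_s


gen_s, tool_s = 1.0, 2.0
resident_tasks = 16  # residency cap from stage 1

print(f"waiting fraction: {waiting_fraction(gen_s, tool_s):.2f}")
print(f"tasks per active slot: {tasks_per_active_slot(gen_s, tool_s):.1f}")
print(f"avg tasks generating at once: "
      f"{resident_tasks / tasks_per_active_slot(gen_s, tool_s):.1f}")
```

With 16 resident tasks, only about a third are generating at any instant, which is the tension in numbers: the residency cap bounds the effective decode batch well below the number of tasks held warm.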

Lever	Effect on task latency	Effect on utilization / cost
Prefix-cache the history (sticky routing)	~100× per-call prefill cut	frees compute, but pins 9.6 GB KV/task
Multiplex tasks across tool waits	none (per-task)	recovers the idle GPU during tool exec
Cut number of calls n	linear cut in the sum, plus lower 1 − 0.99ⁿ tail exposure	fewer KV-pinned seconds per task

Stage 3 · Failure & statefulness is the hard part (lesson 11)

Stateless serving (cases 13–15) lets you treat replicas as cattle: drain, kill, reroute, autoscale freely. Statefulness breaks all of that.

Failover blast radius. If a sticky replica dies mid-task, every task pinned there loses its KV, and without a checkpoint it loses its progress too. Recovery means either restarting each task from its first call, or restoring a checkpoint of task state (the message/tool-call transcript) on another replica and re-prefilling the entire 30K history from it (~2.7 s of compute per task, times all pinned tasks).
Autoscaling can't drain instantly. You cannot scale a replica down while it holds live 30–60 s tasks; you must stop routing new tasks to it and wait for the in-flight ones to finish — a drain that takes as long as the longest task. Scale-down is slow by construction.
Checkpoint the transcript, not the KV. KV is huge (9.6 GB) and replica-local; the transcript (system prompt + messages + tool outputs) is small and portable. Persist the transcript per task so any replica can rehydrate by re-prefilling — trading compute for recoverability.
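The three points above imply a simple recovery contract: persist the transcript after every step, and let any replica rebuild the prompt from it. The sketch below illustrates that idea only. The `Transcript` class, the in-memory `store` dict (standing in for a durable store), and the prompt format are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Transcript:
    """Portable task state: small, serializable, replica-independent."""
    task_id: str
    system_prompt: str
    messages: list = field(default_factory=list)  # [{"role": ..., "content": ...}]

    def append(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def dumps(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def loads(cls, blob: str) -> "Transcript":
        return cls(**json.loads(blob))

    def to_prompt(self) -> str:
        """Text a fresh replica would re-prefill to rebuild the KV."""
        turns = [f"{m['role']}: {m['content']}" for m in self.messages]
        return "\n".join([self.system_prompt, *turns])


store: dict[str, str] = {}  # hypothetical stand-in for a durable key-value store

t = Transcript("task-42", "You are an agent.")
t.append("assistant", '{"tool": "exec", "args": {"cmd": "ls"}}')
t.append("tool", "README.md\nsrc/")
store[t.task_id] = t.dumps()            # checkpoint after each observed tool result

restored = Transcript.loads(store["task-42"])  # on any replica, after a failure
print(restored.to_prompt())
```

The checkpoint is a few kilobytes of JSON rather than gigabytes of KV; the price is the cold re-prefill on whichever replica picks the task up.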

Binding constraint, stage 3

Statefulness removes the operational freedom that stateless serving relies on. Sticky routing, slow drains, and a failover blast radius proportional to tasks-per-replica are the real engineering cost of this platform, and they are harder to get right than any kernel optimization. The design question stops being "how fast is a call?" and becomes "what happens to a 45-second task when its replica dies at second 40?"

Stage 4 · Levers to cut calls and tighten the tail (lessons 06, 13)

Stage 0 said the only levers are fewer calls and lower call p99. Concretely:

Speculative decoding for the short tool-call / JSON outputs that dominate agent turns. Agent calls are small-batch and latency-critical, which is exactly where spec decoding pays (lesson 06): decode is bandwidth-bound, so letting a draft model propose tokens that the 70B verifies in one pass cuts the number of target-model steps needed for a 30-token tool call.
Constrained / grammar decoding so every tool call is syntactically valid the first time. An invalid call is a wasted round-trip: it adds a call to n, the most expensive thing you can do. Fewer retries directly shrink the fan-out.
Tiered models (like 13): a smaller/faster model for routine steps (formatting, simple tool dispatch), the 70B only for the hard reasoning steps. Most of the n calls are routine; serving them on a 7B cuts both their latency and their KV footprint.
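The cost of invalid tool calls can be put in numbers. This sketch assumes each step's tool call is independently invalid with a fixed probability and is retried until valid, so the attempts per step follow a geometric distribution; the 5% rate is a made-up example, not a measured figure.

```python
def expected_calls(n_steps: int, invalid_rate: float) -> float:
    """Expected model calls when each step is retried until its tool call is valid."""
    if not 0.0 <= invalid_rate < 1.0:
        raise ValueError("invalid_rate must be in [0, 1)")
    return n_steps / (1.0 - invalid_rate)


def p_hits_p99(n_calls: float) -> float:
    """Chance that at least one of n_calls independent calls is p99-slow."""
    return 1.0 - 0.99 ** n_calls


for rate in (0.0, 0.05):  # 0.0 models grammar-constrained decoding
    n_eff = expected_calls(20, rate)
    print(f"invalid rate {rate:.0%}: ~{n_eff:.1f} calls, "
          f"P(p99 hit) = {p_hits_p99(n_eff):.1%}")
```

Every retry pays twice: once in the sum and once in the tail exposure, which is why constrained decoding counts as a latency lever rather than a correctness nicety.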

Interactive · task latency & the tail

Build a task out of calls and watch the tail. Turn prefix caching off and see re-prefill of the growing history wreck both latency and cost; raise the call count and watch 1 − 0.99ⁿ push the p95 estimate past 60 s even when the median looks fine.

Per-call latency = generation + (prefix-cache off → re-prefill of the full growing history, ~30K/11.3k tok/s ≈ 2.7 s extra/call). Task median ≈ n·(call median + tool). p95 estimate adds the tail term (1−0.99ⁿ)·(call p99 − median) per lesson 03. Order-of-magnitude per the series' ±30% contract.

Inputs (defaults): calls / task (n) = 20; per-call median = 800 ms; per-call p99 = 5000 ms; avg tool latency = 1500 ms; prefix cache = on.

Outputs: task median; task p95 (est.); P(task hits a p99 call); verdict vs 60 s.
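For readers without the interactive version, the estimate described above can be reproduced in a few lines. This sketch follows the stated formulas literally, including the simplification that a cache-off call re-prefills a full 30K history every time. The function name, the returned dictionary keys, and the 30K default are illustrative choices, and the p95 figure is the lesson's order-of-magnitude heuristic, not an exact quantile.

```python
PREFILL_TOK_PER_S = 11_300  # prefill throughput used in the text


def estimate_task(n: int, call_median_ms: float, call_p99_ms: float,
                  tool_ms: float, prefix_cache: bool = True,
                  history_tokens: int = 30_000, slo_s: float = 60.0) -> dict:
    """Task median and p95 estimate per the formulas stated in the text."""
    # Cache off: every call re-prefills the full history (text's simplification).
    reprefill_ms = 0.0 if prefix_cache else history_tokens / PREFILL_TOK_PER_S * 1e3
    median_s = n * (call_median_ms + reprefill_ms + tool_ms) / 1e3
    p_tail = 1.0 - 0.99 ** n
    p95_s = median_s + p_tail * (call_p99_ms - call_median_ms) / 1e3
    return {"median_s": median_s, "p95_s": p95_s,
            "p_tail": p_tail, "meets_slo": p95_s < slo_s}


for label, kwargs in [("defaults", {}),
                      ("cache off", {"prefix_cache": False}),
                      ("n = 26", {"n": 26})]:
    args = {"n": 20, "call_median_ms": 800, "call_p99_ms": 5000, "tool_ms": 1500}
    args.update(kwargs)
    r = estimate_task(**args)
    print(f"{label:9s} median {r['median_s']:5.1f} s  p95 est {r['p95_s']:5.1f} s  "
          f"P(tail) {r['p_tail']:.0%}  meets SLO: {r['meets_slo']}")
```

With the default inputs the task fits inside 60 s; turning the prefix cache off or raising n to 26 pushes the estimate over the line.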

What this case teaches

The wall moved up a level — to composition. A task isn't one call; it's a sequential sum, so it inherits per-call p99 via lesson 03's 1 − 0.99ⁿ. The only levers on task latency are fewer calls and a tighter call-level tail. Validate at call p99, not median.
Statefulness is the design. Prefix-caching the agent history is a ~100× win conditional on sticky routing + KV residency (~9.6 GB/task), which caps task concurrency and turns failover, drain, and autoscale into the hard problems (lessons 05, 11) — unlike the stateless cases 13–15.
Multiplex tasks across tool waits for utilization. Sequential tool latency idles a task's GPU; continuous batching across many tasks (lesson 04) recovers it — so high per-replica task concurrency is good even though each task mostly waits, in tension with the residency cap.
Contrast. Same ordinary 70B as the capstone; because the workload is a sequence of stateful calls with non-overlapping tool waits, the binding wall is end-to-end tail + statefulness — a scheduling/reliability problem, not the throughput or TTFT walls of every prior case.