Part VI - Knowledge, communication, and optimization
Resource-aware optimization - cost, latency, budgets
Up to now the agent has cared only about whether it reaches the goal. This lesson adds the second axis every production system is graded on: at what cost. Compute, wall-clock time, money, context window, tool quotas, and human attention are all finite. Resource-aware optimization is the policy layer that spends each of them deliberately — choosing the cheapest path that is still good enough, and degrading gracefully instead of failing when a budget runs dry.
New capability: a budget-aware policy that, for each step, picks the model, tool, context size, retry depth, and fallback path that meets the quality bar at the lowest resource spend — and that can shed quality on purpose when constraints tighten.
1 · Why "always use the best model" is a bug, not a safety margin
The book opens with a distinction worth internalizing. Ordinary planning (lesson 08) cares about the sequence of actions — which step comes next. Resource-aware optimization cares about how each action is executed under a budget: more accurate but expensive model, or faster and cheaper one? Spend extra compute for a finer answer, or return a coarser answer now? The agent is no longer just choosing what to do; it is choosing how much to spend doing it.
The book's anchor example is an agent serving a financial analyst over a large dataset. If the analyst wants a quick preliminary read, the agent summarizes the key trends with a fast, cheap model. If the analyst needs a high-precision forecast to back a major investment decision — and has the budget and time — the agent allocates more resources and reaches for the slower, stronger model. Same agent, same goal shape, radically different resource policy depending on the stakes and constraints of the moment.
The mental model: think of the agent as a contractor with a fixed budget for a job. A good contractor does not hire the master craftsman to hang a single picture frame, and does not hand a load-bearing wall to the cheapest day laborer. The skill is matching the worker to the task. An agent that routes every request to the strongest model is the contractor who hires the master for everything — it will produce fine work and then go bankrupt. "Use the best model everywhere" feels safe but is the most common and most expensive anti-pattern in the field.
2 · The routing agent — the core mechanism, with the arithmetic
The book's standardized solution is a routing agent: a lightweight front component that first classifies the incoming request's complexity, then dispatches it to the most appropriate and economical downstream model or tool. Simple questions go to a fast, cheap model; complex reasoning goes to a strong, expensive one. The book's concrete pairing is Google's Gemini 2.5 Flash (the economical model) versus Gemini 2.5 Pro (the high-capability model), wired together with the Google Agent Development Kit (ADK), whose multi-agent orchestration supports exactly this kind of dynamic, LLM-driven routing.
The book sketches two routers. The trivial one uses a cheap heuristic — query length in words:
# Google ADK style router (conceptual). The book's simplest signal: word count.
class QueryRouterAgent(BaseAgent):
    async def _run_async_impl(self, ctx):
        query = ctx.current_message.text
        if len(query.split()) < 20:  # short → assume simple
            return await gemini_flash_agent.run_async(ctx.current_message)
        else:  # long → assume complex
            return await gemini_pro_agent.run_async(ctx.current_message)
The book is honest that word count is a placeholder. A stronger router uses an LLM or a fine-tuned classifier to judge complexity directly — sending fact-recall queries to Flash and deep-analysis queries to Pro — and notes that you can prompt-engineer or fine-tune the router itself on a dataset of (question → best model) pairs to raise its accuracy. The router is a model whose only job is to allocate the other models well.
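A minimal sketch of the classifier-driven router described above, under stated assumptions. The names `Route`, `route_query`, and `keyword_classifier` are hypothetical, and the keyword classifier is only an offline stand-in for the LLM or fine-tuned classifier call; it is not the book's implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Labels for the cheap and strong tiers named in the text.
CHEAP_MODEL = "gemini-2.5-flash"
STRONG_MODEL = "gemini-2.5-pro"


@dataclass(frozen=True)
class Route:
    model: str
    label: str   # complexity label produced by the classifier
    reason: str  # kept so the decision can be written to a trace


def keyword_classifier(query: str) -> str:
    """Stand-in for an LLM classifier: returns 'simple' or 'complex'."""
    analysis_markers = ("compare", "forecast", "why", "trade-off", "analyze")
    text = query.lower()
    return "complex" if any(m in text for m in analysis_markers) else "simple"


def route_query(query: str,
                classify: Callable[[str], str] = keyword_classifier) -> Route:
    """Classify first, then map the label to a model tier."""
    label = classify(query)
    if label == "simple":
        return Route(CHEAP_MODEL, label, "classifier judged the query fact-recall")
    # Unknown labels fall through to the strong model. This is an assumption:
    # over-spending is treated as the safer failure when the classifier misbehaves.
    return Route(STRONG_MODEL, label, "classifier judged the query complex or unknown")
```

Swapping in a real classifier means passing a different `classify` callable; the routing table itself does not change. Sending unknown labels to the strong tier trades cost for safety, and the opposite default is defensible when a critique step sits downstream.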
Worked numbers: what the route actually saves
Routing is only worth its complexity if the arithmetic is large, so let us do it. Take an illustrative price gap of strong model at $10 per million input tokens / $30 per million output versus cheap model at $0.50 / $1.50 — a 20× gap, typical of a flagship-vs-mini pairing. Suppose a daily workload of 100,000 requests, each averaging 800 input + 400 output tokens.
cost per cheap request = 800·($0.50/1e6) + 400·($1.50/1e6) = $0.00040 + $0.00060 = $0.0010
cost per strong request = 800·($10/1e6) + 400·($30/1e6) = $0.0080 + $0.0120 = $0.0200
Everything on the strong model: 100,000 × $0.0200 = $2,000/day (~$730k/year).
Now route. Suppose classification finds 70% of requests are "simple" (safe on the cheap model) and 30% need the strong one. Add the router's own classification call — say it runs on the cheap model at the same $0.0010 each, so 100,000 × $0.0010 = $100/day of routing overhead.
routed cost = router overhead + 70,000 × $0.0010 + 30,000 × $0.0200
= $100 + $70 + $600 = $770/day
That is a 61% reduction ($2,000 → $770), and the router overhead of $100 is dwarfed by the $1,230 it saves. The lesson the numbers teach: the router pays for itself the moment a meaningful fraction of traffic is genuinely simple, and the savings scale with both the price gap and the simple-traffic share.
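The same arithmetic as a short script, so the two dials (price gap and simple-traffic share) can be changed and re-run. The function names are illustrative. The break-even formula is derived here rather than taken from the book: routing wins once the simple share exceeds router cost divided by the strong-minus-cheap cost per request.

```python
def request_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one request given per-million-token prices."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000


def daily_costs(requests, simple_share, cheap, strong, router):
    """Return (all_strong, routed) daily cost in dollars."""
    all_strong = requests * strong
    per_request = router + simple_share * cheap + (1 - simple_share) * strong
    return all_strong, requests * per_request


def break_even_simple_share(cheap, strong, router):
    """Simple-traffic share above which routing is cheaper than all-strong."""
    return router / (strong - cheap)


cheap = request_cost(800, 400, 0.50, 1.50)    # $0.0010 per request
strong = request_cost(800, 400, 10.0, 30.0)   # $0.0200 per request
all_strong, routed = daily_costs(100_000, 0.70, cheap, strong, router=cheap)

print(f"all-strong ${all_strong:,.0f}/day, routed ${routed:,.0f}/day")
print(f"saving {1 - routed / all_strong:.1%}")
print(f"break-even simple share {break_even_simple_share(cheap, strong, cheap):.1%}")
```

With the worked numbers the break-even share is about 5%, which is why a 70% simple share leaves so much room: the router only has to divert roughly one request in nineteen to cover its own cost.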
3 · Fallback chains and the critique agent — reliability and quality control
Routing for cost is only half the chapter. The book pairs it with two mechanisms that keep the optimized system trustworthy.
Fallback chains: surviving the real world
The book's recurring reliability strategy is the fallback mechanism: when the preferred model is unavailable — overloaded, rate-limited, or its output is filtered — the system automatically switches to a default or more economical model so the service continues instead of failing outright. This is graceful degradation at the model layer.
The chapter's concrete vehicle is OpenRouter, a unified gateway over hundreds of models that offers two routing styles:
- Automatic selection ("model": "openrouter/auto"): the gateway picks the best available model for the request content and, crucially, returns metadata naming the model that actually served it.
- Ordered fallback ("models": ["anthropic/claude-3.5-sonnet", "gryphe/mythomax-l2-13b"]): try the first; on failure (outage, rate limit, content filter) fall to the next, until one succeeds or the list is exhausted. The final cost and model id are those of whichever model actually completed the request.
# Ordered fallback as a budget-and-reliability policy.
# call, state, and the exception classes are assumed to come from the host application.
PREFERRED = ["anthropic/claude-3.5-sonnet", "openai/gpt-4o-mini", "local/llama-3-8b"]

def call_with_fallback(prompt, state):
    for model in PREFERRED:
        try:
            resp = call(model, prompt, timeout=8)
            state.trace.record(requested=PREFERRED[0], realized=model, cost=resp.cost)
            return resp
        except (RateLimited, ServiceDown, Filtered):
            continue  # degrade to the next, cheaper/available option
    raise AllModelsExhausted()  # only now do we truly fail
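The snippet above leans on helpers it does not define. Below is a self-contained sketch of the same ordered-fallback idea that runs offline. The exception classes, `Response`, `Trace`, and `fake_call` are all hypothetical stand-ins, and the model names in the usage are placeholders rather than real model ids.

```python
from dataclasses import dataclass, field


class RateLimited(Exception): ...
class ServiceDown(Exception): ...
class Filtered(Exception): ...
class AllModelsExhausted(Exception): ...


@dataclass
class Response:
    text: str
    cost: float


@dataclass
class Trace:
    events: list = field(default_factory=list)

    def record(self, **event):
        self.events.append(event)


def call_with_fallback(models, prompt, call, trace):
    """Try each model in order; on success record requested vs realized."""
    failures = []
    for model in models:
        try:
            resp = call(model, prompt)
        except (RateLimited, ServiceDown, Filtered) as exc:
            failures.append((model, type(exc).__name__))
            continue  # degrade to the next option in the list
        trace.record(requested=models[0], realized=model,
                     cost=resp.cost, skipped=failures)
        return resp
    raise AllModelsExhausted(f"no model succeeded: {failures}")


def fake_call(model, prompt):
    """Hypothetical backend in which the preferred model is rate-limited."""
    if model == "preferred-model":
        raise RateLimited(model)
    return Response(text=f"[{model}] answer", cost=0.001)


trace = Trace()
resp = call_with_fallback(["preferred-model", "backup-model"], "hi", fake_call, trace)
print(resp.text, trace.events[0]["realized"])
```

Recording the skipped models alongside the realized one is a design choice of this sketch: it lets a dashboard show both who answered and why the preferred model did not.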
The critique agent: catching silent quality loss
The danger of an aggressive router is invisible: it does not crash, it just returns slightly worse answers and a smaller bill. The book's answer is a dedicated critique agent, a component whose system prompt makes it the quality-assurance reviewer of the answering agent's output. The book gives it a concrete job: scrutinize research results for factual correctness, completeness, and bias; flag missing data and inconsistent reasoning; and feed that judgment back to improve the system.
The critique agent does not manage the budget directly. It improves resource allocation indirectly: if it keeps finding that Flash answers are inadequate for a class of question, that signal is used to retune the router, or even to fine-tune the routing policy or train it with reinforcement learning. Conversely, if it finds Pro was overkill for questions Flash would have nailed, the router has learned to over-spend and should be retuned in the other direction. In the book's words, the critique agent catches the two routing errors that matter: simple questions sent to the expensive model (waste) and complex questions sent to the cheap model (quality debt).
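One way the feedback loop could be wired, as a sketch only: aggregate the critique agent's pass/fail verdicts per (question class, model) and flag the classes where the cheap model fails too often. The function names, the 20% failure threshold, and the five-sample minimum are illustrative assumptions, not values from the book.

```python
from collections import defaultdict


def route_failure_rates(verdicts):
    """verdicts: iterable of (question_class, model, passed) tuples.

    Returns {(question_class, model): (failure_rate, sample_count)}.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for question_class, model, passed in verdicts:
        totals[(question_class, model)] += 1
        if not passed:
            failures[(question_class, model)] += 1
    return {key: (failures[key] / totals[key], totals[key]) for key in totals}


def classes_to_escalate(verdicts, cheap_model, max_failure_rate=0.2, min_samples=5):
    """Question classes the router should start sending to the strong model."""
    flagged = []
    for (question_class, model), (rate, count) in route_failure_rates(verdicts).items():
        if model == cheap_model and count >= min_samples and rate > max_failure_rate:
            flagged.append(question_class)
    return sorted(flagged)
```

This covers only the quality-debt direction. Detecting waste (Pro used where Flash would have sufficed) needs a different signal, for example occasionally replaying strong-model requests on the cheap model and critiquing both answers.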
The book's three-way OpenAI router, end to end
The chapter's fullest worked code is a question-answering system that classifies every prompt into one of three buckets before choosing a model — the cleanest illustration of routing as a complexity-to-resource map:
| Class | Meaning | Model chosen | Extra resource |
|---|---|---|---|
| simple | Direct factual answer, no reasoning or fresh data | gpt-4o-mini (cheapest) | none |
| reasoning | Logic, math, or multi-step thinking | o4-mini (reasoning model) | none |
| internet_search | Needs current / post-training info | gpt-4o (strong) | Google Custom Search call, results injected as context |
# The book's pattern: classify first (cheap), then bind model + tools to the class.
def handle_prompt(prompt):
    cls = classify_prompt(prompt)["classification"]  # one cheap call returns JSON
    search = google_search(prompt) if cls == "internet_search" else None
    answer, model = generate_response(prompt, cls, search)
    return {"classification": cls, "response": answer, "model": model}  # model recorded
Note three things the book builds in: the classifier returns strict JSON (so routing is parseable, not vibes); a tool — Google search — is attached only to the class that needs it (paying for a search API only when freshness is required); and the realized model is returned in the result for attribution.
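Strict JSON only helps if the caller validates it. The sketch below shows one way to parse the classifier's reply and bind it to the model table above. `parse_classification`, `plan_for`, and the choice of `reasoning` as the default class are assumptions made for illustration; the book's code may handle malformed output differently.

```python
import json

# Class-to-model map taken from the table in the text.
MODEL_FOR_CLASS = {
    "simple": "gpt-4o-mini",
    "reasoning": "o4-mini",
    "internet_search": "gpt-4o",
}

# Assumption: when the classifier output is unusable, prefer the reasoning
# route over the cheapest one, so a parse failure does not silently cut quality.
DEFAULT_CLASS = "reasoning"


def parse_classification(raw):
    """Return a known class name, or DEFAULT_CLASS if the reply is malformed."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return DEFAULT_CLASS
    label = data.get("classification") if isinstance(data, dict) else None
    return label if label in MODEL_FOR_CLASS else DEFAULT_CLASS


def plan_for(raw):
    """Bind model and tool usage to the validated class."""
    cls = parse_classification(raw)
    return {
        "classification": cls,
        "model": MODEL_FOR_CLASS[cls],
        "needs_search": cls == "internet_search",  # pay for search only here
    }
```

For example, `plan_for('{"classification": "simple"}')` selects `gpt-4o-mini` with no search call, while a reply that is not valid JSON falls back to the default class instead of raising.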
4 · Beyond model switching — the full optimization taxonomy
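The chapter's real teaching payload is that dynamic model switching is just the first item in a spectrum. A mature resource manager is a policy over many levers. The book enumerates them; here they are with the running coding/research agent in mind:
- Dynamic model switching (the router of §2)
- Adaptive tool selection (attach a paid tool only where it is needed, as with the search call in §3)
- Context pruning and summarization (worked example below)
- Proactive workload prediction
- Cost-sensitive exploration
- Energy-aware deployment
- Distributed compute
- Learned allocation policies
- Graceful degradation and fallback (§3 and the running example)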
Worked example: context pruning with a token budget
Context is a priced, finite resource, so it gets its own budget. Suppose the coding agent's window is 128k tokens and its retrieved working set has grown to 90k tokens of file contents and tool logs. A single Pro call at $10/M input then costs 90,000 × $10/1e6 = $0.90 just to read the context, on every turn. Prune: summarize the 60k tokens of old tool logs down to a 4k-token running summary, and keep the 30k tokens of source files verbatim because they are the evidence. New context = 34k tokens → $0.34/turn, a 62% cut. The discipline is in what you keep: pruning that discards a citation or a safety check is not optimization, it is corner-cutting (see the failure modes below).
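The token accounting in this example can be sketched as follows. Only the bookkeeping is modelled: the summarization call itself is not shown (and its own cost is ignored), and `ContextItem`, `prune_context`, and `input_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    kind: str    # "source" = evidence kept verbatim, "log" = summarizable
    tokens: int


def prune_context(items, summary_tokens=4_000):
    """Keep evidence verbatim; collapse all logs into one running summary."""
    kept = [item for item in items if item.kind != "log"]
    logs = [item for item in items if item.kind == "log"]
    log_tokens = sum(item.tokens for item in logs)
    if log_tokens > summary_tokens:
        kept.append(ContextItem("log_summary", summary_tokens))
    else:
        kept.extend(logs)  # already small enough, nothing to gain
    return kept


def input_cost(items, price_per_m=10.0):
    """Dollar cost of reading the context once at a per-million-token price."""
    return sum(item.tokens for item in items) * price_per_m / 1_000_000


context = [ContextItem("source", 30_000), ContextItem("log", 60_000)]
pruned = prune_context(context)
print(f"${input_cost(context):.2f} -> ${input_cost(pruned):.2f} per turn")
```

Tagging items by kind is what keeps the pruning honest: anything marked as evidence is never a candidate for summarization, so a citation cannot be dropped by the budget logic alone.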
5 · Make every decision auditable — or it is just quality debt
The chapter is emphatic on one governance point, and it is the rule that separates real optimization from cost-cutting theater. A strong implementation records every resource decision in the trace: why this model, why this context size, why this tool, why this fallback fired. And evaluation (lesson 21) must compare routes, not only final answers.
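A sketch of what a per-decision trace record might contain, plus a cost roll-up keyed by the model that actually ran. The field names and the `cost_by_route` helper are assumptions made for illustration; the book does not prescribe a schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceDecision:
    step: str
    requested_model: str
    realized_model: str
    context_tokens: int
    cost_usd: float
    reason: str  # why this model, context size, tool, or fallback

    @property
    def fell_back(self) -> bool:
        return self.requested_model != self.realized_model


def cost_by_route(decisions):
    """Attribute calls and cost to the realized model, not the requested one."""
    summary = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "fallbacks": 0})
    for decision in decisions:
        row = summary[decision.realized_model]
        row["calls"] += 1
        row["cost_usd"] += decision.cost_usd
        row["fallbacks"] += int(decision.fell_back)
    return dict(summary)
```

Keying the roll-up on `realized_model` is the point: a dashboard keyed on the requested model would keep showing the preferred model during an hour-long fallback.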
Running example — the research / coding agent, made resource-aware
Threading the whole chapter through one agent: a research assistant uses a small model to classify each source (relevant? primary? paywalled?), a strong model only to synthesize conflicting evidence, and deterministic, code-based citation checks before emitting the final answer (free compared to an LLM verifier). It carries a per-task budget in state — say $0.50 and 12 seconds. When the budget tightens mid-task it degrades on purpose: switch synthesis from Pro to Flash, prune old tool logs to a summary, and label the result "preliminary — confidence reduced under budget" rather than overrunning or failing. A critique agent samples outputs and, when it finds Flash syntheses are too thin for multi-source conflicts, feeds that back to make the router send conflict-heavy tasks to Pro. Every hop records its model, tokens, and cost, so the eval harness can answer the only question that matters: did the cheaper route stay correct?
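The degrade-on-purpose behaviour can be sketched as a budget object carried in task state and a planner that consults it before the synthesis step. The per-step cost and latency estimates below are invented for illustration, as are the names `Budget`, `StepPlan`, and `plan_synthesis`.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    dollars: float
    seconds: float

    def charge(self, dollars, seconds):
        self.dollars -= dollars
        self.seconds -= seconds


@dataclass(frozen=True)
class StepPlan:
    model: str
    prune_logs: bool
    label: str


# Hypothetical (dollars, seconds) estimates for one synthesis step.
STRONG_STEP = (0.20, 5.0)
CHEAP_STEP = (0.02, 1.5)


def plan_synthesis(budget):
    """Pick the best synthesis plan the remaining budget can afford."""
    if budget.dollars >= STRONG_STEP[0] and budget.seconds >= STRONG_STEP[1]:
        return StepPlan("pro", prune_logs=False, label="final")
    if budget.dollars >= CHEAP_STEP[0] and budget.seconds >= CHEAP_STEP[1]:
        return StepPlan("flash", prune_logs=True,
                        label="preliminary: confidence reduced under budget")
    # Nothing affordable: stop cleanly instead of overrunning.
    return StepPlan("none", prune_logs=True,
                    label="budget exhausted: returning partial results")


budget = Budget(dollars=0.50, seconds=12.0)
print(plan_synthesis(budget).model)   # enough budget for the strong model
budget.charge(0.40, 8.0)              # earlier steps consumed most of it
print(plan_synthesis(budget).model)   # downshifted plan
```

The label travels with the plan so the final answer can state that it was produced under a reduced budget, which is what separates deliberate degradation from a silent quality drop.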
Failure modes
- Strong-model-everywhere: routing is hard, so the team defaults to the flagship for all traffic and burns the budget (our worked example: $2,000/day vs $770).
- Over-aggressive routing: the threshold sends complex work to the cheap model; answers degrade silently because nothing crashes. (The critique agent exists for this.)
- Context pruned past the evidence: summarization cuts tokens until the citation or constraint the answer depends on is gone.
- Untracked fallback: a fallback fires for an hour; dashboards still show the preferred model; cost and quality attribution are both wrong.
- Cost optimized without a quality gate: savings ship with no per-route regression test, so quality debt accrues invisibly.
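The last failure mode suggests a per-route regression gate. A minimal sketch, assuming eval scores are already aggregated per route as accuracies between 0 and 1; the function name and the 0.02 tolerance are illustrative choices, not figures from the book.

```python
def route_regressions(baseline, candidate, tolerance=0.02):
    """Return routes whose score dropped by more than `tolerance`.

    baseline, candidate: {route_name: accuracy}. A route missing from the
    candidate run is reported too, since an unevaluated route is unverified.
    """
    regressions = {}
    for route, base_score in baseline.items():
        new_score = candidate.get(route)
        if new_score is None or base_score - new_score > tolerance:
            regressions[route] = (base_score, new_score)
    return regressions


baseline = {"simple": 0.95, "multi_source_conflict": 0.90}
candidate = {"simple": 0.95, "multi_source_conflict": 0.80}
print(route_regressions(baseline, candidate))
```

In this made-up example the overall average falls by only five points, which a blended metric could easily pass; comparing per route exposes the ten-point drop on the conflict-heavy route.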
Implementation checklist
- What budgets exist (tokens, dollars, latency, retries, human escalations), and are they in task state?
- Which steps can use a cheaper model or tool, and which truly need the strong one?
- What caches / cheaper tools exist, and does the router prefer them when sufficient?
- Is there a fallback chain, and does the trace record the realized model, not the requested one?
- How is cost attributed per request and per route?
- Do evals compare quality by route, not just final answers?
- What is the graceful-degradation path when each budget is exhausted?
Where this points next
We have learned to spend less by routing easy work to cheap models and hard work to strong ones. But notice the asymmetry: the "complex" route still treats the strong model as a black box and just lets it think once. The next lesson, 19 · Reasoning techniques — decomposition, search, verification, is the other side of this coin. Where resource-aware optimization decides how little compute to spend, reasoning is about spending extra compute deliberately — decomposing a hard problem, searching over candidate solutions, and verifying before committing — precisely on the high-stakes "complex" path this lesson's router just identified. Routing tells you when a problem is worth more compute; reasoning tells you how to spend it well.
Interview prompts
- Why is "always use the strongest model" an anti-pattern rather than a safe default? (§1, §2 — it is pure waste on the large fraction of simple traffic; routing easy work to a cheap model cut our worked workload from $2,000 to $770/day, a 61% saving, because the router overhead is dwarfed by the gap it captures.)
- How does a routing agent work, and what signal does it route on? (§2 — it classifies request complexity first, then dispatches: cheap/fast model for simple, strong/expensive for complex. The signal can be a heuristic like query length, an LLM classifier, or a fine-tuned router trained on (question → best model) pairs.)
- What is a fallback chain and why must you record which model actually answered? (§3 — an ordered list of models tried until one succeeds, surviving rate limits/outages/filtering. OpenRouter returns the realized model id; without it you cannot attribute cost, reproduce results, or eval by route — the dashboard lies.)
- What problem does the critique agent solve that the router cannot? (§3 — silent quality loss. The router does not crash on a bad route; the critique agent reviews answers for correctness/completeness/bias and feeds that back to retune routing, catching both waste and quality debt.)
- Beyond model switching, name three more resource levers and one risk of context pruning. (§4 — adaptive tool selection, context pruning/summarization, proactive workload prediction, cost-sensitive exploration, energy-aware deployment, distributed compute, learned allocation, graceful degradation. Risk: pruning past the evidence — dropping a citation or safety check the answer depends on.)
- How do you tell an optimization apart from quality debt? (§5 — record every resource decision (explainable, measurable, reversible) and run evals that compare quality per route. A cheaper route that drops a citation, verification, or safety check is debt, not savings; without a per-route quality gate you cannot distinguish them.)
- Your agent has a $0.50 / 12s budget and is about to overrun. What does graceful degradation look like? (§4, running example — downshift the model (Pro→Flash), prune old logs to a summary while keeping evidence, and return a coarser answer labeled with reduced confidence — never a hard crash or silent overrun.)