Part III - Action and grounding
Tool use - function calling as controlled action
Every pattern so far kept the agent inside language: it chained prompts, routed, ran branches in parallel, and critiqued its own drafts. But a model that only emits text cannot read today's stock price, run your test suite, or send an email. Tool use is the pattern that lets the loop leave language — it gives the model a controlled, audited channel for taking real action and bringing back facts it could never have known.
New capability: a validated action channel that turns a structured model decision into a call against a live external system, and turns the system's reply into an observation the loop can reason over.
run_anything. We close with the broader "tool calling" view (a tool can be an API, a database, even another agent) that lesson 08's planner and lesson 09's multi-agent system will lean on.1 · The wall: a model is sealed off from the world
An LLM is a powerful text generator, but it is, in the book's framing, fundamentally isolated from the outside world: its knowledge is static, frozen at training time, and it cannot perform an action or fetch a live fact. Ask it "what is London's weather right now?" and it can only guess from stale training data; ask it "what is the exact profit on 100 shares of AAPL?" and it will produce a confident, plausibly-formatted, and frequently wrong number, because arithmetic over fresh data is exactly what next-token prediction is bad at.
Tool use (implemented through the function-calling mechanism) is the standardized way over this wall. We describe a set of external functions to the model; the model, reading the user request and the available tool descriptions, decides whether a tool is needed and emits a structured request naming the tool and its arguments; an orchestration layer outside the model executes the real function and feeds the result back. The model never runs anything itself — it only ever produces text. The text just happens to be a machine-readable instruction that something else acts on.
get_stock_price with ticker="AAPL""), but a trusted operator outside the room is the one who actually pulls it and reads the result back through the phone. Function calling is the phone protocol; the operator is your orchestrator; the levers are your tools. Safety lives entirely in the operator, never in the analyst's good intentions.2 · The six-step round-trip
The book lays out tool use as a precise loop. Memorize the seam in the middle — it is where every safety and reliability decision lives.
- Tool definition. Describe each external function to the model: its purpose, name, argument types, and meaning. This is a prompt-time fact — the descriptions become part of the context the model reasons over.
- LLM decision. Given the request plus the tool catalog, the model judges whether one or more tools are needed.
- Call generation. If yes, the model emits structured output (typically JSON) naming the tool and extracting the arguments from the request.
- Tool execution. The framework / orchestration layer intercepts that structured output, identifies the requested tool, and actually runs the external function with the supplied arguments. This is outside the model.
- Observation / result. The output (or error) is returned to the agent.
- LLM processing (optional but usual). The model takes the observation as new context and produces the final reply, or decides to call another tool, reflect (lesson 06), or answer directly.
The dashed line between (3) and (4) is the entire lesson. Up to step 3, all you have is a string that claims to be a tool call — it has the trust level of any other model output, which is to say none. Steps 4 and 5 are ordinary software: parsing, validation, permission checks, execution, error handling. The discipline of tool use is refusing to let the model's intent become an action without that software in between.
3 · The tool contract — the real teaching object
The book's deepest point is that tool use is an interface design problem, not an API trick. The model only knows what the tool's description tells it, so the description is the API. A good tool contract pins down seven things:
| Field | What it answers | Why the model / orchestrator needs it |
|---|---|---|
| name | Which lever? | The token the model emits; must be unambiguous (get_stock_price, not tool3). |
| description | When to call it? | The model's only basis for the step-2 decision. "Use for questions like 'capital of France' or 'weather in London'" is the book's own style — examples beat adjectives. |
| arg schema | What goes in? | Types + meaning so the model can extract args from the request and the orchestrator can validate them. |
| side-effect class | Read-only? Reversible write? Irreversible? | Decides whether a dry-run, log, or human approval gate is required. |
| permissions | Allowed in this task / for this user? | Enforced by the orchestrator, outside the model — never by prompt text. |
| timeout | How long before we give up? | A hung tool must become a recoverable error (lesson 14), not a stuck agent. |
| error shape | What does failure look like? | Errors are observations too. The book's CrewAI tool raises ValueError when a ticker is missing — so the model can say "I couldn't find that price" instead of inventing one. |
That last row is worth dwelling on. In the book's CrewAI example, get_stock_price("AAPL") returns 178.15, but an unknown ticker raises an exception, and the task explicitly instructs the agent to say so plainly rather than guess. Designing the failure path is as much a part of the contract as the success path — an error that reaches the model as a clean, typed observation is recoverable; one that is swallowed or truncated becomes a silent wrong answer.
Frameworks, briefly. The chapter shows three. LangChain wraps a plain Python function with a @tool decorator (the docstring becomes the description), then create_tool_calling_agent + an AgentExecutor manage the round-trip — its search_information tool is a stubbed fact lookup. CrewAI attaches a @tool-decorated function to an Agent with a role/goal/backstory and runs it inside a Crew — the stock-price analyst above. Google ADK ships built-in tools you bind directly: google_search for web lookup, a BuiltInCodeExecutor that runs Python in a sandbox for exact arithmetic, and VSearchAgent for Vertex AI Search over a private datastore with source attribution. The abstractions differ; the six-step loop underneath is identical.
4 · The economics — observations and retries cost real money
Tool use is not free. Every observation you feed back in step 5 becomes input tokens on the next model call, and the whole conversation (system prompt + tool catalog + every prior call and observation) is re-sent on each turn because the model is stateless. This is where naive tool design quietly bankrupts a project, so let us do the arithmetic.
Worked example — the coding agent reads a file. Suppose the agent calls read_file on a 1,200-line source file. At roughly 10 tokens per line that observation is about 12{,}000 tokens. The agent then needs three more reasoning turns to finish the task. Because context is re-sent every turn, that one fat observation is paid for four times:
Now contrast a read_file_slice(path, start, end) tool that returns only the 40 relevant lines (~400 tokens):
At a representative input price of $3 per million tokens, the fat version costs 48{,}000 / 10^6 × \$3 ≈ \$0.14 for that single file read, the lean version ≈ \$0.005. Multiply by thousands of files across thousands of sessions and the difference is the company's cloud bill. This is why the book insists the observation be designed for the next decision, not for raw completeness.
Retries multiply everything. Give a flaky tool a retry budget of 3. If each attempt round-trips ~5,000 tokens of context, a tool that fails twice before succeeding spends 3 × 5{,}000 = 15{,}000 tokens and triples the latency. A retry budget is a real budget — cap it, and prefer a clean error observation (which the model can route around) over silent infinite retries. The widget below lets you feel both knobs at once.
5 · Running example — a typed, auditable tool set for the coding agent
Tie it together with the track's running coding/research assistant. The lazy design is one tool: run_anything(command), "protected" by a line in the prompt that says please be careful. That is the book's canonical anti-pattern — a broad capability gated only by text the model can ignore or be talked out of. The disciplined design replaces it with a small set of narrow, typed, side-effect-labelled tools:
tools = [
{"name": "list_files", "args": {"dir": "str"}, "side_effect": "read_only", "approval": False},
{"name": "read_file_slice", "args": {"path": "str", "start": "int", "end": "int"}, "side_effect": "read_only", "approval": False},
{"name": "apply_patch", "args": {"path": "str", "diff": "str"}, "side_effect": "reversible", "approval": False},
{"name": "run_read_only_tests", "args": {"cmd": "str", "timeout_s": "int=60"}, "side_effect": "read_only", "approval": False},
{"name": "request_approval", "args": {"action": "str", "reason": "str"}, "side_effect": "none", "approval": True},
]
# the seam: model intent -> validated action
request = parse_tool_call(model_output) # step 3 output is just a string
validate_schema(request, tools) # types, required args, ranges
check_permissions(request, task_state) # outside the model
if needs_approval(request): request_human_ok(...) # irreversible -> lesson 15
observation = execute_tool(request, timeout=...) # step 4, with timeout + error capture
task_state.observations.append(trim(observation)) # step 5, sized for the next decision
log.append({"call": request, "obs_hash": hash(observation)}) # replayable audit trail
The model can still act — edit code, run tests — but the action space is enumerable, every argument is validated, every side effect is labelled, anything irreversible routes through a human (lesson 15), and every call is logged with a hash of its observation so the whole episode is replayable. Notice read_file_slice rather than read_file: section 4's economics are baked straight into the tool surface.
Failure modes
- A broad
run_anything/execute_sqltool gated only by prompt text — the prompt is not a permission system. - Hallucinated observations: the model continues as if the tool ran. Force a real result into context before the next step.
- Tool results that hide errors or silently truncate — the model then reasons over a half-truth.
- No side-effect taxonomy, so an irreversible
deleteis treated like a harmlessread. - Fat observations re-sent every turn — cost and latency balloon (section 4).
- Vague descriptions, so the model calls the wrong tool or skips a needed one (the step-2 decision is only as good as the description).
Implementation checklist
- What side effects can the tool produce — read-only, reversible, irreversible?
- Who validates arguments and permissions, and where (it must be outside the model)?
- What does a clean, typed error look like, and does it reach the model as an observation?
- What timeout and retry budget bound the call?
- How large is the observation, and is it trimmed to what the next decision needs?
- Which calls require human approval before execution?
- Is every call logged with its arguments and observation for replay?
Where this points next
We now have a single validated action: the model decides, the orchestrator executes, the observation comes back. But real tasks need many such actions in the right order — read the failing test, find the bug, patch it, re-run the tests — and the model rarely knows that whole sequence up front. The broader "tool calling" view we ended on (a tool can be an API, a database, or even an instruction to another agent) is the hinge: once an agent can call tools, it can call plans. Lesson 08 (Planning — from goal to executable path) takes exactly this step, turning a goal into a multi-step path of tool calls when the path is not known in advance; lesson 09 then lets one of those "tools" be a whole specialist agent.
@tool + executor), CrewAI (tool-equipped agents), and Google ADK (built-in google_search, code executor, Vertex AI Search) all wrap the same loop.Interview prompts
- Walk through the six steps of a tool-use round-trip and point to where control leaves the model. (§2 — define, decide, emit (still inside the model), then the seam: execute and observe happen in the orchestrator, then the model processes the result. Everything after the emitted call is ordinary, untrusted-input software.)
- Why is a tool gated only by a sentence in the system prompt a security bug? (§5 — the prompt is model context, not an enforcement boundary; the model can be argued out of it. Permissions and validation must run in the orchestrator outside the model.)
- The model returns a final answer right after requesting a tool, without waiting. What went wrong? (§2 — hallucinated observation; the loop did not force a real result into context before the next step, so the model invented the tool output.)
- Your agent's token bill is 30× higher than expected on file-heavy tasks. First thing to check? (§4 — observation size. Fat observations are re-sent every turn and re-paid per retry; switch
read_filetoread_file_sliceand trim observations to what the next decision needs.) - Why does the book make an unknown stock ticker raise an error instead of returning null? (§3 — the error shape is part of the contract; a clean typed error becomes a recoverable observation ("I couldn't find that price") instead of the model inventing a number.)
- What's the difference between plain function calling and a Vertex AI Extension? (§3 — same pattern, different owner of execution: function calling hands the structured request back to your client to run; an Extension auto-executes it on the platform with built-in enterprise security.)
- How would you decide whether an agent action needs human approval? (§3, §5 — by its side-effect class; read-only runs freely, reversible writes are logged, irreversible actions route through
request_approvaland lesson 15's human-in-the-loop gate.)