Part I - Foundations
Prompt and context contracts - the interface to the model
Lesson 01 drew a hard line between the model (a stateless function from text to text) and the control system around it. This lesson builds the only wire that crosses that line: the prompt going in and the structured decision coming out. We turn that wire from prose into an engineered contract — typed inputs, a typed output schema, a validation step, and an explicit context budget — because everything downstream (chaining, routing, planning, tools) is just code reading the model's answer, and code cannot read a paragraph.
New capability: A reusable contract for every model call — named input slots, a typed output schema, a validation gate, and a sized context budget — so the next lessons can compose model calls into pipelines without each one re-inventing how to talk to the model.
1 · A prompt is a function signature, not a wish
In lesson 01 we established that the model is a stateless function: tokens in, tokens out, no memory between calls. The instinctive way to use such a function is to write it a paragraph — "You're a support assistant, look at this conversation and figure out what to do." That works in a chat window because a human reads the reply and supplies all the judgment. It fails the instant a program has to read the reply, because the program needs to branch on the answer, and you cannot reliably if on a paragraph.
So the first reframe of this lesson: stop thinking of a prompt as a request and start thinking of it as a function signature. A function signature names its inputs, names its return type, and tells the caller what happens on bad input. A production prompt must do the same three things. The book (Appendix A) opens with this exact spirit — prompt engineering is "structuring requests, providing relevant context, specifying output format" — and it lists the plain principles that make the signature legible to the model:
- Clarity and specificity. The model is a pattern matcher; ambiguity gets resolved by guessing. Name the task, the output format, and the constraints explicitly.
- Conciseness with action verbs. The book gives a literal verb list — Classify, Extract, Summarize, Rank, Return, Translate, Rewrite — because "Summarize the following text" activates a sharper region of the model's training than "Think about summarizing this."
- Positive instructions over constraints. Tell the model what to do, not a wall of "don't." Constraints are for safety and format; an instruction made only of prohibitions makes the model optimize for evasion instead of the goal.
- Iterate. Prompting is empirical: draft, test, inspect failures, refine, and log every attempt. The book's running coffee-machine example refines "write a product description" → "highlight speed and easy cleaning" → "write a description for 'SpeedClean Coffee Pro', emphasize 2-minute brewing and self-cleaning, for busy professionals." Each revision narrows the signature.
These are table stakes. The engineering — the part that turns a chat trick into an agent component — is everything that makes the signature machine-enforceable, which is the rest of this lesson.
2 · The input has two halves: durable policy vs. dynamic context
The most common beginner bug is one giant prompt that mixes everything: the persona, the rules, three examples, the conversation history, the current database row, and the last tool's raw output, all concatenated into a single string rebuilt from scratch on every turn. It is brittle (one edit risks all of it), expensive (you resend the unchanging parts every call), and the model loses the thread. The fix is to split the input by lifetime.
The book draws this line as prompt engineering vs. context engineering. The system prompt is the durable layer: it sets the model's role, rules, tone, and safety stance, and it is the same on every call ("You are a technical writer; tone must be formal and precise"). Context engineering is the dynamic layer: it assembles, per call, the background the model needs right now — conversation history, retrieved documents (RAG), tool outputs, and implicit data like user identity or environment state. The book's thesis is striking and worth internalizing: output quality depends more on the richness of the assembled context than on the model architecture. A frontier model with a thin context view underperforms a weaker model given the right operational picture.
Two payoffs fall out of the split. First, the durable half is identical across calls, so providers can cache it and you stop paying full price to resend your 1,200-token system prompt on every turn (we cost this in §5). Second — the book's warning, restated — a bigger context window is not memory. Pasting the entire transcript into the dynamic half because "the window is huge now" is not context engineering; it is hoarding. Real memory is scoped storage with explicit read/write policies, which is its own pattern (lesson 10, Memory management). For now, treat the dynamic half as something you deliberately assemble and size, not a dumping ground.
--- markers — so the model can tell the instruction from the article from the tool output. A retrieved document that bleeds into your instructions is a classic prompt-injection surface (revisited under Guardrails, lesson 20): untrusted evidence must be visibly fenced off from trusted policy.
3 · The output must be a schema — parse, don't validate
This is the load-bearing idea of the lesson. The book states it directly: in an agent, structured output is what lets the model's answer become the next component's input. Without it, the model's "decision" is prose that some regex or your own re-prompting has to interpret — and that interpretation layer is where agents rot.
Concretely: do not ask the model to "explain what it wants to do." Make it return a typed object whose every field the downstream code already knows how to consume. For our running coding/research assistant, the very first decision each turn is a route, and its contract looks like this:
route_schema = {
"route": "answer_directly | retrieve | call_tool | ask_human | stop",
"tool": "name of the tool to call, required iff route == call_tool",
"args": "object matching the chosen tool's parameter schema",
"confidence": "number in [0, 1]",
"reason": "short, evidence-grounded justification",
"missing_info":"list of facts still needed before acting",
"risk": "low | medium | high"
}
# The model does NOT execute anything. It emits this object;
# the control loop (lesson 01) reads it and acts.
# Invalid JSON is not a route. It is a validation failure.
parsed = parse_json(model_output) # may raise
decision = validate(parsed, route_schema) # may raise
The book's recommended mechanism is Pydantic: define the schema as a typed Python model, then call Model.model_validate_json(llm_output) to parse and validate in one step. A field like confidence: float = Field(..., ge=0, le=1) or email: EmailStr is checked at the boundary; malformed output raises ValidationError instead of flowing onward as a plausible-looking lie.
from pydantic import BaseModel, Field, ValidationError
from typing import Literal, Optional
class Route(BaseModel):
route: Literal["answer_directly","retrieve","call_tool","ask_human","stop"]
tool: Optional[str] = Field(None, description="required iff route == call_tool")
confidence: float = Field(..., ge=0.0, le=1.0)
reason: str
risk: Literal["low","medium","high"] = "low"
try:
decision = Route.model_validate_json(model_output) # parse + validate, one step
except ValidationError as e:
decision = handle_invalid(e) # retry with the error fed back, or escalate
The book names the principle: parse, don't validate at the system boundary. You don't accept a loose dict and sprinkle if-checks downstream; you convert the model's text into a known-good typed object once, at the edge, and everything inside the boundary can trust it. Three concrete wins:
- Hallucination is constrained. An
enum/Literalroute can only ever be one of five values; the model cannot invent a sixth path your loop doesn't handle. - Failure is a caught exception, not a wrong action. Malformed output is rejected at the gate, so it never triggers a tool with garbage arguments.
- Interoperability. A validated
Routeobject passes safely to functions, APIs, and the next pipeline stage with its types intact.
4 · Prompting techniques are fields in the contract, not the architecture
Appendix A is a catalog of techniques, and beginners tend to treat them as a bag of incantations. The framing this track insists on: each technique is a decision about what goes in a slot of the contract, not a replacement for having a contract.
The crucial point: CoT and the schema are not in tension. A common failure is to "add CoT" by letting the model ramble, then trying to parse a decision out of the ramble. Instead, give the reasoning its own field. The model writes its thinking into a reasoning string and its decision into the typed fields; the parser ignores the prose and reads the structure:
class ToolDecision(BaseModel):
reasoning: str # CoT lives here, after the analysis
route: Literal["call_tool","answer_directly","ask_human"]
tool: Optional[str] = None
confidence: float = Field(..., ge=0, le=1)
# The model "thinks step by step" in `reasoning`, then commits to typed fields.
# You keep the reasoning for debugging; control flow reads only the typed part.
So ReAct, few-shot, CoT, RAG, persona — every one of them resolves to "which slot, with what content, returning what schema." The architecture is the contract; the techniques are how you fill it.
5 · The context budget — putting numbers on "just add more context"
"Give the model more context" sounds free. It is not: every token in the prompt costs money and latency, and the dynamic half grows every turn. The book treats context as a managed resource (the explicit subject of lesson 18); here we just learn to size the contract so it cannot quietly blow up. A worked example for our coding/research assistant on a long session:
The naive transcript approach. Suppose instead of a 600-token summary you paste the raw transcript, and it grows ~700 tokens per turn. By turn 20 the dynamic state alone is ~14,000 tokens. Add the durable 2,000 and the 3,000 of evidence and you are sending ~19,000 input tokens per call. At an illustrative $3 / 1M input tokens that is ~$0.057 per call, and you re-pay almost all of it every turn because nothing is cached and the transcript keeps growing — call it ~$1.10 across the 20-turn session, dominated by re-sent history the model barely uses.
The engineered approach. Hold a compressed 600-token state summary instead of the transcript, and cache the 2,000 durable tokens (policy + exemplars). Per-call input is ~5,200 tokens, of which only ~3,200 are charged at full rate after the cache discount. That is roughly $0.011 per call versus $0.057 — about 5× cheaper, plus lower latency because the prompt is a quarter the size and the model isn't hunting for the needle in 14k tokens of chat. The contract's "context budget" is exactly this discipline: decide the token ceiling for each slot, summarize state instead of accumulating it, cache the durable half, and reject calls that would exceed the budget rather than letting them silently degrade.
Running example — from paragraph to contract
Put it together on the support/coding assistant. The user types: "my order hasn't shipped." The wrong design lets the model reply with a paragraph — "I think I may need to look up your order in the database, could you confirm..." — which the control loop then has to interpret. The contracted design forces the model to emit a validated object the loop can act on directly:
# Either a concrete, executable action…
{ "route": "call_tool", "tool": "lookup_order",
"args": { "order_id": "A-2291" }, "confidence": 0.91,
"reason": "user gave order id A-2291 earlier in session", "risk": "low" }
# …or an explicit request for the missing input, never a guess.
{ "route": "ask_human", "tool": null, "args": {},
"confidence": 0.4, "reason": "no order id in context",
"missing_info": ["order_id"], "risk": "low" }
Notice the contract makes uncertainty first-class: a low confidence plus a populated missing_info is a legal, well-typed answer that routes to ask_human — the model is allowed to say "I don't have enough to act," and the loop has a defined place to send that. This is the refusal mode the contract rule demands, and it is the seam into human-in-the-loop (lesson 15).
Checkpoint exercise
Literal field and one numeric range. Then answer the four contract questions explicitly: What fields enter? What schema must leave? What happens on invalid output? Which context is durable vs. dynamic? If you cannot answer all four, the prompt is not yet a component.Failure modes
- The mega-prompt. One string mixing policy, examples, history, and raw tool output — brittle, costly, and the model loses the thread.
- Prose-driven control flow. Code branching on free-form text ("if reply contains 'search'…"); detonates on the first negation.
- Window-as-memory. Pasting the full transcript because the context window is large — cost and latency balloon while relevance drops.
- Unfenced evidence. Retrieved/untrusted text concatenated next to instructions with no delimiter — an injection surface.
- No refusal mode. A schema with no legal way to say "not enough info," so the model fills the gap with a confident guess.
- Unbudgeted context. Slots with no token ceiling, so the dynamic half silently grows until quality and economics both crater.
Implementation checklist
- What named fields enter the prompt, and which are durable vs. dynamic?
- What typed schema must leave it? (Use
Literal/enums + numeric ranges.) - What happens on invalid output — retry with the error fed back, or escalate?
- Is the durable half cached so you don't re-pay for it every call?
- Is evidence fenced with delimiters and tagged with its source?
- Is there a legal "I'm not sure / missing X" path (confidence + missing_info)?
- What is the token budget per slot, and what is the evaluator that grades the output?
Where this points next
We now have a single, reliable model call: a contract with named inputs, a durable/dynamic split, a typed schema gated by a parser, and a sized budget. But one well-formed call rarely solves a real task — "analyze this report, extract the data points, then draft an email" overloads a single call and the model drops steps. The natural next move is to chain contracts: make the typed output of one call the typed input of the next, so a hard task becomes a sequence of focused, individually-validated steps. That dependency — output-as-input, with structured handoffs between stages — is exactly what makes the chain reliable, and it is the subject of lesson 03. The book's market-research pipeline (summarize → identify trends as JSON → draft email) is the canonical first chain, and it only works because each link's output is the contract we just built.
Interview prompts
- Why can't production agent code consume free-form model output? (§3 — code must branch on the answer, and you cannot reliably
ifon prose; an enum/Literal schema gated by a parser turns the answer into a known-good typed object, and malformed output into a caught error rather than a wrong action.) - What is the difference between prompt engineering and context engineering? (§2 — prompt engineering is the durable system layer: role, rules, format; context engineering is the per-call dynamic assembly of history, retrieved docs, tool outputs, and implicit data. The book's claim: output quality depends more on context richness than model architecture.)
- Why is "the context window is huge, so just paste everything" wrong? (§2, §5 — a window is not memory; raw transcripts grow ~linearly per turn, ballooning cost and latency (~$0.057 vs ~$0.011 per call in the worked example) while burying the relevant signal. Use a compressed state summary and cache the durable half.)
- What does "parse, don't validate" mean at the model boundary? (§3 — convert text to a known-good typed object once at the edge, e.g.
Model.model_validate_json, instead of accepting a loose dict and scattering ad-hoc checks downstream; inside the boundary everything can trust the type.) - How do you add chain-of-thought without breaking structured output? (§4 — give the reasoning its own
reasoningfield and the decision its own typed fields; the model thinks in prose, the parser reads only the structure. CoT and schema are complementary, not in tension.) - How should an agent's contract represent uncertainty? (§3, running example — as first-class typed fields: a
confidencein [0,1] plus amissing_infolist and anask_humanroute, so "I don't have enough to act" is a legal, validated answer with a defined destination — the refusal mode the contract rule requires.) - You're seeing intermittent crashes when a tool fires with garbage arguments. Where's the bug? (§3 — a model call that updates state or fires a tool is reaching code without passing a schema gate; add parse+validate before the action so invalid args raise at the boundary and route to retry/escalate.)