Part VII - Reasoning, safety, and evaluation

Reasoning techniques - decomposition, search, verification

Up to now the agent loop has been about structure — routing, tools, memory, budgets. This lesson is about spending compute to think harder on a single hard decision. Reasoning techniques are the named ways an agent can trade more inference-time work for more accuracy: break a problem into steps, branch and backtrack, act-observe-act, debate, self-correct, and — the engineering crux — verify. The book's thesis is that performance no longer depends only on model size; it depends on how much "thinking time" you allocate and how you allocate it.

Book source

Chapter 17 — Reasoning Techniques (推理技术), plus Appendix F (探秘引擎: the reasoning-engine internals). PDF outline pages 186-198 and 287-298. Frameworks and examples named in the chapter: Chain-of-Thought, Tree-of-Thoughts, Self-Correction, ReAct, Program-Aided LMs (Google ADK BuiltInCodeExecutor), Chain/Graph of Debate, RLVR reasoning models, MASS, and the Inference Scaling Laws. Running code: the Google gemini-fullstack-langgraph-quickstart "DeepSearch" agent built on LangGraph.

The plan

Six moves. (1) State the one idea — reasoning is controlled extra computation, governed by an inference scaling law you can watch trade off. (2) Walk the ladder of techniques from cheapest to most expensive: CoT → Self-Consistency → ToT → ReAct → Debate, defining each at first use and saying when it pays. (3) Make the scaling law concrete with a best-of-k widget — see a small model with a verifier beat a bigger model run once. (4) The engineering rule the book hammers: a reasoning method without a verifier is just expensive guessing — so we build the verify step out with real numbers. (5) Externalize reasoning artifacts (assumptions, branches, checks, confidence), not raw chain text. (6) Thread it all through the running coding/research agent and hand off to guardrails.

Linear position

Prerequisite: Lesson 18 (Resource-aware optimization) — you must already think of compute, latency, and dollars as a budget, because every technique here spends that budget. Lesson 06 (Reflection) and lesson 08 (Planning) are the seeds: reflection is single-shot self-critique, planning turns a goal into a path. This lesson generalizes both into a menu of reasoning strategies with explicit cost.
New capability: Choosing and implementing a reasoning strategy — decomposition, search over thoughts, reason-act loops, multi-agent debate, self-correction — gated by a verifier and a thinking budget, with inspectable artifacts left behind in the trace.

1 · The one idea: reasoning is budgeted computation

A standard model call is a single forward pass: prompt in, answer out. For an arithmetic word problem or a multi-hop research question, that single pass often produces a confident wrong answer, because the model has committed to the first token of the answer before it has "worked anything out." Reasoning techniques are interventions that force the model to spend more inference-time computation — more tokens, more candidate paths, more tool round-trips — before committing. The book's framing: at inference time you can allocate variable "thinking time," and for hard problems that allocation buys accuracy, coherence, and robustness.

The governing principle is the inference scaling law (the book calls it 推理扩展定律). Unlike the training scaling law (bigger model, more data → better), the inference scaling law describes a runtime trade-off: by spending more compute when generating an answer — sampling multiple candidates and selecting among them, or searching deeper — even a small model can match or beat a larger model run once. The lever is not better hardware; it is a smarter inference algorithm (diverse sampling, self-consistency, verification). This is the economic heart of the chapter: it challenges "bigger is always better" and says allocate thinking deliberately.

Mental model

Think of a chess player. A weak player who is allowed to calculate three moves ahead and check each line can beat a stronger player forced to move instantly. The model is the player's raw intuition; the reasoning technique is the calculation time; the verifier is the "did this line actually work?" check. More thinking time only helps if you can tell which line was good — which is why verification, not just more tokens, is the real product.

2 · The ladder of techniques, cheapest to most expensive

The chapter is essentially a cost-ordered menu. Each rung spends more compute and unlocks a harder class of problem. Define each as you climb; the engineering job is to stop at the cheapest rung that solves your task.

Technique	What it does	Relative cost	Use when
Direct answer	Single forward pass, no scratchpad.	1×	Easy, low-risk, factual-lookup tasks.
Chain-of-Thought (CoT)	Emit intermediate reasoning steps before the answer; "think step by step." Turns one hard problem into a sequence of easy ones, and makes reasoning auditable.	~2-5× tokens	Arithmetic, multi-step logic, anything where the answer depends on a derivation.
Self-Consistency	Sample k independent CoT chains and take a majority vote on the final answer. Averages out one-off slips.	k× passes	The model is mostly right but occasionally derails; you have a discrete answer to vote on.
Tree-of-Thoughts (ToT)	Explore multiple reasoning branches as a tree, evaluate partial states, backtrack from dead ends, keep the best. Search, not a single line.	k×depth, with pruning	Planning, puzzles, strategy — where you must compare and abandon partial solutions.
ReAct (reason + act)	Interleave Thought → Action (tool call) → Observation → Thought… so the model grounds each step in real evidence and adapts to feedback.	per-step tool latency	Knowledge-intensive or environment-coupled tasks needing live data, calculation, or APIs.
Debate (CoD / GoD)	Multiple model instances propose, critique, and rebut, converging on a checked consensus. Chain-of-Debate is a roundtable; Graph-of-Debate is a non-linear support/rebut network.	N agents × rounds	High-value, bias-prone, or contested questions where one chain is too fragile.
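
As a rough illustration of the Self-Consistency row, the vote itself is only a few lines of Python. This is a sketch under stated assumptions, not the book's code: `sample_chain` is a hypothetical stand-in for one sampled CoT call, stubbed here with a seeded random generator that returns the correct answer with probability p and otherwise one of several scattered wrong answers.

```python
import random
from collections import Counter

def sample_chain(rng: random.Random, p: float = 0.55) -> str:
    """Hypothetical stub for one sampled CoT chain; returns only its final answer."""
    if rng.random() < p:
        return "42"                                  # the correct answer in this toy
    return rng.choice(["40", "41", "43", "44"])      # wrong answers scatter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of chains that agree with it."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)

def self_consistency(k: int, rng: random.Random) -> tuple[str, float]:
    """Sample k independent chains, then vote on their final answers."""
    return majority_vote([sample_chain(rng) for _ in range(k)])

if __name__ == "__main__":
    answer, agreement = self_consistency(k=15, rng=random.Random(0))
    print(answer, round(agreement, 2))
```

Returning the agreement share alongside the answer is a design choice of this sketch: a low share may be a cheap signal that the question needs a costlier rung.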

Chain-of-Thought is the foundation — the book calls it the agent's "internal monologue." Its value is transparency as much as accuracy: a difficult problem becomes a series of simple, inspectable steps, which is exactly what makes agent actions auditable. The chapter's worked CoT example is an information-retrieval agent given a fixed five-step procedure (analyze the question → formulate retrieval queries → simulate retrieval and self-check → synthesize → review and refine) answering "what is the difference between classical and quantum computers, and name one application." The agent's visible thoughts walk through identifying key terms, drafting sub-queries, surfacing concepts (bits vs. qubits, superposition, entanglement), synthesizing, and a final review pass before emitting the answer.

Tree-of-Thoughts generalizes the single chain into a branching search: the model can fork into several candidate continuations, evaluate how promising each partial state is, backtrack out of dead ends, and select the strongest leaf. Self-correction is the inner loop that makes this work — the agent critiques its own draft against the original requirements, finds gaps, and rewrites. The chapter's self-correction example takes a flat social-media draft ("We have new products. They are eco and tech.") and runs a five-step critique that diagnoses weak verbs, missing eco-emphasis, and a soft call-to-action, then rewrites it into a punchy 148-character post. That is reflection (lesson 06) deployed as a reasoning primitive.
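
One simple way to realize the fork, evaluate, and prune steps described above is a beam search, sketched below on a toy puzzle (reach a target number from a start number using +3, *2, and -1). Everything here is an illustrative assumption rather than the chapter's code: `score` is a hypothetical evaluator standing in for a model or rubric that rates partial states, and dropping low-scoring states from the beam plays the role of backing out of dead ends.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    value: int
    path: tuple[str, ...]          # the operations applied so far

OPS = {"+3": lambda x: x + 3, "*2": lambda x: x * 2, "-1": lambda x: x - 1}

def score(node: Node, target: int) -> int:
    """Hypothetical partial-state evaluator: closer to the target is better."""
    return -abs(target - node.value)

def tree_of_thoughts(start: int, target: int, depth: int, beam: int) -> Optional[Node]:
    frontier = [Node(start, ())]
    for _ in range(depth):
        # Fork: expand every kept state by every available operation.
        children = [Node(fn(n.value), n.path + (name,))
                    for n in frontier for name, fn in OPS.items()]
        for child in children:
            if child.value == target:          # exact check on a finished branch
                return child
        # Evaluate and prune: keep only the `beam` most promising partial states.
        children.sort(key=lambda n: score(n, target), reverse=True)
        frontier = children[:beam]
    return None                                # depth budget spent, nothing verified

print(tree_of_thoughts(1, 14, depth=4, beam=1))   # greedy single line
print(tree_of_thoughts(1, 14, depth=4, beam=3))   # branching search
```

With a beam of 1 the search commits greedily and finds nothing within four steps; with a beam of 3 it keeps an alternative alive long enough to reach 14.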

ReAct is the rung that fuses reasoning with the tool use you built in lesson 07. Instead of reasoning in a vacuum, the agent runs a Thought → Action → Observation loop: it thinks, picks an action from a fixed set (search, query a knowledge base, call an API, or "finish"), observes the result, and thinks again — repeating until it decides it has the answer. The book notes you can tune the thinking frequency: fact-checking inserts a thought before every step; navigation-style tasks can think less often. ReAct is strictly more robust than linear CoT for tasks that need repeated environment interaction, because every step is anchored to a real observation rather than to the model's prior.
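
The Thought → Action → Observation loop can be sketched as follows. This is a minimal illustration under loud assumptions: `policy` is a scripted, hypothetical stand-in for the model's next-step decision, `FACTS` and the `lookup` tool are invented toy data, and `max_steps` is the thinking budget.

```python
FACTS = {"Avon": "5200", "Brill": "3100"}           # toy knowledge base (invented)
TOOLS = {"lookup": lambda key: FACTS.get(key, "unknown")}

def policy(question: str, scratchpad: list) -> tuple[str, str, str]:
    """Hypothetical stand-in for the model: returns (thought, action, argument)."""
    seen = {arg: obs for _, action, arg, obs in scratchpad if action == "lookup"}
    for city in ("Avon", "Brill"):
        if city not in seen:
            return (f"I still need the population of {city}.", "lookup", city)
    diff = int(seen["Avon"]) - int(seen["Brill"])
    return ("I have both numbers, so I can subtract.", "finish", str(diff))

def react(question: str, max_steps: int = 5):
    scratchpad = []                                  # (thought, action, arg, observation)
    for _ in range(max_steps):
        thought, action, arg = policy(question, scratchpad)
        if action == "finish":
            return arg, scratchpad
        observation = TOOLS[action](arg)             # act, then observe
        scratchpad.append((thought, action, arg, observation))
    return None, scratchpad                          # budget hit before an answer

question = "How many more people live in Avon than in Brill?"
answer, steps = react(question)
print(answer, len(steps))
# prints: 2100 2
```

Each scratchpad entry pairs a thought with the observation that followed it, so the final answer can be traced back to the tool results it depends on.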

Debate (Chain-of-Debate from Microsoft, and the richer Graph-of-Debate) is the multi-agent rung, building on lesson 09. Several models argue — propose, critique, rebut — and converge on an answer backed by agreement, search verification, or accepted fact. It trades a lot of compute for reduced bias and a transparent reasoning record, and is the bridge from single-agent reasoning to multi-agent collaboration.

3 · Watching the scaling law: best-of-k with a verifier

The inference scaling law is easy to state and easy to get wrong, so let's make it numeric. Suppose a base model answers a math task correctly with probability p = 0.55 on any single attempt. Run it once and you get 55% accuracy. Now sample k independent chains and select among them — but how you select changes everything:

Majority vote (self-consistency). You have no oracle, so you trust the most common answer. This helps only when correct answers cluster and wrong ones scatter.
Best-of-k with a verifier. You have a checker — unit tests, a math evaluator, a rubric model — that can recognize a correct answer when it sees one. Then you succeed if any of the k attempts is correct: accuracy = 1 - (1-p)^k.

Worked number. With p = 0.55 and a perfect verifier, best-of-4 gives 1 - 0.45^4 = 1 - 0.041 = 0.959, about 96%, for 4× the inference cost. Suppose a larger model scores p = 0.80 in a single pass and costs, say, 6× the small model per call. Which is the cheaper route to 95% accuracy? The small model at best-of-4 costs 4 × 1 = 4 units and hits 96%; the big model at one pass costs 6 units and hits only 80%. Setting aside the cost of running the verifier itself, the small model with a verifier is both cheaper and more accurate. That is the inference scaling law in one comparison, and it is why "spend thinking time" beats "buy a bigger model" for verifiable tasks.
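
The arithmetic in the worked number can be reproduced in a few lines. This is a small sketch using the text's own figures (p = 0.55, a 6× cost ratio); the function names are mine, and the model assumes independent samples and a perfect verifier.

```python
def best_of_k(p: float, k: int) -> float:
    """Accuracy of k independent samples with a perfect verifier: 1 - (1 - p)^k."""
    return 1 - (1 - p) ** k

def min_samples(p: float, target: float) -> int:
    """Smallest k whose best-of-k accuracy reaches the target."""
    if not (0 < p < 1 and 0 < target < 1):
        raise ValueError("p and target must lie strictly between 0 and 1")
    k = 1
    while best_of_k(p, k) < target:
        k += 1
    return k

SMALL_COST, BIG_COST = 1, 6        # cost per call, in the text's units
k = min_samples(0.55, 0.95)
print(k, round(best_of_k(0.55, k), 3), k * SMALL_COST, "vs", BIG_COST)
# prints: 4 0.959 4 vs 6
```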

The catch the widget makes visible

Best-of-k only reaches that ceiling if the verifier is good. Drop verifier precision and you start selecting confidently-wrong answers, and the curve sags far below 1-(1-p)^k. Majority vote, with no verifier at all, climbs far more slowly: under the widget's independent-chain model it needs many more samples to reach the same accuracy, and it improves at all only when p is above 0.5. The lesson: extra samples are cheap; the selector is the scarce resource.

Widget: Inference scaling. Does more thinking time beat a bigger model?

A small model has single-pass accuracy p. Sample k reasoning chains and select among them. With a perfect verifier the small model reaches 1-(1-p)^k; an imperfect verifier (precision < 1) sometimes picks a wrong chain; majority vote has no verifier at all. The dashed line is a bigger model run once. Find the point where k× the small model overtakes the big one.

Default settings: small-model p = 0.55, samples k = 4, verifier precision = 1.00, big-model accuracy = 0.80.
Readout at the defaults: Small × k = 0.96; Big × 1 = 0.80; winner = small; compute = 4 vs 6 units.

Core JS:

// p: single-pass accuracy of the small model; k: samples; v: verifier precision

// binom(n, r): number of ways to choose r items from n (multiplicative formula)
function binom(n, r){
  let c = 1;
  for (let i = 1; i <= r; i++) c = c * (n - r + i) / i;
  return c;
}

function smallAcc(p, k, v, mode){
  if (mode === 'verifier'){
    const anyCorrect = 1 - Math.pow(1 - p, k);   // at least one chain is right
    // Simplified model: when a correct chain exists, the verifier selects it
    // with probability v; otherwise it returns a wrong chain.
    // v = 1 recovers the 1-(1-p)^k ceiling; v < 1 drags the ceiling down.
    return anyCorrect * v;
  }
  // majority vote: no oracle. Each chain is an independent Bernoulli(p);
  // the vote counts as correct only if more than k/2 chains reach the right
  // answer (binomial tail), so ties on even k count as failures.
  let acc = 0;
  for (let i = Math.floor(k/2) + 1; i <= k; i++) acc += binom(k, i) * p**i * (1-p)**(k-i);
  return acc;
}

4 · The engineering rule: no verifier, no reasoning

The single most important sentence in this chapter, restated for engineers: a reasoning method without a verifier is expensive guessing. ToT that explores ten branches but cannot tell a good branch from a bad one wastes ten times the compute to return the model's original guess. Debate without a way to score arguments converges on whichever agent is most fluent, not most correct. Self-consistency on a task with no canonical answer just amplifies the dominant bias. So the pattern shape is always the same four steps:

┌─────────────┐   ┌──────────────┐   ┌───────────────┐   ┌────────────────────┐
│ 1 CHOOSE    │   │ 2 GENERATE   │   │ 3 GROUND      │   │ 4 SELECT / REVISE  │
│ strategy by │──▶│ k hypotheses │──▶│ check each via│──▶│ verifier · test ·  │──▶ answer
│ risk × cost │   │ or branches  │   │ tools/evidence│   │ rubric · human     │    + evidence
└─────────────┘   └──────────────┘   └───────────────┘   └────────────────────┘
       ▲                                                            │
       └──────────────── stop when budget hit or verifier passes ◀──┘
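
The four steps and the two stop conditions can be sketched as one small loop. This is a hedged illustration, not the book's implementation: `BUDGET_BY_RISK` is a hypothetical policy table, the candidates are supplied as a plain iterable where a real agent would sample them from a model, and the verifier is any callable that returns True or False.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional

BUDGET_BY_RISK = {"low": 1, "medium": 4, "high": 8}   # hypothetical policy table

@dataclass
class Outcome:
    answer: Optional[int]
    attempts: int
    evidence: list[str] = field(default_factory=list)

def reason(candidates: Iterable[int],
           verify: Callable[[int], bool],
           risk: str = "medium") -> Outcome:
    budget = BUDGET_BY_RISK[risk]                 # 1 CHOOSE: budget follows risk
    outcome = Outcome(answer=None, attempts=0)
    for candidate in candidates:                  # 2 GENERATE: next hypothesis
        if outcome.attempts >= budget:
            break                                 # stop rule: budget hit
        outcome.attempts += 1
        passed = verify(candidate)                # 3 GROUND: check against evidence
        outcome.evidence.append(f"candidate={candidate} passed={passed}")
        if passed:                                # 4 SELECT: keep the verified one
            outcome.answer = candidate
            break                                 # stop rule: verifier passes
    return outcome

def is_root_of_144(x: int) -> bool:
    return x * x == 144

print(reason([10, 11, 12, 13], is_root_of_144, risk="medium"))
```

Returning the evidence list together with the answer means a caller can see why the loop stopped: a verified candidate, or an exhausted budget with every rejected candidate on record.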

Where do verifiers come from? The book points at RLVR — Reinforcement Learning from Verifiable Rewards — as the reason modern "reasoning models" are good: they are trained on problems with checkable answers (math, code) so the model learns to generate long, self-correcting, backtracking reasoning traces, rewarded by an automatic checker rather than a human. The same insight applies at your layer: prefer tasks and decompositions where a step's correctness can be checked cheaply. Concretely, a verifier can be:

executable test

Code: run unit tests / a sandbox. Math: evaluate the expression. The gold standard — objective, cheap, the basis of Program-Aided LMs (Google ADK's BuiltInCodeExecutor), which offloads exact computation to deterministic code.

rubric / judge model

A second model scores candidates against explicit criteria. Cheaper than human, noisier than tests — its precision is the "verifier precision" slider above.

search / fact check

Ground a claim against retrieved evidence or multi-model consensus (the "verifiable" in Graph-of-Debate). Ties reasoning back to lesson 16's RAG.

human review

Most precise, most expensive. Reserve for high-risk, low-confidence selections — the escalation path from lesson 15.
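
As an illustration of the cheapest verifier type, the executable test, here is a minimal arithmetic checker. It is a sketch of the general idea of offloading exact computation to deterministic code, not Google ADK's BuiltInCodeExecutor; the function names and the restriction to +, -, *, / are assumptions made to keep the example small and safe.

```python
import ast
import operator

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str) -> float:
    """Evaluate +, -, *, / over numeric literals; reject everything else."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression element: {type(node).__name__}")
    return walk(ast.parse(expr, mode="eval").body)

def verify_claim(expr: str, claimed: float, tol: float = 1e-9) -> bool:
    """True if the model's claimed value matches the deterministic result."""
    return abs(evaluate(expr) - claimed) <= tol

print(verify_claim("17 * 23 + 4", 395))   # prints: True
print(verify_claim("17 * 23 + 4", 385))   # prints: False
```

Walking the syntax tree instead of calling `eval` keeps the checker from executing anything other than arithmetic, which matters when the expression comes from model output.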

5 · Externalize artifacts, not raw chain text

Appendix F asks six leading assistants (Gemini, ChatGPT, Grok, Kimi, Claude, DeepSeek) to describe their own reasoning, and they converge on a near-identical staged pipeline: parse the prompt → activate/retrieve relevant knowledge → choose a reasoning mode → simulate step-by-step thinking → generate → review and refine. Kimi's transcript is the cleanest illustration — asked whether 3⁴ or 4³ is larger, it shows stages 0-5: tokenize, classify the task as integer-power comparison, choose the cheapest safe strategy (direct computation, switching to a log comparison only if the exponents were large), retrieve 3³=27 and 4³=64, compute 3⁴=81, cross-check via 81 mod 5 = 1 vs 64 mod 5 = 4, and emit a confidence score of 0.99 with a noted edge case. Several models (Claude, DeepSeek) explicitly caveat that this is simulated reasoning — pattern prediction "walking along reasoning trails laid down in the training data," not human cognition.
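
The arithmetic in the Kimi transcript is easy to replay. The sketch below recomputes the direct comparison, the log comparison mentioned as the fallback for large exponents, and the mod-5 residues; the function name and the returned field names are my own.

```python
import math

def compare_powers(a: int, m: int, b: int, n: int) -> dict:
    """Compare a^m with b^n directly and via logarithms, and report residues mod 5."""
    left, right = a ** m, b ** n
    direct = "left" if left > right else "right" if right > left else "equal"
    # Log comparison: a^m > b^n exactly when m*ln(a) > n*ln(b), for a, b > 0.
    log_left, log_right = m * math.log(a), n * math.log(b)
    via_logs = "left" if log_left > log_right else "right" if log_right > log_left else "equal"
    return {
        "left": left,
        "right": right,
        "direct": direct,
        "via_logs": via_logs,
        # Differing residues show only that the two values are not equal;
        # they do not say which one is larger.
        "residues_mod_5": (left % 5, right % 5),
        "methods_agree": direct == via_logs,
    }

result = compare_powers(3, 4, 4, 3)
print(result["left"], result["right"], result["direct"], result["residues_mod_5"])
# prints: 81 64 left (1, 4)
```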

The engineering takeaway is sharp: reasoning is an internal control resource, not a monologue to dump on the user verbatim. Raw chain text is unstructured, sometimes confabulated, and unsafe to expose. What you store and surface are artifacts — the structured by-products of reasoning that users and evaluators can actually inspect and score:

Decomposition — the subproblems the task was split into.
Assumptions — what the agent took as given (so a human can challenge them).
Search branches — which alternatives were tried and why each was abandoned.
Checks performed — the tests, queries, or rubric scores, and their results.
Confidence — a calibrated score (Kimi's 0.99) and the noted edge cases.
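
One possible shape for such an artifact record is sketched below. The class and field names are assumptions for illustration, not a schema from the book; the sample values restate the Kimi example from the previous paragraph, except the edge-case string, which is a placeholder because the source does not say what the noted edge case was.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Check:
    name: str
    passed: bool
    detail: str = ""

@dataclass
class ReasoningArtifact:
    decomposition: list[str]
    assumptions: list[str]
    branches: list[dict]              # each: {"option", "status", "reason"}
    checks: list[Check]
    confidence: float
    edge_cases: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize for the trace store; asdict() also converts nested Check objects."""
        return json.dumps(asdict(self), indent=2)

artifact = ReasoningArtifact(
    decomposition=["compute 3^4", "compute 4^3", "compare"],
    assumptions=["exponents are small enough for direct computation"],
    branches=[
        {"option": "direct computation", "status": "chosen", "reason": "cheapest safe strategy"},
        {"option": "log comparison", "status": "not needed", "reason": "only for large exponents"},
    ],
    checks=[Check("direct", True, "81 > 64"), Check("mod 5", True, "residues 1 and 4 differ")],
    confidence=0.99,
    edge_cases=["placeholder: the source does not state the edge case"],
)
print(artifact.to_json())
```

Because the record is plain JSON, it can be stored in the trace, diffed between runs, and scored by an evaluator without parsing free-form chain text.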

These artifacts are what lesson 21 (Evaluation) scores — the trajectory, not just the final answer — and what lesson 14's recovery loop replays. Storing them is the difference between an agent you can debug and one you can only restart.

6 · Running example: the coding/research agent thinks

Take our running agent two ways. As a debugger: a test is failing. The agent does not patch blindly. It forms two hypotheses (H1: an off-by-one in the loop bound; H2: a null returned by the upstream fetch), writes a targeted assertion for each, and executes both. H1's assertion passes, which rules out the loop bound; H2's assertion fails, which confirms the null. The agent patches the upstream guard, reruns the suite green, and stores {hypotheses, the two test results, the chosen fix, confidence} in the trace. The executable test is the verifier; the artifact is replayable. That is ReAct (think-act-observe) plus self-correction, and it stops the moment the suite passes.
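
The debugger pass can be sketched end to end on a toy bug. Everything below is invented for illustration: `fetch_prices`, the order ids, and the two probes are hypothetical, and the confidence label is a placeholder, not a calibrated score.

```python
def fetch_prices(order_id):
    """Toy upstream call: returns None for an unknown order."""
    return {"A1": [3, 4, 5]}.get(order_id)

def total(order_id):
    prices = fetch_prices(order_id)
    return sum(prices[i] for i in range(len(prices)))   # raises TypeError on None

def total_patched(order_id):
    prices = fetch_prices(order_id) or []                # the upstream guard
    return sum(prices)

def run_probe(probe):
    """Run one targeted assertion and report pass/fail without raising."""
    try:
        probe()
        return "pass"
    except (AssertionError, TypeError) as exc:
        return f"fail: {type(exc).__name__}"

def probe_h1():   # H1: off-by-one in the loop bound
    assert total("A1") == 12

def probe_h2():   # H2: upstream fetch returns None
    assert fetch_prices("ZZ") is not None

def suite(fn):
    """The verifier: the full test suite, run against a candidate implementation."""
    return fn("A1") == 12 and fn("ZZ") == 0

results = {"H1": run_probe(probe_h1), "H2": run_probe(probe_h2)}
chosen = next(h for h, r in results.items() if r.startswith("fail"))
green = suite(total_patched)
trace = {
    "hypotheses": {"H1": "off-by-one in loop bound", "H2": "null from upstream fetch"},
    "probe_results": results,
    "chosen_fix": chosen,
    "suite_green": green,
    "confidence": "high" if green else "low",            # placeholder label
}
print(trace)
```

The probe that fails is the one that identifies the fault, and the stored trace is enough to replay the decision later without rerunning the model.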

As a researcher: the book's concrete template is Google's open-source DeepSearch agent (the gemini-fullstack-langgraph-quickstart repo), a LangGraph state machine whose nodes are exactly the reasoning loop — generate_query → web_research → reflection → finalize_answer — with a conditional edge after reflection that loops back to more searching if a knowledge gap remains, or proceeds to the answer if not:

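from langgraph.graph import StateGraph, START, END

# OverallState, Configuration, the four node functions, and the two routing
# functions (continue_to_web_research, evaluate_research) are defined elsewhere
# in the repo and are not shown in this excerpt.
builder = StateGraph(OverallState, config_schema=Configuration)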
builder.add_node("generate_query",  generate_query)
builder.add_node("web_research",    web_research)
builder.add_node("reflection",      reflection)       # <- the verify/self-correct step
builder.add_node("finalize_answer", finalize_answer)

builder.add_edge(START, "generate_query")
builder.add_conditional_edges("generate_query", continue_to_web_research, ["web_research"])
builder.add_edge("web_research", "reflection")
# reflection decides: knowledge gap remains -> search again, else -> finalize
builder.add_conditional_edges("reflection", evaluate_research, ["web_research", "finalize_answer"])
builder.add_edge("finalize_answer", END)

graph = builder.compile(name="pro-search-agent")

The reflection node is the verifier: it inspects what was found, names the remaining gap, and either spends more thinking budget (loop) or stops. The book also notes the advanced MASS framework (Multi-Agent System Search), which automates this design: optimize each agent's prompt in isolation, then search for the best interaction topology (it found that for the MBPP coding benchmark, "iterative self-correction combined with external verification" was optimal), then globally optimize the system prompts. MASS is the meta-lesson — the reasoning structure itself is a thing you can tune.
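
The loop-or-finalize decision made after reflection can be sketched in plain Python, without LangGraph. This is not the repo's code: the state keys, the toy `web_research` that closes one gap per pass, and the `max_loops` cap are assumptions chosen to show the control flow.

```python
def web_research(state: dict) -> dict:
    """Toy search pass: closes the first open gap, if any."""
    if not state["gap"]:
        return state
    return {**state, "findings": state["findings"] + [state["gap"][0]]}

def reflection(state: dict) -> dict:
    """Verify step: name what is still missing and count the loop."""
    gap = [t for t in state["required"] if t not in state["findings"]]
    return {**state, "gap": gap, "is_sufficient": not gap, "loops": state["loops"] + 1}

def evaluate_research(state: dict, max_loops: int) -> str:
    """Conditional edge: stop when the gap is closed or the budget is spent."""
    if state["is_sufficient"] or state["loops"] >= max_loops:
        return "finalize_answer"
    return "web_research"

def run(required: list[str], max_loops: int = 3) -> dict:
    state = {"required": required, "findings": [], "gap": list(required),
             "is_sufficient": False, "loops": 0}
    node = "web_research"
    while node == "web_research":
        state = reflection(web_research(state))
        node = evaluate_research(state, max_loops)
    return state

done = run(["pricing", "latency"])
print(done["is_sufficient"], done["loops"])          # prints: True 2
capped = run(["a", "b", "c", "d", "e"], max_loops=3)
print(capped["is_sufficient"], capped["gap"])        # prints: False ['d', 'e']
```

The second run shows why the cap matters: the loop ends on budget with the remaining gap recorded, so the final answer can state what was not verified.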

Failure modes

Length ≠ quality. Longer output mistaken for better reasoning; the model rambles to look thorough while the answer is unchanged.
Search without a selector. ToT or debate with no verifier — N× compute to return the original guess; the fluent branch wins, not the correct one.
Verifier rot. A weak judge model (precision < 1) confidently selects wrong candidates; the best-of-k ceiling silently collapses.
Hidden artifacts. Reasoning kept only as raw chain text — un-replayable, un-scorable, sometimes confabulated, unsafe to expose.
Budget blowout. No stop rule, so the reason-act loop or debate runs until the context window or the cost ceiling, not until the task is solved.
Reasoning on a non-verifiable task. Self-consistency on a subjective question just amplifies the dominant bias.

Implementation checklist

What is the cheapest rung (direct / CoT / self-consistency / ToT / ReAct / debate) that fits this task's difficulty and risk?
What is the verifier — test, rubric, search, or human — and how precise is it?
What is the thinking budget (max samples k, max tool steps, max debate rounds, cost cap)?
What is the explicit stop rule — verifier passes, budget hit, or confidence threshold?
Which artifacts (decomposition, assumptions, branches, checks, confidence) get stored for replay and eval?
What gets exposed to the user vs. kept internal?
How is reasoning quality measured downstream (lesson 21)?

Where this points next

We can now make an agent think harder on demand — and crucially, decide when not to. But more reasoning power is also more ways to go wrong: a ReAct loop that calls a destructive tool, a debate that talks itself into an unsafe conclusion, a self-correction step that "improves" an answer into a policy violation. Reasoning expands the action space; the next job is to bound it. Lesson 20 (Guardrails and safety) wraps the reasoning loop in layered constraints — input checks, tool permissions, output filters, review, and monitoring — so that thinking harder never means acting more dangerously. After that, lesson 21 (Evaluation and monitoring) scores the reasoning trajectory — using exactly the artifacts we agreed to externalize here.

Takeaway

Reasoning is controlled extra computation spent to improve a hard decision, governed by the inference scaling law: for verifiable tasks, a small model run k times with a good verifier beats a bigger model run once — more thinking, not more parameters. The techniques form a cost ladder — direct → CoT → self-consistency → ToT → ReAct → debate — and the engineering rule is non-negotiable: a reasoning method without a verifier is expensive guessing, so always close the loop with a test, rubric, search, or human, and stop when the verifier passes or the budget is spent. Reasoning is an internal control resource, not a monologue: store and expose structured artifacts (decomposition, assumptions, branches, checks, confidence), not raw chain text. The book's living example is Google's LangGraph DeepSearch agent, whose reflection node is the verify-and-decide step that loops or finalizes.

Interview prompts

What is the inference scaling law, and when does a small model beat a bigger one? (§1, §3 — for verifiable tasks, spending inference compute on k samples plus a verifier gives accuracy 1−(1−p)^k, which can exceed a bigger model's single-pass accuracy at lower total cost; the lever is the inference algorithm, not parameters.)
Order CoT, ToT, ReAct, self-consistency, and debate by cost and say when each pays. (§2 — CoT (cheap, multi-step logic) → self-consistency (k votes, fixes one-off slips) → ToT (search + backtrack, planning) → ReAct (tool-grounded, live data) → debate (N agents, high-value bias-prone questions); stop at the cheapest rung that solves the task.)
Why is ReAct more robust than plain Chain-of-Thought? (§2 — ReAct interleaves Thought→Action→Observation so each step is anchored to a real environment observation and the plan adapts to feedback, rather than reasoning entirely from the model's prior.)
"We added Tree-of-Thoughts and accuracy didn't move." Diagnose it. (§4 — almost certainly no usable verifier: branching explores alternatives but cannot tell a good branch from a bad one, so it returns the original guess at N× the cost; add an executable test, rubric, or search-based selector.)
How does verifier precision change the best-of-k benefit? (§3, §4 — a perfect verifier reaches 1−(1−p)^k; as precision drops it starts selecting confidently-wrong candidates and the ceiling collapses toward (or below) majority-vote; the selector is the scarce resource, not the samples.)
What reasoning should an agent expose vs. keep internal, and why? (§5 — expose structured artifacts (decomposition, assumptions, branches tried, checks and their results, calibrated confidence); keep raw chain text internal because it is unstructured, sometimes confabulated, unsafe, and the models themselves describe it as simulated pattern-prediction.)
What is RLVR and why does it matter for agent reasoning? (§4 — Reinforcement Learning from Verifiable Rewards trains reasoning models on checkable problems (math, code) so they learn long, self-correcting, backtracking traces rewarded by an automatic checker; it's why modern reasoning models are strong and why you should prefer cheaply-checkable decompositions in your own system.)