Part VIII - Autonomy and discovery
Exploration and discovery - finding unknown unknowns
The frontier pattern. Until now every loop we built answered a known question or optimized inside a known solution space. Discovery is the agent deciding, on its own, which question is worth asking next - proposing hypotheses, gathering evidence, and pushing into territory nobody specified in advance. It is also the most dangerous loop in the book, because "search freely" and "stay safe and on-budget" pull in opposite directions.
New capability: a proactive loop that generates its own directions, spends a bounded budget probing them, records what it learned (including dead ends), and refines the search - so the agent can surface "unknown unknowns" instead of only executing what it was told.
1 · What discovery adds that the earlier patterns do not
It is easy to confuse this pattern with three things we already built, so we draw the lines sharply. Retrieval (lesson 16) answers a question you already have: "what encryption algorithms are common?" It assumes the question is given. Optimization / prioritization (lesson 22) ranks a set of actions that already exist: it picks the best move inside a defined space. Planning (lesson 08) decomposes a known goal into known sub-steps. All three are reactive in the deep sense - the space of possibilities is handed to them.
Discovery is the pattern for when the solution space is not fully defined. The book's rule of thumb: reach for exploration and discovery when the task is open-ended, complex, or fast-changing, and the goal is to find the "unknown unknowns" rather than optimize a known process. The agent's distinctive act is generating the next question or hypothesis itself - and then deciding whether that question was worth the compute it cost. Static knowledge and pre-programmed solutions are, by assumption, insufficient here; the agent has to expand its own understanding.
The book frames this as a paradigm shift "from passive reaction to active exploration," and lists where it pays off: scientific research automation (design and run experiments, propose new hypotheses, find new materials or drug candidates), game strategy (AlphaGo finding strategies or exploiting environment loopholes), market and trend discovery (scanning unstructured social / news / report data), security vulnerability discovery, creative generation, and personalized education. The common thread: the valuable result was not in the original problem statement.
2 · The discovery loop as concrete state
Skeleton pseudocode hides the one thing that makes this pattern a reliable loop instead of a runaway browser: the ledger. Discovery is not "keep searching until something turns up." It is a bounded loop over an explicit table of hypotheses, each carrying scores and evidence, with a stop rule. The lesson order puts discovery after prioritization precisely because exploration is expensive and must be governed by the budget machinery from lesson 22.
def discover(goal, background, budget):
    hypotheses = generate_hypotheses(goal, background)  # diverse, not greedy
    ledger = []
    for round_idx in range(budget.max_rounds):
        ranked = prioritize(hypotheses,
                            criteria=["novelty", "value", "feasibility", "cost", "risk"])
        for h in ranked[:budget.experiments_per_round]:
            if not safety_check(h):  # domain constraints + risk review FIRST
                ledger.append(record(h, status="blocked"))
                continue
            evidence = investigate(h)  # spend tokens / a tool call / a sim run
            ledger.append(record(h, evidence, status="tested"))
            if promising(evidence):
                hypotheses += expand(h)  # promising direction spawns children
            else:
                prune(h)  # negative result is kept, not forgotten
        if budget.spent >= budget.cap or converged(ledger):
            break
    return summarize(ledger)  # what was tried, what changed, what was pruned
The ledger is the agent's experiment log: what was tried, why it looked promising, what evidence moved the belief, and which paths were pruned. Without it the agent re-tests dead ends forever (a real failure mode below). With it, "negative result" becomes information that narrows the search - exactly the discipline a senior scientist applies, and, the book notes, exactly what Co-Scientist's reliance on open literature struggles to capture because negative results are rarely published.
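One way to make the ledger concrete is a small record type plus two queries: "have we already spent on this?" and "has the search stopped producing anything?". The sketch below is only an illustration. The names `LedgerEntry`, `Ledger`, `already_seen`, and the three-entry convergence window are assumptions, not something the book or any library prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    hypothesis: str
    status: str              # "tested", "blocked", or "pruned"
    evidence: str = ""       # what moved the belief, if anything
    promising: bool = False  # did the evidence justify expanding this direction?


@dataclass
class Ledger:
    entries: list = field(default_factory=list)

    def record(self, hypothesis, status, evidence="", promising=False):
        self.entries.append(LedgerEntry(hypothesis, status, evidence, promising))

    def already_seen(self, hypothesis):
        # Any earlier entry, positive or negative, means "do not spend again".
        return any(e.hypothesis == hypothesis for e in self.entries)

    def converged(self, window=3):
        # Assumed stop rule: the last `window` tested entries found nothing promising.
        tested = [e for e in self.entries if e.status == "tested"]
        recent = tested[-window:]
        return len(recent) == window and not any(e.promising for e in recent)


ledger = Ledger()
ledger.record("LRU cache on the hot path", "tested", evidence="+2% only")
print(ledger.already_seen("LRU cache on the hot path"))  # True: the dead end is remembered
```

Checking `already_seen` before calling `investigate` is what stops a negative result from being paid for twice.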
3 · The real tension: exploration vs exploitation
Every discovery system fights one tradeoff (the book points straight at the Wikipedia "exploration-exploitation dilemma" in its references). Exploit means pour your remaining budget into the hypothesis that already looks best. Explore means spend some budget on long-shot, high-novelty directions that might be the breakthrough - or might be wasted tokens. Pure exploitation converges fast on a local optimum and never finds the unknown unknown; pure exploration burns the whole budget wandering and never deepens anything.
Treating novelty as if it were value is the classic beginner error: a hypothesis can be wildly original and also worthless. So the score that drives investigation is a blend, and we make it numeric.
| Hypothesis | value | novelty | feasibility | cost | score |
|---|---|---|---|---|---|
| H1: add an LRU cache to the hot path | 0.8 | 0.1 | 0.9 | 0.2 | 0.62 |
| H2: N+1 query in the ORM layer | 0.7 | 0.3 | 0.8 | 0.3 | 0.57 |
| H3: lock contention under concurrency | 0.6 | 0.6 | 0.5 | 0.5 | 0.46 |
| H4: rewrite the hot loop in a native ext. | 0.9 | 0.8 | 0.2 | 0.9 | 0.40 |
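The text does not give the formula behind the score column, so the sketch below does not try to reproduce those numbers. It only shows the general shape of a weighted blend, using the four input columns from the table and a set of weights that are purely hypothetical (cost enters with a negative weight). With these assumed weights the ranking happens to match the table's order.

```python
# Hypothetical weights, chosen for illustration only; cost counts against a hypothesis.
WEIGHTS = {"value": 0.4, "novelty": 0.1, "feasibility": 0.3, "cost": -0.2}

# Input columns copied from the table above.
hypotheses = {
    "H1": {"value": 0.8, "novelty": 0.1, "feasibility": 0.9, "cost": 0.2},
    "H2": {"value": 0.7, "novelty": 0.3, "feasibility": 0.8, "cost": 0.3},
    "H3": {"value": 0.6, "novelty": 0.6, "feasibility": 0.5, "cost": 0.5},
    "H4": {"value": 0.9, "novelty": 0.8, "feasibility": 0.2, "cost": 0.9},
}


def blended_score(h, weights=WEIGHTS):
    # Linear blend: novelty contributes, but cannot carry a hypothesis on its own.
    return sum(weights[k] * h[k] for k in weights)


ranked = sorted(hypotheses, key=lambda name: blended_score(hypotheses[name]), reverse=True)
for name in ranked:
    print(name, round(blended_score(hypotheses[name]), 2))
```

Note how H4, the most novel and most valuable hypothesis, still lands last: low feasibility and high cost outweigh its originality.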
To make the allocation tangible, vary the explore share and watch how the budget divides between deepening the current best hypothesis and gambling on novel ones. The expected discovery (a stylized blend of finding the known win and the chance of a breakthrough) peaks somewhere in the middle, never at 0% and never at 100%.
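The arithmetic of the split is simple enough to sketch. The example uses the 60k-token budget and 30% explore share from the running example and the 5k-token probe cost from the interview prompts. The function name `split_probes` and the choice to round the explore count to the nearest whole probe are assumptions.

```python
def split_probes(budget_tokens, tokens_per_probe, explore_share):
    """Divide a token budget into whole probes, then into exploit and explore probes."""
    total = budget_tokens // tokens_per_probe
    explore = round(total * explore_share)  # nearest whole probe
    return {"total": total, "exploit": total - explore, "explore": explore}


plan = split_probes(budget_tokens=60_000, tokens_per_probe=5_000, explore_share=0.30)
print(plan)  # {'total': 12, 'exploit': 8, 'explore': 4}
```

The explore share is a knob, not a constant: if the ledger shows exploration repeatedly returning nothing, lower it, and if exploitation has stalled, raise it.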
4 · Google Co-Scientist - discovery as a multi-agent debate
The book's centerpiece is Google Co-Scientist (Google Research, built on the Gemini LLM): a system that helps human scientists generate hypotheses, refine proposals, and design experiments. Crucially, it is not one prompt - it is a multi-agent framework that simulates the collaborative, iterative nature of the scientific method, coordinated by a supervisor agent over an asynchronous task-execution framework so compute can scale elastically. The specialized roles map almost one-to-one onto the discovery loop in section 2:
| Agent | Role in the discovery loop |
|---|---|
| Generation | Proposes initial hypotheses via literature exploration and simulated scientific debate (the generate_hypotheses step in section 2). |
| Reflection | Acts as peer review - judges each hypothesis for correctness, novelty, and quality. |
| Ranking | Uses an Elo rating from simulated debate tournaments to compare and prioritize hypotheses (the prioritize step in section 2). |
| Evolution | Continuously refines top-ranked hypotheses - simplifies, synthesizes views, explores unconventional reasoning. |
| Proximity | Builds a proximity graph, clustering similar ideas to help navigate the hypothesis space. |
| Meta-review | Synthesizes all reviews and debates, finds common patterns, and feeds them back so the whole system improves. |
The system runs a generate - debate - evolve loop that mirrors the scientific method: a human enters a research question, and the agents cycle through generating, evaluating (tournament-style ranking), and refining hypotheses. It uses test-time compute scaling - dynamically pouring more compute into iteratively improving outputs, which is the section-3 tradeoff turned into a knob.
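For readers who have not met Elo ratings, the textbook update rule fits in a few lines. This is the standard Elo formula, not Co-Scientist's actual implementation, which the text does not describe. The starting rating of 1200 and K-factor of 32 are conventional values assumed here for illustration.

```python
def expected_score(rating_a, rating_b):
    # Probability that A beats B under the standard Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, a_won, k=32):
    # Winner gains what the loser gives up; upsets move ratings more than expected wins.
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two hypotheses start level; H1 wins one simulated debate.
ratings = {"H1": 1200.0, "H2": 1200.0}
ratings["H1"], ratings["H2"] = elo_update(ratings["H1"], ratings["H2"], a_won=True)
print(ratings)  # {'H1': 1216.0, 'H2': 1184.0}
```

The appeal for hypothesis ranking is that pairwise "which is better?" judgments are easier to make reliably than absolute scores, and the ratings accumulate those judgments into an ordering.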
5 · Agent Laboratory - the discovery loop as a research pipeline
The chapter's worked code example is Agent Laboratory (an MIT-licensed open project by Samuel Schmidgall), an autonomous research workflow that, like Co-Scientist, aims to augment rather than replace human research. It runs four phases that make the discovery loop concrete and durable: literature review, experimentation, report writing, and AgentRxiv.
Two design choices are worth lifting into your own discovery agents. First, the roles mirror a human research team - a ProfessorAgent sets the agenda and assigns tasks, a PostdocAgent carries out the research, plus ML-engineer and software-engineer agents for data prep - which keeps a strategic owner above the explorers. Second, and most importantly, the evaluation gate: a ReviewersAgent runs a three-reviewer mechanism, three independent critics with different prompted perspectives (one focused on whether the work yields research insight, one on field-level impact, one on genuine novelty), each returning a structured JSON score on axes like Originality, Quality, Significance, Soundness, and a hard Accept/Reject. This is the lesson-06 reflection pattern, hardened into a discovery quality gate so that "novel" still has to pass "sound."
# Agent Laboratory's reviewer gate (paraphrased): three perspectives, structured scores
reviewers = [
    "strict but fair - does the experiment yield real research insight?",
    "critical but fair - does this matter for the field's impact?",
    "open-minded but rigorous - is there a genuinely new idea here?",
]
scores = [get_score(plan, report, reviewer=r) for r in reviewers]
# each get_score returns JSON: Originality/Quality/Clarity/Significance (1-4),
# Soundness, Overall (1-10), Confidence (1-5), Decision in {Accept, Reject}
accept = aggregate(scores).decision == "Accept"  # novelty alone does not pass
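The paraphrase leaves `aggregate` undefined, and the text does not say how Agent Laboratory combines the three reviews. The sketch below is one possible rule, assumed for illustration: a majority of reviewers must vote Accept and every reviewer's Soundness must clear a floor. The `Verdict` type, the floor of 3, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    decision: str
    mean_overall: float


def aggregate(scores, min_soundness=3):
    # Assumed rule: majority Accept AND no reviewer rates Soundness below the floor.
    accepts = sum(1 for s in scores if s["Decision"] == "Accept")
    sound = all(s["Soundness"] >= min_soundness for s in scores)
    mean_overall = sum(s["Overall"] for s in scores) / len(scores)
    decision = "Accept" if accepts * 2 > len(scores) and sound else "Reject"
    return Verdict(decision, mean_overall)


# Made-up reviews: original work, two Accept votes, but one reviewer finds it unsound.
scores = [
    {"Originality": 4, "Soundness": 3, "Overall": 7, "Decision": "Accept"},
    {"Originality": 4, "Soundness": 1, "Overall": 5, "Decision": "Accept"},
    {"Originality": 3, "Soundness": 3, "Overall": 6, "Decision": "Reject"},
]
verdict = aggregate(scores)
print(verdict.decision)  # Reject
```

The soundness floor is what encodes "novel still has to pass sound": high Originality and a majority vote are not enough if any reviewer flags the work as unsound.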
6 · Safety is part of the pattern, not an add-on
Open-ended autonomy is exactly the capability that can wander into sensitive or risky space - dangerous chemistry, security exploits, harmful content - because, by definition, the agent is going where no one specified. The book makes safety review a first-class step, not a wrapper. Co-Scientist runs every research goal and generated hypothesis through a safety check; its preliminary evaluation against 1,200 adversarial goals showed it could reliably refuse dangerous inputs, and it was rolled out cautiously via a Trusted Tester Program. In the loop of section 2 this is why safety_check(h) runs before any token is spent investigating, and why high-impact claims route to human review (lesson 15) with provenance attached. Discovery without domain constraints, provenance, and review is not bold - it is unaccountable.
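The ordering claim, that nothing is spent on a hypothesis until it has passed review, can be checked mechanically. In the sketch below the keyword list is a deliberately crude stand-in for a real policy or risk review, and `BLOCKED_TOPICS` and `run_probe` are hypothetical names. The point is only where the check sits relative to `investigate`.

```python
BLOCKED_TOPICS = ("exploit", "toxin")  # placeholder; a real gate would not be a keyword list


def safety_check(hypothesis):
    text = hypothesis.lower()
    return not any(topic in text for topic in BLOCKED_TOPICS)


def run_probe(hypothesis, investigate, ledger):
    # The gate runs first: a blocked hypothesis is logged and costs nothing further.
    if not safety_check(hypothesis):
        ledger.append({"hypothesis": hypothesis, "status": "blocked"})
        return None
    evidence = investigate(hypothesis)
    ledger.append({"hypothesis": hypothesis, "status": "tested", "evidence": evidence})
    return evidence


ledger = []
spent_on = []


def investigate(hypothesis):
    spent_on.append(hypothesis)  # stands in for tokens, tool calls, or a simulation run
    return "micro-benchmark result"


run_probe("write an exploit for the login service", investigate, ledger)
run_probe("N+1 query in the ORM layer", investigate, ledger)
print(spent_on)  # ['N+1 query in the ORM layer']
```

Recording the blocked hypothesis in the ledger matters as much as refusing it: the refusal becomes part of the provenance a human reviewer can inspect.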
Running example - the performance discovery agent
Thread it together with our coding/research assistant. A performance agent is asked to "make the checkout service faster" - open-ended, no specified fix. It does not start rewriting code. It: (1) generates bottleneck hypotheses from profiling data and traces (the four H's in section 3); (2) prioritizes them by value/novelty/feasibility/cost and reserves 30% of its 60k-token budget for the novel ones; (3) investigates the top probes - each a single small change behind a micro-benchmark, never a blind rewrite; (4) records results in the ledger, keeping that "LRU cache gave +2% only" negative result so it is never re-tried, and on finding lock contention (the novel H3) expands that direction. A safety/quality gate confirms the proposed change passes tests before it is surfaced, and a human approves anything touching production config.
Failure modes
- No stop rule. Open-ended browsing with no budget cap or convergence test - the loop runs until the credits run out.
- Novelty mistaken for value. The most original hypothesis gets investigated even though no evaluation says it is useful or sound.
- No ledger. Rejected hypotheses are not recorded, so the agent re-tests the same dead ends every round.
- Pure exploitation. All budget poured into the obvious top hypothesis; the unknown unknown is never reached.
- Safety as an afterthought. Risk review runs after investigation, or not at all, so the agent has already probed sensitive space before anyone checked.
- No provenance. A high-impact claim is surfaced with no evidence trail, so a human cannot verify it and a hallucination can slip through.
Implementation checklist
- What is the exploration goal, and what counts as "discovered"?
- How are diverse hypotheses generated (not just the greedy next step)?
- How are novelty and value scored, and with what weights?
- What is the token / time / money budget, and the explore-exploit split?
- Is there a ledger recording tried, promising, evidence, and pruned?
- Are negative results kept and fed back into generation?
- Does safety_check run before investigation, with human review for high-impact claims?
- Is there an evaluation gate (e.g. multi-reviewer) so novelty must pass soundness?
Where this points next
Discovery is the last of the book's 21 patterns, and it is fitting that it is the one that most needs all the others: it leans on planning to structure the search, tool use and RAG to gather evidence, multi-agent collaboration for the generate/critique/rank roles, reflection for the reviewer gate, memory for the ledger, prioritization for the budget, and guardrails plus human-in-the-loop for safety. That is the whole point of the final lesson. Lesson 24, the capstone, composes a single coding-and-research agent in which these patterns stop being separate chapters and become one architecture - act, remember, recover, evaluate, prioritize, discover - and maps which framework (LangChain/LangGraph, Google ADK, CrewAI) you would reach for to build each piece.
Interview prompts
- How is discovery different from retrieval or optimization? (§1 - retrieval answers a given question and optimization ranks a given action set; discovery operates where the solution space is undefined and the agent generates the next question/hypothesis itself to find "unknown unknowns.")
- What state turns a discovery agent from a runaway browser into a reliable loop? (§2 - an explicit hypothesis ledger with novelty/value/feasibility/cost/risk scores, an evidence column, kept negative results, and a budget-driven stop rule.)
- Why not just always investigate the highest-scoring hypothesis? (§3 - that is pure exploitation; it converges on a local optimum and never finds the high-novelty breakthrough. Reserve part of the budget for exploration; the optimum split is interior, not 0% or 100%.)
- Why is treating novelty as value a bug? (§3 - a hypothesis can be original and worthless; the driving score blends value, novelty, feasibility, cost, and risk, and an evaluation gate must confirm soundness before it counts.)
- Describe Google Co-Scientist's architecture. (§4 - a Gemini-based multi-agent system with Generation, Reflection, Ranking-by-Elo, Evolution, Proximity, and Meta-review agents under a supervisor, running a generate-debate-evolve loop with test-time compute scaling.)
- How does Agent Laboratory keep autonomous research honest? (§5 - four phases lit-review/experiment/report/AgentRxiv, a human-team role hierarchy with a Professor owner, and a three-reviewer gate returning structured JSON scores plus a hard Accept/Reject so novelty must also be sound.)
- Where does safety sit in a discovery loop, and why there? (§6 - the safety check runs before any investigation spend, and high-impact claims route to human review with provenance, because open-ended autonomy can wander into sensitive space by definition; Co-Scientist screened 1,200 adversarial goals.)
- You have a 60k-token budget and 5k/probe. How do you allocate it? (§3 - ~12 probes; split e.g. 70/30 exploit/explore so ~8 deepen the top hypothesis and ~4 gamble on novel ones, then tune the ratio from the ledger if exploration keeps returning nothing.)