
Part VIII - Autonomy and discovery

Exploration and discovery - finding unknown unknowns

The frontier pattern. Until now every loop we built answered a known question or optimized inside a known solution space. Discovery is the agent deciding, on its own, which question is worth asking next - proposing hypotheses, gathering evidence, and pushing into territory nobody specified in advance. It is also the most dangerous loop in the book, because "search freely" and "stay safe and on-budget" pull in opposite directions.

Book source
Chapter 21 - Exploration and Discovery (探索与发现); PDF outline pages 228-236. Running examples: Google Co-Scientist and Agent Laboratory.
Linear position
Prerequisite: Lesson 22 (Prioritization) - you can already score and queue candidate actions when many are useful at once. Discovery reuses that scorer, but now the candidates are hypotheses to test, not tasks to do.
New capability: a proactive loop that generates its own directions, spends a bounded budget probing them, records what it learned (including dead ends), and refines the search - so the agent can surface "unknown unknowns" instead of only executing what it was told.
The plan
Five moves. (1) Pin down what makes discovery a different pattern from retrieval, optimization, and planning - the thing it adds is choosing the next question. (2) Build the discovery loop as concrete state: a hypothesis ledger with novelty / value / feasibility / cost / risk scores, an evidence column, and a pruned list. (3) Confront the exploration-exploitation tradeoff head-on with a worked budget-allocation number and an interactive widget. (4) Walk the book's two systems - Google Co-Scientist (a six-role multi-agent "generate-debate-evolve" loop ranked by Elo) and Agent Laboratory (an automated research workflow with a three-reviewer gate). (5) Make the safety boundary load-bearing, not a footnote, then hand off to the capstone.

1 · What discovery adds that the earlier patterns do not

It is easy to confuse this pattern with three things we already built, so we draw the lines sharply. Retrieval (lesson 16) answers a question you already have: "what encryption algorithms are common?" It assumes the question is given. Optimization / prioritization (lesson 22) ranks a set of actions that already exist: it picks the best move inside a defined space. Planning (lesson 08) decomposes a known goal into known sub-steps. All three are reactive in the deep sense - the space of possibilities is handed to them.

Discovery is the pattern for when the solution space is not fully defined. The book's rule of thumb: reach for exploration and discovery when the task is open-ended, complex, or fast-changing, and the goal is to find the "unknown unknowns" rather than optimize a known process. The agent's distinctive act is generating the next question or hypothesis itself - and then deciding whether that question was worth the compute it cost. Static knowledge and pre-programmed solutions are, by assumption, insufficient here; the agent has to expand its own understanding.

RETRIEVAL     question given    ─▶ search    ─▶ answer
OPTIMIZATION  action set given  ─▶ rank      ─▶ best known move
PLANNING      goal given        ─▶ decompose ─▶ ordered steps
───────────────────────────────────────────────────────────────
DISCOVERY     goal + frontier ─▶ GENERATE hypotheses ─▶ probe a few
                                      ▲                      │
                                      └── refine search ◀── record evidence
              the agent chooses the next QUESTION, not just the next answer

The book frames this as a paradigm shift "from passive reaction to active exploration," and lists where it pays off: scientific research automation (design and run experiments, propose new hypotheses, find new materials or drug candidates), game strategy (AlphaGo finding strategies or exploiting environment loopholes), market and trend discovery (scanning unstructured social / news / report data), security vulnerability discovery, creative generation, and personalized education. The common thread: the valuable result was not in the original problem statement.

2 · The discovery loop as concrete state

Skeleton pseudocode hides the one thing that makes this pattern reliable instead of a runaway browser: the ledger. Discovery is not "keep searching until something turns up." It is a bounded loop over an explicit table of hypotheses, each carrying scores and evidence, with a stop rule. The lesson order places discovery after prioritization precisely because exploration is expensive and must be governed by the budget machinery from lesson 22.

01 Generate candidate hypotheses or directions from the goal + background (literature, profiling data, prior traces). Aim for diversity, not just the obvious next step.
02 Prioritize each by novelty, expected value, feasibility, cost, and risk - the same multi-criteria scorer from lesson 22, now scoring questions.
03 Investigate the top-k within budget: gather evidence, run a cheap experiment or simulation, or query a tool.
04 Record the result - including negative results - update beliefs, prune dead ends, and feed what changed back into the next round of generation.
def run_discovery(goal, background, budget):
    # Helpers (generate_hypotheses, prioritize, investigate, ...) are placeholders
    # for your own implementations; only the control flow is the point here.
    hypotheses = generate_hypotheses(goal, background)    # diverse, not greedy
    ledger = []

    for _ in range(budget.max_rounds):
        ranked = prioritize(hypotheses,
                            criteria=["novelty", "value", "feasibility", "cost", "risk"])
        for h in ranked[:budget.experiments_per_round]:
            if not safety_check(h):        # domain constraints + risk review FIRST
                ledger.append(record(h, status="blocked"))
                prune(h)                   # blocked items leave the queue, so they are not re-ranked
                continue
            evidence = investigate(h)      # spend tokens / a tool call / a sim run
            ledger.append(record(h, evidence, status="tested"))
            if promising(evidence):
                hypotheses += expand(h)    # promising direction spawns children
            else:
                prune(h)                   # negative result is kept in the ledger, not forgotten
        if budget.spent >= budget.cap or converged(ledger):
            break
    return summarize(ledger)               # what was tried, what changed, what was pruned

The ledger is the agent's experiment log: what was tried, why it looked promising, what evidence moved the belief, and which paths were pruned. Without it the agent re-tests dead ends forever (see "No ledger" under Failure modes). With it, a negative result becomes information that narrows the search - exactly the discipline a senior scientist applies. The book notes that this is also what Co-Scientist's reliance on open literature struggles to capture, because negative results are rarely published.
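One way to make the ledger concrete is a small record type plus a filter that keeps ruled-out hypotheses from re-entering the queue. This is a minimal sketch, not the book's implementation: the names `LedgerEntry`, `Ledger`, `already_closed`, and `fresh` are hypothetical, and hypotheses are plain strings only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LedgerEntry:
    hypothesis: str
    status: str                      # "tested", "pruned", or "blocked"
    evidence: Optional[str] = None   # what moved the belief, if anything
    rationale: str = ""              # why it looked promising beforehand


@dataclass
class Ledger:
    entries: list = field(default_factory=list)

    def record(self, hypothesis, status, evidence=None, rationale=""):
        self.entries.append(LedgerEntry(hypothesis, status, evidence, rationale))

    def already_closed(self, hypothesis):
        """True if this hypothesis was pruned or blocked in an earlier round."""
        return any(e.hypothesis == hypothesis and e.status in {"pruned", "blocked"}
                   for e in self.entries)

    def fresh(self, candidates):
        """Drop candidates the ledger has already ruled out."""
        return [c for c in candidates if not self.already_closed(c)]


ledger = Ledger()
ledger.record("add an LRU cache", "pruned", evidence="micro-benchmark: +2% only")

# The next generation round proposes the dead end again; the ledger filters it out.
next_round = ledger.fresh(["add an LRU cache", "lock contention under concurrency"])
print(next_round)
```

Matching on exact strings is the weakest part of this sketch; a real agent would likely need a similarity check so that a reworded dead end is still recognized.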

3 · The real tension: exploration vs exploitation

Every discovery system fights one tradeoff (the book points straight at the Wikipedia "exploration-exploitation dilemma" in its references). Exploit means pour your remaining budget into the hypothesis that already looks best. Explore means spend some budget on long-shot, high-novelty directions that might be the breakthrough - or might be wasted tokens. Pure exploitation converges fast on a local optimum and never finds the unknown unknown; pure exploration burns the whole budget wandering and never deepens anything.

Treating novelty as if it were value is the classic beginner error: a hypothesis can be wildly original and also worthless. So the score that drives investigation is a blend, and we make it numeric.

score(h) = wv·value(h) + wn·novelty(h) + wf·feasibility(h) − wc·cost(h) − wr·risk(h)
Worked number - allocating a 60k-token discovery budget
A performance agent has profiling data and a hard budget of 60,000 tokens for one discovery round; each hypothesis probe (form it, fetch evidence, run one micro-benchmark) costs ≈ 5,000 tokens, so it can afford 12 probes. It generated four hypotheses:
| Hypothesis | value | novelty | feasibility | cost | score |
| --- | --- | --- | --- | --- | --- |
| H1: add an LRU cache to the hot path | 0.8 | 0.1 | 0.9 | 0.2 | 0.53 |
| H2: N+1 query in the ORM layer | 0.7 | 0.3 | 0.8 | 0.3 | 0.48 |
| H3: lock contention under concurrency | 0.6 | 0.6 | 0.5 | 0.5 | 0.38 |
| H4: rewrite the hot loop in a native ext. | 0.9 | 0.8 | 0.2 | 0.9 | 0.37 |

with weights wv=0.5, wn=0.3, wf=0.2, wc=0.4 and the risk term omitted (wr=0), since the table carries no risk column (e.g. H1: 0.5·0.8 + 0.3·0.1 + 0.2·0.9 − 0.4·0.2 = 0.40 + 0.03 + 0.18 − 0.08 = 0.53). Pure exploitation spends all 12 probes deepening H1 and never learns that H3's novel lock-contention angle was the actual bottleneck. A budget split - say 70% exploit / 30% explore - sends ≈ 8 probes to the top scorer H1 and reserves ≈ 4 for the high-novelty tail (H3, H4). Cost ≈ 60k tokens either way; the split is what buys the chance of a non-obvious win. The right ratio is itself tuned from the experiment log: if exploration probes keep returning nothing, lower the explore share next round.
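The blended score and the probe split are easy to check mechanically. The sketch below evaluates the formula with the stated weights for the four hypotheses and divides the 60k-token budget into exploit and explore probes. It assumes the risk weight is zero (the table has no risk column), and the function names `blended_score` and `split_probes` are illustrative, not from the book.

```python
# Weights from the worked example; "risk" is set to 0.0 as an assumption,
# because the hypothesis table does not list a risk score.
WEIGHTS = {"value": 0.5, "novelty": 0.3, "feasibility": 0.2, "cost": 0.4, "risk": 0.0}

hypotheses = {
    "H1": {"value": 0.8, "novelty": 0.1, "feasibility": 0.9, "cost": 0.2},
    "H2": {"value": 0.7, "novelty": 0.3, "feasibility": 0.8, "cost": 0.3},
    "H3": {"value": 0.6, "novelty": 0.6, "feasibility": 0.5, "cost": 0.5},
    "H4": {"value": 0.9, "novelty": 0.8, "feasibility": 0.2, "cost": 0.9},
}


def blended_score(h, w=WEIGHTS):
    """Benefits add, cost and risk subtract; missing risk counts as 0."""
    return (w["value"] * h["value"]
            + w["novelty"] * h["novelty"]
            + w["feasibility"] * h["feasibility"]
            - w["cost"] * h["cost"]
            - w["risk"] * h.get("risk", 0.0))


def split_probes(total_tokens, tokens_per_probe, explore_share):
    """Return (exploit_probes, explore_probes) for a fixed token budget."""
    probes = total_tokens // tokens_per_probe
    explore = round(probes * explore_share)
    return probes - explore, explore


scores = {name: round(blended_score(h), 2) for name, h in hypotheses.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
exploit_probes, explore_probes = split_probes(60_000, 5_000, explore_share=0.30)

print(scores)
print(ranked, exploit_probes, explore_probes)
```

Note that the ranking puts the two high-novelty hypotheses last, which is exactly why a reserved explore share is needed: a pure top-of-ranking policy would never reach them.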

The widget below makes this allocation tangible. Slide the explore share and watch how the budget is divided between deepening the current best hypothesis and gambling on novel ones, and how the expected discovery (a stylized blend of finding the known win and the chance of a breakthrough) peaks somewhere in the middle - never at 0% and never at 100%.

Exploration vs exploitation - splitting a fixed discovery budget
A fixed budget of probes is split between exploit (deepen the current top hypothesis - a near-sure but capped payoff) and explore (probe novel long shots - mostly nothing, occasionally a breakthrough worth far more). Move the slider: the bars show how probes are allocated, and the dot shows expected total discovery. Pure exploit caps out early; pure explore wastes probes; the optimum sits in between.
Sample readout: exploit probes 8 · explore probes 4 · P(any breakthrough) 28% · expected discovery 1.00
Core JS:
// Split the budget, then value each arm.
// budget = total probes, x = explore share in [0,1], p = per-probe breakthrough odds.
const KNOWN_CAP = 1.0;   // capped payoff from refining the known-best hypothesis
const BIG = 3.0;         // payoff of a single breakthrough

function expectedDiscovery(budget, x, p) {
  const explore = Math.round(budget * x);
  const exploit = budget - explore;

  // EXPLOIT: deepening the known-best hypothesis has diminishing returns,
  // saturating toward a capped payoff (you can only refine H1 so far).
  const vExploit = KNOWN_CAP * (1 - Math.exp(-exploit / 4));

  // EXPLORE: each novel probe is an independent shot at a breakthrough
  // worth BIG; value = P(at least one hit) * BIG.
  const pAny = 1 - Math.pow(1 - p, explore);
  const vExplore = pAny * BIG;

  // total expected discovery, plus the pieces the readouts display
  return { exploit, explore, pAny, expected: vExploit + vExplore };
}
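The same stylized model can be swept offline to see where the peak falls. The Python port below is a sketch under one loud assumption: the per-probe odds `p = 0.08` are not stated in the lesson and were chosen only because they reproduce the 28% readout at 4 explore probes. It also reports the raw, unnormalized sum, so its totals need not match the widget's "expected discovery" display.

```python
import math

KNOWN_CAP = 1.0   # capped payoff from deepening the known-best hypothesis
BIG = 3.0         # payoff of a single breakthrough


def expected_discovery(budget, explore, p):
    """Return (P(any breakthrough), expected total) for a given explore count."""
    exploit = budget - explore
    v_exploit = KNOWN_CAP * (1 - math.exp(-exploit / 4))   # diminishing returns
    p_any = 1 - (1 - p) ** explore                         # at least one hit
    return p_any, v_exploit + p_any * BIG


def best_split(budget, p):
    """Explore count in 0..budget that maximizes expected discovery."""
    return max(range(budget + 1), key=lambda e: expected_discovery(budget, e, p)[1])


P = 0.08  # ASSUMED per-probe odds, picked to match the 28% readout
p_any_at_4, total_at_4 = expected_discovery(12, 4, P)
best = best_split(12, P)

print(f"P(any) with 4 explore probes: {p_any_at_4:.0%}")
print(f"best explore count out of 12: {best}")
```

Under these assumed parameters the optimum is interior. With much smaller per-probe odds (for example `best_split(12, 0.001)`) the same model puts the optimum at zero exploration, so the interior peak is a property of the chosen parameters rather than a guarantee; that is one more reason to tune the split from the experiment log.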

4 · Google Co-Scientist - discovery as a multi-agent debate

The book's centerpiece is Google Co-Scientist (Google Research, built on the Gemini LLM): a system that helps human scientists generate hypotheses, refine proposals, and design experiments. Crucially, it is not one prompt - it is a multi-agent framework that simulates the collaborative, iterative nature of the scientific method, coordinated by a supervisor agent over an asynchronous task-execution framework so compute can scale elastically. The specialized roles map almost one-to-one onto the discovery loop in section 2:

| Agent | Role in the discovery loop |
| --- | --- |
| Generation | Proposes initial hypotheses via literature exploration and simulated scientific debate (step 01). |
| Reflection | Acts as peer review - judges each hypothesis for correctness, novelty, and quality. |
| Ranking | Uses an Elo rating from simulated debate tournaments to compare and prioritize hypotheses (step 02). |
| Evolution | Continuously refines top-ranked hypotheses - simplifies, synthesizes views, explores unconventional reasoning. |
| Proximity | Builds a proximity graph, clustering similar ideas to help navigate the hypothesis space. |
| Meta-review | Synthesizes all reviews and debates, finds common patterns, and feeds them back so the whole system improves. |

The system runs a generate - debate - evolve loop that mirrors the scientific method: a human enters a research question, and the agents cycle through generating, evaluating (tournament-style ranking), and refining hypotheses. It uses test-time compute scaling - dynamically pouring more compute into iteratively improving outputs, which is the section-3 tradeoff turned into a knob.
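To see how tournament-style ranking turns pairwise debate outcomes into an ordering, here is a sketch using the standard Elo update. It is not Co-Scientist's code: the K-factor of 32, the starting rating of 1200, and the debate results are all assumptions made for illustration, and in a real system each outcome would come from a judged debate between two hypotheses.

```python
def expected_win(r_a, r_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Shift rating from loser to winner; upsets move more points."""
    gain = k * (1.0 - expected_win(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


# Every hypothesis starts level (assumed starting rating).
ratings = {"H1": 1200.0, "H2": 1200.0, "H3": 1200.0}

# (winner, loser) pairs from simulated debates - made-up outcomes.
debates = [("H3", "H1"), ("H3", "H2"), ("H1", "H2")]
for winner, loser in debates:
    update_elo(ratings, winner, loser)

ranked = sorted(ratings, key=ratings.get, reverse=True)
print(ranked)
```

Because each update is zero-sum, the ratings only encode relative standing; the ranking feeds the Evolution step, which spends further compute on the hypotheses at the top.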

Why this is credible, in numbers
On the GPQA benchmark the system's internal Elo score tracked answer accuracy closely, reaching 78.4% top-1 on the hard "diamond" set; across 200+ research goals, more test-time compute kept improving hypothesis quality. End-to-end, it proposed a novel drug candidate (KIRA6) for AML later confirmed in vitro, found a new epigenetic target for liver fibrosis validated in organoids, and independently re-derived an unpublished mechanism for why certain mobile genetic elements (cf-PICIs) are widespread - matching a finding another lab took a decade to confirm experimentally. The framing is augmentation, not automation: "scientist-in-the-loop," humans steering via natural language.

5 · Agent Laboratory - the discovery loop as a research pipeline

The chapter's worked code example is Agent Laboratory (an MIT-licensed open project by Samuel Schmidgall), an autonomous research workflow that, like Co-Scientist, aims to augment rather than replace human research. It runs four phases that are the discovery loop made concrete and durable:

01 Literature review - LLM agents collect and analyze relevant work (e.g. from arXiv), building the knowledge base for everything downstream.
02 Experimentation - design, data prep, execution, analysis; agents call tools like Python code generation/execution and Hugging Face models, and iterate on real results.
03 Report writing - generate a structured report, integrating results with the literature and using LaTeX for figures and typesetting.
04 Knowledge sharing - publish to AgentRxiv, a decentralized repository where autonomous research agents store, retrieve, and build on each other's results.

Two design choices are worth lifting into your own discovery agents. First, the roles mirror a human research team - a ProfessorAgent sets the agenda and assigns tasks, a PostdocAgent carries out the research, plus ML-engineer and software-engineer agents for data prep - which keeps a strategic owner above the explorers. Second, and most importantly, the evaluation gate: a ReviewersAgent runs a three-reviewer mechanism, three independent critics with different prompted perspectives (one focused on whether the work yields research insight, one on field-level impact, one on genuine novelty), each returning a structured JSON score on axes like Originality, Quality, Significance, Soundness, and a hard Accept/Reject. This is the lesson-06 reflection pattern, hardened into a discovery quality gate so that "novel" still has to pass "sound."

# Agent Laboratory's reviewer gate (paraphrased): three perspectives, structured scores
reviewers = [
    "strict but fair - does the experiment yield real research insight?",
    "critical but fair - does this matter for the field's impact?",
    "open-minded but rigorous - is there a genuinely new idea here?",
]
scores = [get_score(plan, report, reviewer=r) for r in reviewers]
# each get_score returns JSON: Originality/Quality/Clarity/Significance (1-4),
# Soundness, Overall (1-10), Confidence (1-5), Decision in {Accept, Reject}
accept = aggregate(scores).decision == "Accept"   # novelty alone does not pass
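The paraphrase above leaves `aggregate` undefined. One plausible policy, sketched here, combines a majority vote with a floor on the mean Overall score and a per-reviewer soundness floor. This is an assumption for illustration, not Agent Laboratory's actual aggregation: the `Review` and `Verdict` types, the thresholds, and the 1-4 soundness scale are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Review:
    reviewer: str
    overall: int      # 1-10
    soundness: int    # ASSUMED 1-4 scale; the lesson does not state it
    decision: str     # "Accept" or "Reject"


@dataclass
class Verdict:
    decision: str
    mean_overall: float


def aggregate(reviews, min_overall=6.0, min_soundness=2):
    """Accept only on a majority vote AND a good mean AND no soundness veto."""
    accepts = sum(r.decision == "Accept" for r in reviews)
    avg = mean(r.overall for r in reviews)
    sound = all(r.soundness >= min_soundness for r in reviews)
    passed = accepts * 2 > len(reviews) and avg >= min_overall and sound
    return Verdict("Accept" if passed else "Reject", avg)


# A novel idea that every reviewer likes, but one of them finds unsound.
novel_but_unsound = [
    Review("insight", overall=8, soundness=3, decision="Accept"),
    Review("impact", overall=7, soundness=3, decision="Accept"),
    Review("novelty", overall=9, soundness=1, decision="Accept"),
]
verdict = aggregate(novel_but_unsound)
print(verdict.decision, verdict.mean_overall)
```

The soundness veto is the design choice that matters: it is what keeps a high novelty score from carrying a flawed experiment through the gate.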

6 · Safety is part of the pattern, not an add-on

Open-ended autonomy is exactly the capability that can wander into sensitive or risky space - dangerous chemistry, security exploits, harmful content - because, by definition, the agent is going where no one specified. The book makes safety review a first-class step, not a wrapper. Co-Scientist runs every research goal and generated hypothesis through a safety check; its preliminary evaluation against 1,200 adversarial goals showed it could reliably refuse dangerous inputs, and it was rolled out cautiously via a Trusted Tester Program. In the loop of section 2 this is why safety_check(h) runs before any token is spent investigating, and why high-impact claims route to human review (lesson 15) with provenance attached. Discovery without domain constraints, provenance, and review is not bold - it is unaccountable.
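As a sketch of what "safety before any token is spent" can look like in code, the gate below routes each hypothesis to one of three outcomes before investigation starts. Everything here is hypothetical: the `Route` names, the thresholds, and especially the phrase list, which stands in for whatever policy model or classifier a real system would use (a keyword match alone is far too weak for production).

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    BLOCK = "block"                # never investigated, but still logged
    HUMAN_REVIEW = "human_review"  # high impact: a person decides first
    PROCEED = "proceed"


@dataclass
class Hypothesis:
    text: str
    risk: float     # 0..1, from the prioritization scorer
    impact: float   # 0..1, how consequential a positive result would be


# Placeholder domain constraints - illustrative only.
BLOCKED_PHRASES = ("bypass authentication", "disable audit logging")


def safety_gate(h, max_risk=0.7, review_impact=0.8):
    """Decide the route BEFORE investigate(h) is ever called."""
    lowered = h.text.lower()
    if h.risk > max_risk or any(p in lowered for p in BLOCKED_PHRASES):
        return Route.BLOCK
    if h.impact >= review_impact:
        return Route.HUMAN_REVIEW
    return Route.PROCEED


candidates = [
    Hypothesis("Lock contention under concurrency", risk=0.2, impact=0.5),
    Hypothesis("Bypass authentication to skip a slow check", risk=0.4, impact=0.6),
    Hypothesis("Change production connection-pool config", risk=0.3, impact=0.9),
]
routes = [safety_gate(h) for h in candidates]
print([r.value for r in routes])
```

Blocked hypotheses should still be written to the ledger with their route, so the provenance trail shows what was refused and why.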

Running example - the performance discovery agent

Thread it together with our coding/research assistant. A performance agent is asked to "make the checkout service faster" - open-ended, no specified fix. It does not start rewriting code. It: (1) generates bottleneck hypotheses from profiling data and traces (the four H's in section 3); (2) prioritizes them by value/novelty/feasibility/cost and reserves 30% of its 60k-token budget for the novel ones; (3) investigates the top probes - each a single small change behind a micro-benchmark, never a blind rewrite; (4) records results in the ledger, keeping that "LRU cache gave +2% only" negative result so it is never re-tried, and on finding lock contention (the novel H3) expands that direction. A safety/quality gate confirms the proposed change passes tests before it is surfaced, and a human approves anything touching production config.

Checkpoint exercise

Try it
Take an open-ended goal of your own ("reduce our cloud bill", "find an untapped market segment"). Write three hypotheses. For each, give value, novelty, feasibility, and cost scores in [0,1], compute the blended score from section 3 (state your weights), and name the single piece of evidence that would reject it cheaply. Then decide your explore/exploit split and justify it in one sentence.

Failure modes

  • No stop rule. Open-ended browsing with no budget cap or convergence test - the loop runs until the credits run out.
  • Novelty mistaken for value. The most original hypothesis gets investigated even though no evaluation says it is useful or sound.
  • No ledger. Rejected hypotheses are not recorded, so the agent re-tests the same dead ends every round.
  • Pure exploitation. All budget poured into the obvious top hypothesis; the unknown unknown is never reached.
  • Safety as an afterthought. Risk review runs after investigation, or not at all, so the agent has already probed sensitive space before anyone checked.
  • No provenance. A high-impact claim is surfaced with no evidence trail, so a human cannot verify it and a hallucination can slip through.
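The first failure mode has a cheap remedy: an explicit stop rule checked every round. The sketch below combines a hard token cap with a simple convergence test that fires when the most recent probes were all unpromising. The `Budget` type, the `patience` parameter, and the boolean outcome list are assumptions for illustration; the lesson's `converged(ledger)` is left unspecified.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    cap_tokens: int
    spent_tokens: int = 0

    def exhausted(self):
        return self.spent_tokens >= self.cap_tokens


def converged(outcomes, patience=4):
    """True when the last `patience` probes were all unpromising.

    `outcomes` is a list of booleans, one per probe, True = promising.
    """
    recent = outcomes[-patience:]
    return len(recent) == patience and not any(recent)


def should_stop(budget, outcomes, patience=4):
    """Stop on whichever comes first: budget cap or a dry streak."""
    return budget.exhausted() or converged(outcomes, patience)


budget = Budget(cap_tokens=60_000, spent_tokens=35_000)
outcomes = [True, False, True, False, False, False, False]   # made-up history
print(should_stop(budget, outcomes))
```

Here the loop stops on the dry streak with budget to spare, which is the intended behavior: unspent tokens are better than tokens burned on a search that has stopped producing evidence.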

Implementation checklist

  • What is the exploration goal, and what counts as "discovered"?
  • How are diverse hypotheses generated (not just the greedy next step)?
  • How are novelty and value scored, and with what weights?
  • What is the token / time / money budget, and the explore-exploit split?
  • Is there a ledger recording tried, promising, evidence, and pruned?
  • Are negative results kept and fed back into generation?
  • Does safety_check run before investigation, with human review for high-impact claims?
  • Is there an evaluation gate (e.g. multi-reviewer) so novelty must pass soundness?

Where this points next

Discovery is the last of the book's 21 patterns, and it is fitting that it is the one that most needs all the others: it leans on planning to structure the search, tool use and RAG to gather evidence, multi-agent collaboration for the generate/critique/rank roles, reflection for the reviewer gate, memory for the ledger, prioritization for the budget, and guardrails plus human-in-the-loop for safety. That is the whole point of the final lesson. Lesson 24, the capstone, composes a single coding-and-research agent in which these patterns stop being separate chapters and become one architecture - act, remember, recover, evaluate, prioritize, discover - and maps which framework (LangChain/LangGraph, Google ADK, CrewAI) you would reach for to build each piece.

Takeaway
Exploration and discovery is the pattern for open-ended, ill-defined spaces, where the agent's distinctive act is choosing the next question, not just answering a given one. Build it as a bounded loop over an explicit hypothesis ledger: generate diverse candidates, score them by novelty and value (never novelty alone), investigate the top few within a budget, and record every result including the negatives so dead ends are not re-walked. The governing tension is exploration vs exploitation - the optimal budget split lives between the extremes, never at either end. The book's exemplars - Google Co-Scientist's six-role generate-debate-evolve loop ranked by Elo (78.4% on GPQA-diamond, real drug/target findings) and Agent Laboratory's lit-review-to-AgentRxiv pipeline with a three-reviewer gate - show the same shape at scale, and both insist on safety review and scientist-in-the-loop, because the freedom to discover is also the freedom to wander somewhere it should not.

Interview prompts