Part VIII - Autonomy and discovery
Prioritization - choosing the next best action
An agent that can plan, act, recover, and evaluate now faces a harder question than "can I do this?": "which of the many things I could do should I do next?" Prioritization is the policy that keeps an autonomous loop pointed at the most valuable action when the to-do list is longer than the budget. This is the pattern that separates a goal-seeking agent from a script that runs its steps in the order they were written.
New capability: A next-action policy — a scoring function over candidate actions, plus a reprioritization rule that re-runs the scorer whenever an observation changes the state.
1 · Why a loop needs a priority policy
Up to this lesson, the agent's control flow has been mostly determined. Prompt chaining (lesson 03) runs steps in a fixed line. Routing (lesson 04) picks one branch from a known set. Planning (lesson 08) produces an ordered list and the loop walks it. In all three, the order of work is decided before the agent starts acting, and the agent's job is to execute that order.
That breaks the moment two things are true at once, which is the normal case for a real autonomous agent: there are more useful actions available than the budget allows, and the world changes while the agent works. A coding agent mid-task could read three more files, run the test suite, ask the user a clarifying question, refactor a helper, or open a related issue — all defensible, none free. A plan written ten steps ago does not know that the last test just failed, or that the token budget is two-thirds gone, or that a dependency is now blocking everything downstream. The book's framing: without an explicit next-action decision process, an agent becomes inefficient, stalls, or fails to reach its key goals.
Prioritization is the answer. Instead of trusting a frozen plan, the agent keeps the plan as a candidate set and, at each step, asks: of everything I could do right now, which action has the best expected value per unit of cost, given the current state? The mental model is a hospital triage nurse. Patients (candidate actions) arrive faster than doctors (budget) can see them. The nurse does not treat in arrival order; she scores each by severity and time-sensitivity, and — critically — re-triages when a waiting patient deteriorates. The plan is not a queue you read top to bottom; it is a queue you re-sort after every new observation.
2 · The six criteria, as one scoring function
The chapter defines prioritization as evaluating each candidate task against a set of criteria, then applying scheduling/selection logic to pick the next action, and finally allowing dynamic adjustment as the environment changes. It lists six criteria: urgency, importance, dependency, resource availability, cost/benefit, and user preference. We make them concrete by mapping them to numbers the agent can actually compute.
The book says scoring can range from simple rules (keyword → P0/P1/P2, as in its code example) to a complex scoring system to full LLM reasoning ("rank these five actions and explain"). The middle ground, an explicit weighted score, is the one worth being able to write on a whiteboard, because it is auditable and cheap. A clean form for the coding/research agent:
score = wval·value + wdep·unlocks + winfo·infoGain − wrisk·risk − wcost·cost
where each term is normalized to roughly [0,1] and the weights encode policy. Value folds importance and urgency (how much this advances the goal, scaled up if time-critical). Unlocks is the dependency term — how many blocked actions this clears. infoGain rewards uncertainty reduction (the cheap probe that tells you which of two paths is real). Risk and cost are subtracted: a destructive or expensive action must clear a higher bar. The selection logic is then just argmax over allowed actions — "allowed" because the guardrails from lesson 20 can veto an action regardless of score (you never run a high-value action that violates a permission).
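The scoring function and the guardrail-gated argmax can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the `Candidate` and `Weights` types, the `pick_next` name, the `is_allowed` callback, and the two example actions are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    value: float      # importance and urgency folded together, in [0, 1]
    unlocks: float    # dependency term: how much blocked work this clears
    info_gain: float  # expected uncertainty reduction
    risk: float       # subtracted from the score
    cost: float       # subtracted from the score


@dataclass(frozen=True)
class Weights:
    val: float
    dep: float
    info: float
    risk: float
    cost: float


def score(c: Candidate, w: Weights) -> float:
    """Weighted sum: benefits are added, risk and cost are subtracted."""
    return (w.val * c.value + w.dep * c.unlocks + w.info * c.info_gain
            - w.risk * c.risk - w.cost * c.cost)


def pick_next(candidates: Iterable[Candidate], w: Weights,
              is_allowed: Callable[[Candidate], bool]) -> Optional[Candidate]:
    """Filter by the guardrail first, then take the argmax of what remains."""
    allowed = [c for c in candidates if is_allowed(c)]
    if not allowed:
        return None  # nothing permitted: the caller decides how to escalate
    return max(allowed, key=lambda c: score(c, w))


# Hypothetical example: the higher-scoring action is vetoed by a guardrail.
candidates = [
    Candidate("force_push_fix", value=0.9, unlocks=0.8, info_gain=0.2, risk=0.1, cost=0.1),
    Candidate("run_failing_test", value=0.4, unlocks=0.3, info_gain=0.9, risk=0.0, cost=0.1),
]
policy = Weights(val=3, dep=2, info=2, risk=2, cost=1)
forbidden = {"force_push_fix"}
chosen = pick_next(candidates, policy, lambda c: c.name not in forbidden)
print(chosen.name if chosen else "nothing allowed")
```

The guardrail is applied as a filter before `max`, so a forbidden action is never selected no matter how high it scores.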
3 · Worked example — five candidates, one winner, then a re-rank
Our running agent is fixing a bug reported against a Python repo. Its monitor (lesson 13) says the goal is "make the failing test pass without regressing others," and its budget (lesson 18) is 60k tokens with ~22k already spent. It proposes five candidate next actions. We score each term on [0,1] and use weights wval=3, wdep=2, winfo=2, wrisk=2, wcost=1.
| Candidate action | value | unlocks | infoGain | risk | cost | score |
|---|---|---|---|---|---|---|
| A · Run the one failing test in isolation | 0.4 | 0.3 | 0.9 | 0.0 | 0.1 | 3.5 |
| B · Read 6 more files for context | 0.3 | 0.2 | 0.4 | 0.0 | 0.6 | 1.5 |
| C · Refactor the helper module | 0.5 | 0.1 | 0.1 | 0.7 | 0.5 | 0.0 |
| D · Ask the user which behavior is intended | 0.6 | 0.4 | 0.8 | 0.1 | 0.2 | 3.8 |
| E · Open a tracking issue for a side bug | 0.2 | 0.0 | 0.1 | 0.0 | 0.2 | 0.6 |
Take action A as the arithmetic: 3(0.4) + 2(0.3) + 2(0.9) − 2(0.0) − 1(0.1) = 1.2 + 0.6 + 1.8 − 0 − 0.1 = 3.5. The top two are D (3.8) and A (3.5): both cheap, both high-information, both unblock the rest. The refactor C scores lowest, at 0.0, because its risk penalty (2 × 0.7 = 1.4) plus its cost (0.5) cancels its entire positive contribution (1.5 + 0.2 + 0.2 = 1.9): it could regress passing tests, which directly contradicts the goal. Reading six more files (B) feels productive but is the classic trap: high cost, low information.
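The arithmetic for all five candidates can be checked mechanically. The sketch below copies the term values and weights from the table; the names `CANDIDATES` and `weighted_score` are illustrative choices, not taken from the book.

```python
# Policy weights from the worked example.
W_VAL, W_DEP, W_INFO, W_RISK, W_COST = 3, 2, 2, 2, 1

# name -> (value, unlocks, infoGain, risk, cost), copied from the table.
CANDIDATES = {
    "A": (0.4, 0.3, 0.9, 0.0, 0.1),  # run the one failing test
    "B": (0.3, 0.2, 0.4, 0.0, 0.6),  # read 6 more files
    "C": (0.5, 0.1, 0.1, 0.7, 0.5),  # refactor the helper module
    "D": (0.6, 0.4, 0.8, 0.1, 0.2),  # ask the user
    "E": (0.2, 0.0, 0.1, 0.0, 0.2),  # open a tracking issue
}


def weighted_score(terms):
    """Benefits are added; risk and cost are subtracted."""
    value, unlocks, info_gain, risk, cost = terms
    return (W_VAL * value + W_DEP * unlocks + W_INFO * info_gain
            - W_RISK * risk - W_COST * cost)


scores = {name: weighted_score(terms) for name, terms in CANDIDATES.items()}

# Highest score first: this is the order the agent would consider actions in.
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 2))
```

Running it puts D first and C last, with A a close second to D.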
So the agent picks D, a single clarifying question, because the expected information gain about which behavior is correct is worth more than blindly running code against an assumption. Suppose the user replies, and now the intended behavior is unambiguous. That observation changes the state, so the agent reprioritizes.
D's infoGain term collapses to near zero once the question is answered, so it falls out of contention, and A — run the now-meaningful test — rises to the top. Nobody re-edited a plan. The same scoring function, re-run against the new state, produced a new order. That is the whole pattern: state change → re-score → new next action.
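The state change → re-score → new next action cycle can be sketched as follows. The term values come from the table in this section; the assumptions are that the user's reply sets D's infoGain to exactly 0.0 and leaves every other term untouched, and the helper names (`weighted_score`, `leader`, `apply_observation`) are hypothetical.

```python
WEIGHTS = {"value": 3, "unlocks": 2, "infoGain": 2, "risk": 2, "cost": 1}
PENALTIES = {"risk", "cost"}  # these terms are subtracted


def weighted_score(terms):
    return sum((-w if k in PENALTIES else w) * terms[k] for k, w in WEIGHTS.items())


def leader(state):
    """Name of the highest-scoring candidate in the current state."""
    return max(state, key=lambda name: weighted_score(state[name]))


def apply_observation(state, name, **changes):
    """Return a new state with some terms of one candidate updated."""
    updated = dict(state)
    updated[name] = {**state[name], **changes}
    return updated


def row(value, unlocks, info_gain, risk, cost):
    return {"value": value, "unlocks": unlocks, "infoGain": info_gain,
            "risk": risk, "cost": cost}


before = {
    "A": row(0.4, 0.3, 0.9, 0.0, 0.1),
    "B": row(0.3, 0.2, 0.4, 0.0, 0.6),
    "C": row(0.5, 0.1, 0.1, 0.7, 0.5),
    "D": row(0.6, 0.4, 0.8, 0.1, 0.2),
    "E": row(0.2, 0.0, 0.1, 0.0, 0.2),
}

# Assumed effect of the user's reply: the question has nothing left to teach us.
after = apply_observation(before, "D", infoGain=0.0)

print("before:", leader(before))
print("after: ", leader(after))
```

No plan was edited: the same function is applied to a new state, and the leader moves from D to A.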
4 · Interactive — move the weights, watch the queue re-sort
[Interactive widget: the five candidates from §3 with their raw term values and a slider for each policy weight. Moving a weight re-ranks the bars live and highlights the leader. A "user clarified intent" toggle applies the observation from §3, dropping D's information gain to zero and changing the leader.] This is the scorer the agent runs internally; the only thing that ships to production is which bar is longest.
5 · Three levels: goal, plan-step, action
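The book is explicit that prioritization is not one decision but happens at three altitudes, and a serious agent runs it at all three:
- L1, strategic goal ordering (slow-moving).
- L2, plan-step sequencing (dependency-driven).
- L3, tactical action selection (per-tick).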
The analogy the book reaches for is a human team manager who ranks tasks by weighing the team's input. The manager does not re-pick the company's quarterly goal (L1) every morning, but does re-pick what each engineer touches next (L3) as standups surface new blockers. Mixing the levels is a common design error: re-litigating the strategic goal on every tool return is paralysis; never revisiting it means an agent that grinds on a now-pointless objective.
6 · The book's LangChain project-manager agent
The chapter's running code is a concrete instance of L2/L3 prioritization. A LangChain create_react_agent is given a small toolset over an in-memory task store — create_new_task, assign_priority_to_task, assign_task_to_worker, list_all_tasks — and a system prompt that encodes the priority policy in natural language: when a request says "ASAP", "urgent", or "critical", map it to P0; if priority is unstated, default to P1; if no worker is named, default to Worker A. Each Task is a Pydantic model with an optional priority field constrained to P0 / P1 / P2, and the tools validate their arguments through Pydantic schemas.
In the book's simulation, scenario 1 ("need the new login system ASAP, assign to Worker B") is parsed by the agent into: create the task, recognize "ASAP" → assign P0, assign Worker B, then list tasks. Scenario 2 ("review the marketing site content") has no urgency words, so the agent applies the default → P1, Worker A. The point the book draws out: the agent interpreted an ambiguous request, chose its tools, and ordered its actions itself. That keyword-to-priority mapping is the simplest end of the "simple rule → scoring system → LLM reasoning" spectrum from §2 — and it is exactly the policy our weighted scorer generalizes when "urgent" is no longer a single keyword but a real cost/benefit tradeoff.
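The policy that the system prompt states in natural language can also be written as a deterministic rule. The sketch below is not the book's LangChain code: it uses no LLM and no LangChain, and the `Task` dataclass, the `triage_request` function, and the regular expressions are assumptions made for illustration. It encodes only the three defaults described above (urgency keyword → P0, otherwise P1, unnamed worker → Worker A).

```python
import re
from dataclasses import dataclass

URGENT_KEYWORDS = ("asap", "urgent", "critical")


@dataclass
class Task:
    description: str
    priority: str = "P1"       # one of "P0", "P1", "P2"; P1 when unstated
    worker: str = "Worker A"   # default assignee when none is named


def triage_request(request: str) -> Task:
    """Map a free-text request to a task using keyword rules only."""
    text = request.lower()

    # Urgency keywords map to P0; anything else falls back to the P1 default.
    is_urgent = any(re.search(rf"\b{kw}\b", text) for kw in URGENT_KEYWORDS)
    priority = "P0" if is_urgent else "P1"

    # A worker is "named" if the text contains e.g. "Worker B".
    match = re.search(r"\bworker\s+([a-z])\b", text)
    worker = f"Worker {match.group(1).upper()}" if match else "Worker A"

    return Task(description=request, priority=priority, worker=worker)


scenario_1 = triage_request("need the new login system ASAP, assign to Worker B")
scenario_2 = triage_request("review the marketing site content")
print(scenario_1.priority, scenario_1.worker)
print(scenario_2.priority, scenario_2.worker)
```

A rule like this is cheap and auditable but brittle: it never assigns P2 and cannot read urgency that is implied rather than spelled out, which is the gap the LLM-driven agent and the weighted scorer are meant to cover.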
7 · Failure modes and the checklist
Failure modes
- Recency bias. Doing the most recently suggested action instead of the highest-scoring one — the agent's own last thought hijacks the queue. (The fix: always re-score the full candidate set, not just react to the latest token.)
- Skipping cheap probes. Ignoring a low-cost, high-infoGain action (run one test, ask one question) and instead doing expensive work on an unverified assumption — action B/C in §3.
- No budget-aware priority. Cost weight set to zero, so the agent explores forever and exhausts its tokens before finishing — the failure lesson 18 exists to prevent.
- Reprioritization thrash. Two actions near-tied; re-scoring every tick flips between them and nothing completes. Needs a hysteresis margin or commit-to-action rule.
- Stale priorities. The opposite — never re-scoring after a failure, so the agent keeps executing a plan the world already invalidated.
- Scoring a vetoed action. Picking a high-value action that guardrails forbid; "allowed" must gate the argmax, not be checked after.
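One way to damp reprioritization thrash is a hysteresis margin: keep the current action unless a challenger beats it by a clear gap. The sketch below is an assumed design rather than anything the book prescribes; the function name, the `margin` default of 0.2, and the example scores are hypothetical.

```python
from typing import Dict, Optional


def choose_with_hysteresis(scores: Dict[str, float],
                           current: Optional[str],
                           margin: float = 0.2) -> Optional[str]:
    """Switch actions only when the best candidate wins by at least `margin`."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # Stay committed if the current action is still a candidate and the
    # challenger's lead is smaller than the margin.
    if current in scores and scores[best] - scores[current] < margin:
        return current
    return best


# Two near-tied actions whose scores wobble from tick to tick.
ticks = [
    {"run_test": 3.50, "ask_user": 3.45},
    {"run_test": 3.45, "ask_user": 3.50},
    {"run_test": 3.50, "ask_user": 3.45},
]

current = None
history = []
for tick_scores in ticks:
    current = choose_with_hysteresis(tick_scores, current, margin=0.2)
    history.append(current)

print(history)
```

With a margin of zero the same loop would flip between the two actions on every tick. The margin trades a little optimality for the guarantee that something gets finished.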
Implementation checklist
- What is the explicit candidate set right now? (Keep it enumerable and inspectable.)
- What are the scoring terms, and are they normalized to a common scale?
- Which action unlocks the most blocked work (dependency term)?
- Which is the cheapest action that most reduces uncertainty?
- Is cost/budget actually weighted, with a stop when the budget is near-spent?
- What observations trigger a re-score — and is the trigger "new evidence," not "time"?
- Is there a margin / commit rule to prevent thrash on near-ties?
- Does the guardrail allow-list gate the selection before argmax?
- At which level (goal / plan-step / action) does each re-score happen?
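Two of the checklist items, the budget stop and the evidence-based re-score trigger, can be combined into one small gate at the top of the loop. This is a sketch under stated assumptions: the event names, the `loop_decision` function, and the 10% reserve are invented for illustration, and a real agent would tune them.

```python
from typing import Optional

# Hypothetical event names for "new evidence" that should trigger a re-score.
RESCORE_EVENTS = frozenset({
    "tool_returned",
    "test_failed",
    "budget_threshold_crossed",
    "user_replied",
})


def loop_decision(event: Optional[str], spent_tokens: int, budget_tokens: int,
                  reserve_fraction: float = 0.1) -> str:
    """Return "stop", "rescore", or "continue" for the current tick."""
    remaining = budget_tokens - spent_tokens
    # Budget check comes first: stop while there is still a reserve left
    # to summarize and hand off, instead of running the budget to zero.
    if remaining <= reserve_fraction * budget_tokens:
        return "stop"
    # Re-score only when something new was observed, not because time passed.
    if event in RESCORE_EVENTS:
        return "rescore"
    return "continue"


print(loop_decision(None, spent_tokens=22_000, budget_tokens=60_000))
print(loop_decision("test_failed", spent_tokens=22_000, budget_tokens=60_000))
print(loop_decision("test_failed", spent_tokens=55_000, budget_tokens=60_000))
```

Checking the budget before the event means a late failure cannot pull the agent into another round of expensive work it cannot afford to finish.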
Where this points next
Prioritization assumes the candidate set is given — it ranks actions the agent already knows about. But the highest-value action is sometimes one no candidate generator proposed, because the relevant information is an unknown unknown: a file nobody read, a hypothesis nobody formed, a failure mode nobody anticipated. A pure prioritizer can only sharpen choices among the options on the table; it cannot widen the table. Lesson 23, Exploration and discovery, is the frontier pattern that does exactly that — the agent proactively searches for information, hypotheses, and possibilities beyond its current candidate set. The two patterns are complementary: exploration generates candidates, prioritization decides which to spend on, and the infoGain term you saw in §3 is precisely the seam where they meet.
In short: score the candidate actions and take the argmax over allowed actions. Run it at three levels (strategic goal, plan-step, tactical action) and re-score whenever an observation changes the state, not on every tick. The book's LangChain PM agent shows the simplest rule-based end (keyword → P0/P1/P2, default P1); the weighted scorer generalizes it. This dynamic self-ordering under conflicting goals and limited resources is what makes the system an agent rather than a script.
Interview prompts
- Why does an autonomous agent need prioritization when it already has a plan? (§1 — a plan is frozen at write time; with more useful actions than budget and a changing world, the agent must keep the plan as a candidate set and re-sort it after each observation. That dynamic re-ordering is what distinguishes an agent from a script.)
- Name the criteria a priority score should combine, and write one scoring function. (§2 — urgency, importance, dependency, resource availability, cost/benefit, user preference; e.g. score = wval·value + wdep·unlocks + winfo·infoGain − wrisk·risk − wcost·cost, then argmax over allowed actions.)
- Your agent keeps reading more files instead of running the one failing test. What's wrong? (§3, §7 — the cost term is under-weighted and infoGain is ignored; the cheap high-information probe (run the test) should outscore expensive low-information work (read files). Re-balance weights so cost and infoGain bite.)
- When should the agent reprioritize, and when should it NOT? (§3 — re-score on new evidence: a tool returned, a test failed, budget crossed a threshold, the user replied. Do not re-score on every tick with no new information — that wastes budget and causes thrash between near-tied actions.)
- What are the three levels of prioritization, and what's the design error in mixing them? (§5 — strategic goal ordering (L1, slow), plan-step sequencing (L2, dependency-driven), tactical action selection (L3, per-tick). Re-litigating the strategic goal on every tool return is paralysis; never revisiting it grinds on a stale objective.)
- How does the book's project-manager agent implement prioritization, and where does it sit on the simple→complex spectrum? (§6 — a LangChain ReAct agent with a prompt rule mapping "ASAP/urgent/critical" → P0 and defaulting unstated priority to P1; it is the simplest rule-based end of the simple-rule → scoring-system → LLM-reasoning spectrum.)
- How do prioritization and exploration relate? (Hand-off — exploration widens the candidate set (finds unknown unknowns); prioritization ranks within it. The infoGain term is the seam: it lets the prioritizer pay for exploratory probes.)