Part VI - Knowledge, communication, and optimization
RAG and agentic retrieval - open-book reasoning
Until now the agent has reasoned from what the model already knows plus whatever we paste into the prompt. That knowledge is frozen at training time, blind to your private documents, and silently confident when it is wrong. This lesson gives the agent a library card: a way to look things up before it answers, and — the agentic twist — a way to judge what it found, notice gaps, resolve contradictions, and decide when it has read enough.
New capability: External, verifiable evidence acquisition with provenance, sufficiency checks, and conflict resolution — turning a closed-book model into an open-book reasoner.
1 · The disease and the cure
A large language model is a closed-book exam taker. Everything it "knows" was compressed into its weights at training time. That gives it three permanent handicaps the book is blunt about: it cannot see information created after its cutoff, it cannot see your private corpus (the company wiki, the ticket history, the contract PDFs), and — most dangerous — when it does not know, it does not reliably say so. It interpolates a fluent, plausible, wrong answer. We call that a hallucination: text that is grammatically and stylistically perfect and factually fabricated.
Retrieval-Augmented Generation (RAG) is the cure, and the mental model is exactly the one in the book: stop forcing the model to recite from memory, and let it look things up first — like a human consulting a reference book or searching the web before answering. Concretely, when a query arrives it does not go straight to the model. First the system searches an external knowledge base for the most relevant passages; then it augments the prompt by pasting those passages in as context; only then does the enriched prompt reach the model, which now generates an answer grounded in evidence it can quote.
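The retrieve, augment, generate sequence described above can be sketched in a few lines. This is a toy under stated assumptions, not the book's code: the two-passage `KNOWLEDGE_BASE`, the word-overlap scoring inside `retrieve`, and the `llm` callable are all hypothetical stand-ins (a real system would use an embedding-based retriever and a model client).

```python
from typing import Callable

# Hypothetical in-memory knowledge base; a real system would query a vector store.
KNOWLEDGE_BASE = [
    {"id": "policy-2025", "text": "Employees may work remotely up to three days per week."},
    {"id": "travel-2024", "text": "Travel expenses require manager approval before booking."},
]

def retrieve(query: str, k: int = 1) -> list[dict]:
    """Toy retriever: rank passages by how many query words they share."""
    words = set(query.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(words & set(doc["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def augment(query: str, passages: list[dict]) -> str:
    """Paste the retrieved passages into the prompt, keeping their source ids."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the context below and cite source ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def answer(query: str, llm: Callable[[str], str]) -> str:
    """Retrieve first, then augment the prompt, then generate."""
    return llm(augment(query, retrieve(query)))
```

Because `augment` keeps the source id next to each passage, the model has something concrete to cite, which is the provenance property the rest of the lesson relies on.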
The book lists the payoff precisely: RAG breaks the static-training-data ceiling (fresh and proprietary facts become reachable), it reduces hallucination by anchoring generation to verifiable data, and — the feature that makes it trustworthy in production — it can attach citations. The answer now points at its sources, so a human can check it. That last property is why, for an agent, retrieved text should always be treated as evidence with provenance, never as free-floating truth.
2 · The retrieval stack, bottom up
"Search the knowledge base" hides a pipeline. The book builds it from the ground up, and so will we. Six stages: chunk, embed, index, search, rerank/filter, pack.
2.1 Embeddings and semantic distance — the geometry of meaning
Everything rests on embeddings, so we make them concrete with the book's own toy example. Imagine a tiny 2-dimensional meaning-space (real ones have hundreds to thousands of dimensions, but 2D is enough to see the idea). The book places words like this: "cat" ≈ (2, 3), "kitten" ≈ (2.1, 3.1), and "car" ≈ (8, 1). Cat and kitten sit almost on top of each other; car is far away. That spatial closeness is semantic similarity. The magic this buys you: the query "a furry feline companion" and the document phrase "a domestic cat" share no words, yet their embeddings are near neighbours — so semantic search finds the right passage where keyword search would find nothing.
How "near" is measured matters. The standard score is cosine similarity: the cosine of the angle between two vectors, which ignores their length and looks only at direction (meaning):

$$\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
Worked number. Take the book's vectors. For a = "cat" = (2,3) and b = "kitten" = (2.1, 3.1): the dot product is 2·2.1 + 3·3.1 = 4.2 + 9.3 = 13.5; the norms are ‖a‖ = √(4+9) = √13 ≈ 3.606 and ‖b‖ = √(4.41+9.61) = √14.02 ≈ 3.744; so cos = 13.5 / √(13·14.02) ≈ 13.5 / 13.5004 ≈ 0.99997, essentially the same direction and therefore the same meaning. Now a = "cat" vs. c = "car" = (8,1): dot product 2·8 + 3·1 = 19; ‖c‖ = √65 ≈ 8.062; cos = 19 / (3.606·8.062) = 19 / 29.07 ≈ 0.654, clearly less related. A retriever ranks candidate chunks by exactly this number and keeps the top of the list.
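The worked numbers can be checked with a few lines of standard-library Python. The function name `cosine_similarity` is just an illustrative choice; the vectors are the book's toy 2D embeddings quoted above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, kitten, car = (2.0, 3.0), (2.1, 3.1), (8.0, 1.0)

print(round(cosine_similarity(cat, kitten), 5))  # 0.99997
print(round(cosine_similarity(cat, car), 3))     # 0.654
```

Scaling a vector does not change the score: `cosine_similarity((2, 3), (4, 6))` is 1 up to floating-point rounding, which is what "ignores length, looks only at direction" means in practice.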
2.2 Vector databases, and the keyword retrieval they complement
Storing and querying millions of these vectors fast is its own engineering problem, which is what a vector database solves. The book names the field directly: Pinecone, Weaviate, Chroma, Milvus, Qdrant, and notes that general stores like Redis, Elasticsearch, and Postgres (via the pgvector extension) can do it too. Under the hood, exact nearest-neighbor over millions of high-dimensional vectors is too slow, so these systems use approximate nearest-neighbor (ANN) algorithms — the book cites HNSW (a navigable small-world graph you can walk toward the query) — and lower-level libraries such as FAISS and ScaNN. ANN trades a tiny chance of missing the true closest vector for an enormous speedup.
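To see what ANN is approximating, it helps to write out the exact search it replaces. The sketch below is a brute-force baseline, not how any of the named databases work internally: it scores every stored vector against the query, so the cost per query grows linearly with the size of the collection. The names `exact_top_k` and `VECTORS` are illustrative assumptions.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_top_k(query, vectors, k):
    """Exact nearest neighbours: one similarity computation per stored vector."""
    best = heapq.nlargest(k, vectors.items(), key=lambda item: cosine(query, item[1]))
    return [name for name, _ in best]

# The book's toy 2D embeddings.
VECTORS = {"cat": (2.0, 3.0), "kitten": (2.1, 3.1), "car": (8.0, 1.0)}

print(exact_top_k((1.0, 1.5), VECTORS, k=2))  # ['cat', 'kitten']
```

With three vectors this is instant; with millions of high-dimensional vectors the full scan per query is the bottleneck that graph indexes such as HNSW avoid by visiting only a small part of the collection.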
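Crucially, the book does not declare semantic search universally superior. It contrasts two retrieval styles and recommends combining them: keyword retrieval (BM25) wins on exact terms such as error codes and SKUs, vector search wins when the query and the passage share meaning but not words, and hybrid retrieval fuses both.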
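One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The text here does not say which fusion method the book uses, so treat RRF as an assumption chosen for illustration; the document ids and the constant `k=60` (a conventional default) are likewise hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each list adds 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for a query that mentions an exact error code.
keyword_ranking = ["err-4012-runbook", "billing-faq", "login-guide"]
vector_ranking = ["login-guide", "err-4012-runbook", "sso-overview"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)  # ['err-4012-runbook', 'login-guide', 'billing-faq', 'sso-overview']
```

RRF needs only ranks, not comparable scores, which is convenient because BM25 scores and cosine similarities live on different scales.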
3 · The quantitative core — top-k, threshold, and the cost of context
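The single most consequential RAG knob is how many chunks you keep. The book's Vertex AI RAG example exposes exactly two controls for this: SIMILARITY_TOP_K (how many results to return) and VECTOR_DISTANCE_THRESHOLD (the maximum semantic distance to accept). A distance threshold is a ceiling, so a smaller value is stricter; a similarity threshold, as in the cosine examples of this lesson, is a floor, so a larger value is stricter. Together the two controls encode the central tension of retrieval: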
- Keep too few chunks (small top-k, strict threshold) and you miss the passage that held the answer — the book's "information scattered across multiple chunks" failure. Recall drops.
- Keep too many chunks (large top-k, loose threshold) and you drag in irrelevant passages. The book is emphatic that retrieving irrelevant chunks injects noise that distracts the LLM, and you pay for every extra token in latency and money. Precision drops.
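The two bullets above can be made measurable with a small sketch. Everything here is illustrative: the six scored chunks and their `relevant` labels are made up, and the filter uses a similarity floor (`min_similarity`) rather than a distance ceiling, so the comparison would flip for a distance-based threshold.

```python
def select_chunks(scored, top_k, min_similarity):
    """Keep at most top_k chunks, and only those at or above the similarity floor."""
    ranked = sorted(scored, key=lambda c: c["score"], reverse=True)
    return [c for c in ranked[:top_k] if c["score"] >= min_similarity]

def precision_recall(selected, all_chunks):
    """Precision: share of kept chunks that are relevant. Recall: share of relevant chunks kept."""
    relevant_total = sum(c["relevant"] for c in all_chunks)
    relevant_kept = sum(c["relevant"] for c in selected)
    precision = relevant_kept / len(selected) if selected else 0.0
    recall = relevant_kept / relevant_total if relevant_total else 0.0
    return precision, recall

# Hypothetical candidates: similarity score plus a hand-labelled relevance flag.
CHUNKS = [
    {"score": 0.91, "relevant": True},
    {"score": 0.84, "relevant": True},
    {"score": 0.72, "relevant": False},
    {"score": 0.66, "relevant": True},
    {"score": 0.41, "relevant": False},
    {"score": 0.30, "relevant": False},
]

loose = precision_recall(select_chunks(CHUNKS, top_k=6, min_similarity=0.0), CHUNKS)
strict = precision_recall(select_chunks(CHUNKS, top_k=2, min_similarity=0.8), CHUNKS)
print(loose)   # (0.5, 1.0): everything relevant is kept, but half the context is noise
print(strict)  # precision 1.0, recall 2/3: clean context, one relevant chunk missed
```

Tuning these two values on real, labelled queries is the only reliable way to find the point where recall is high enough and noise is low enough.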
To make the trade concrete, picture each candidate chunk as a bar whose height is its similarity to the query. Raising the similarity threshold or lowering top-k tightens the selection, and precision, recall, and the packed token cost all move with it.
Worked cost arithmetic. Say each chunk averages 220 tokens and your model bills (round numbers) about $3 per million input tokens. A lazy "just take top-k = 10" policy packs 10 × 220 = 2,200 evidence tokens per query, costing 2,200 × $3/10⁶ ≈ $0.0066 in retrieved context alone. If only 3 of those chunks were actually relevant, you paid for 7 × 220 = 1,540 noise tokens that also raise the hallucination risk by diluting the signal. Tightening to the 4 chunks that score above a 0.6 similarity threshold cuts context to 4 × 220 = 880 tokens (≈ $0.0026) and gives the model a cleaner, easier prompt. Across a million queries that is the difference between roughly $6,600 and $2,640, and the smaller prompt usually answers better. This is the lesson the book repeats and the one the next lesson on resource optimization formalizes: pack context by decision relevance, not by raw top-k rank.
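The same arithmetic as a reusable helper, for plugging in your own numbers. The function name `context_cost` is illustrative, and the defaults (220 tokens per chunk, $3 per million input tokens) are the round figures from the paragraph above, not real pricing.

```python
def context_cost(chunks, tokens_per_chunk=220, usd_per_million_tokens=3.0):
    """Return (tokens, dollars) spent on retrieved context for one query."""
    tokens = chunks * tokens_per_chunk
    return tokens, tokens * usd_per_million_tokens / 1_000_000

tokens_10, cost_10 = context_cost(10)
tokens_4, cost_4 = context_cost(4)

print(tokens_10, round(cost_10, 4))  # 2200 0.0066
print(tokens_4, round(cost_4, 5))    # 880 0.00264

# Scaled to one million queries.
print(round(cost_10 * 1_000_000), round(cost_4 * 1_000_000))  # 6600 2640
```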
4 · The agentic layer — from passive pipeline to active gatekeeper
Plain RAG retrieves once, blindly, and pours the top-k into the prompt. Agentic RAG is the book's evolution of the pattern: it inserts a reasoning and decision layer so the agent becomes an active gatekeeper and knowledge refiner rather than a passive data pipe. The agent does not accept the first retrieval; it inspects quality, relevance, and completeness, and acts. The chapter gives four concrete jobs, each with its own example — these are worth memorizing because they are the pattern.
| Agentic job | The book's example | What the agent does |
|---|---|---|
| Reflection & source verification | "What is the company's remote-work policy?" retrieves both a 2020 blog post and the 2025 official policy. | Reads document metadata, identifies the 2025 policy as the current authoritative source, discards the stale 2020 blog, and sends only the correct passage to the model. |
| Conflict resolution | "What is the Q1 budget for Project Alpha?" returns a draft proposal saying €50,000 and a final report saying €65,000. | Detects the contradiction, applies a precedence rule (the final financial report outranks the draft), and answers from the most reliable figure. |
| Multi-step decomposition | "How do our product's features and pricing compare to competitor X?" | Splits into sub-queries — own features, own pricing, competitor features, competitor pricing — retrieves each, then synthesizes a structured comparison. |
| Gap detection & tool use | "What was the market's immediate reaction to yesterday's product launch?" — the internal KB (updated weekly) has nothing. | Recognizes the gap, reaches past the static store to call a live web-search API for fresh news and sentiment, then answers from that. |
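The first two rows of the table, source verification and conflict resolution, both reduce to "rank the candidates by authority, then by recency, and keep the winner". A minimal sketch, assuming each retrieved document carries a `type` and optionally a `year` in its metadata; the `PRECEDENCE` table, the field names, and the document ids are hypothetical.

```python
# Hypothetical precedence table: a higher number outranks a lower one.
PRECEDENCE = {"official_policy": 3, "final_report": 3, "draft": 1, "blog": 0}

def resolve(candidates):
    """Return the most authoritative candidate; a newer year breaks ties."""
    return max(
        candidates,
        key=lambda doc: (PRECEDENCE.get(doc["type"], 0), doc.get("year", 0)),
    )

policy_hits = [
    {"id": "blog-2020", "type": "blog", "year": 2020},
    {"id": "policy-2025", "type": "official_policy", "year": 2025},
]
budget_hits = [
    {"id": "alpha-draft", "type": "draft", "amount_eur": 50_000},
    {"id": "alpha-final", "type": "final_report", "amount_eur": 65_000},
]

print(resolve(policy_hits)["id"])          # policy-2025
print(resolve(budget_hits)["amount_eur"])  # 65000
```

Making the rule an explicit table keeps the decision auditable: when the agent prefers one figure over another, it can report which precedence entry decided it.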
This is where the loop closes back onto earlier lessons. The "decide whether to retrieve, judge what came back, retrieve again if needed" structure is the agent control loop from Lesson 01, wired around a retrieval tool (Lesson 07). The four jobs map onto patterns you have already built: reflection (Lesson 06) is the source-verification step; planning/decomposition (Lesson 08) is the multi-step split; and gap-detection-then-tool-call is just goal monitoring (Lesson 13) plus a tool. The book is candid about the cost, though: the agentic layer adds engineering complexity, more compute, and higher latency, and the agent itself becomes a new failure source — flawed reasoning can spin a useless retrieval loop, misjudge a task, or wrongly discard a relevant document. Hence the implementation checklist below insists on a budget and a sufficiency criterion.
4.1 The pattern shape and a sketch
```python
# Sketch only: MAX_ROUNDS, query_plan, goal, and every helper called below
# (hybrid_search, authoritative, recent, resolve_conflicts, sufficient, gaps,
# rewrite_queries, synthesize_with_sources) are placeholders you must supply.
evidence, rounds = [], 0
while rounds < MAX_ROUNDS:                                 # retry budget: no unbounded loops
    rounds += 1
    for query in query_plan:
        hits = hybrid_search(query)                        # BM25 + vector
        evidence += [h for h in hits
                     if authoritative(h) and recent(h)]    # source-verification job
    evidence = resolve_conflicts(evidence)                 # precedence rule: final report > draft
    if sufficient(evidence, goal):                         # explicit stopping criterion
        break
    query_plan = rewrite_queries(gaps(evidence, goal))     # close the gaps
answer = synthesize_with_sources(evidence)                 # every claim carries provenance
```
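To watch the loop sketched above actually terminate, here is a self-contained toy version with stubbed helpers. It is an illustration under loud assumptions, not the book's implementation: `CORPUS`, `search`, `REWRITES`, and the rule "sufficient means every goal topic has at least one authoritative hit" are all invented for the example.

```python
MAX_ROUNDS = 3  # retry budget

# Hypothetical corpus keyed by exact query string; a real system would call hybrid search.
CORPUS = {
    "remote work policy": [
        {"id": "blog-2020", "topic": "policy", "authoritative": False},
        {"id": "policy-2025", "topic": "policy", "authoritative": True},
    ],
    "project alpha q1 budget": [
        {"id": "alpha-final", "topic": "budget", "authoritative": True},
    ],
}

# Hypothetical query rewrites, used when a topic is still uncovered.
REWRITES = {"budget": "project alpha q1 budget"}

def search(query):
    return CORPUS.get(query, [])

def agentic_retrieve(goal_topics, query_plan):
    """Retrieve, filter, check sufficiency, rewrite; stop at MAX_ROUNDS."""
    evidence = {}  # keyed by source id, so repeated hits are not duplicated
    missing = set(goal_topics)
    for round_number in range(1, MAX_ROUNDS + 1):
        for query in query_plan:
            for hit in search(query):
                if hit["authoritative"]:           # source-verification job
                    evidence[hit["id"]] = hit
        missing = set(goal_topics) - {h["topic"] for h in evidence.values()}
        if not missing:                            # explicit sufficiency rule
            return evidence, round_number, missing
        query_plan = [REWRITES[t] for t in sorted(missing) if t in REWRITES]
    return evidence, MAX_ROUNDS, missing           # budget spent: report the gaps

evidence, rounds_used, gaps_left = agentic_retrieve(
    {"policy", "budget"}, ["remote work policy", "alpha budget"]
)
print(sorted(evidence), rounds_used, gaps_left)  # ['alpha-final', 'policy-2025'] 2 set()
```

The first round finds the policy but nothing for the badly phrased budget query; the rewrite closes the gap in round two. If a topic can never be covered, the function returns after `MAX_ROUNDS` with the remaining gaps, so the caller can answer with caveats instead of looping.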
5 · GraphRAG — when the answer lives in the connections
Vector search finds chunks that are individually similar to the query. But some questions are answered only by relationships across documents, such as "how is this executive connected to that acquisition?" or "which gene is implicated in this disease through which pathway?". For those, the book introduces GraphRAG: instead of (or alongside) a vector store, retrieval walks an explicit knowledge graph of entities (nodes) and relationships (edges). By traversing edges, GraphRAG can stitch together facts scattered across many documents, which addresses exactly the "information spread across multiple chunks" weakness of plain vector RAG. The book's use cases are connection-heavy: complex financial analysis, linking companies to market events, and scientific discovery. Its honest drawback: building and maintaining a high-quality graph is expensive, specialized, and less flexible, and traversal can add latency over a plain vector lookup. Reach for it when deep relational insight is the whole point; otherwise vector or hybrid RAG is simpler.
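The "walk the edges" idea can be shown with a breadth-first search over a tiny hand-built graph. This is only a sketch of graph traversal, not a GraphRAG system: the entities, relation names, and the `find_path` helper are invented for illustration, and a real pipeline would first have to extract the graph from documents.

```python
from collections import deque

# Hypothetical knowledge graph: node -> list of (relation, neighbour) edges.
GRAPH = {
    "Executive A": [("board_member_of", "Company B")],
    "Company B": [("acquired", "Startup C")],
    "Startup C": [],
}

def find_path(graph, start, goal):
    """Breadth-first search returning the chain of (source, relation, target) edges."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None  # no connection found

for edge in find_path(GRAPH, "Executive A", "Startup C"):
    print(edge)
# ('Executive A', 'board_member_of', 'Company B')
# ('Company B', 'acquired', 'Startup C')
```

Neither edge alone answers "how is this executive connected to that acquisition?"; the answer is the path, and each edge on it can carry the id of the document it was extracted from.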
6 · Running example — the policy/research assistant
Thread the whole stack through the track's running assistant, now asked: "What is our current remote-work policy, and how does our Q1 Project Alpha budget compare to the original plan?"
- Decide & plan. This needs proprietary, current facts → retrieve. Decompose into two sub-goals: the policy, and the budget comparison.
- Retrieve. Hybrid search the internal corpus. Policy query returns the 2020 blog (cos 0.71) and the 2025 official policy (cos 0.83). Budget query returns the draft proposal (€50,000) and the final report (€65,000).
- Verify & resolve. Metadata says the 2025 doc supersedes the 2020 blog → discard the blog. The €50k/€65k conflict is resolved by precedence: the final financial report wins → use €65,000, note the change from the €50,000 plan.
- Check sufficiency. Both sub-goals are covered by authoritative sources; no gap → stop, do not loop. (Had the launch-reaction question appeared, the gap-detection job would fire a live web search.)
- Synthesize with provenance. "Per the 2025 Remote Work Policy [doc id], …; the Q1 Project Alpha budget is €65,000 per the Q1 Final Report [doc id], up from the €50,000 in the original proposal [doc id]." Every claim is traceable.
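The last step of the walkthrough, synthesis with provenance, can be enforced mechanically: refuse to emit any claim that lacks a source id. A minimal sketch; the claim structure, the `source_id` field, and the ids themselves are assumptions made for the example.

```python
def synthesize_with_sources(claims):
    """Join claims into an answer, appending each claim's source id in brackets."""
    unsourced = [c["text"] for c in claims if not c.get("source_id")]
    if unsourced:
        raise ValueError(f"claims without provenance: {unsourced}")
    return " ".join(f"{c['text']} [{c['source_id']}]." for c in claims)

# Hypothetical source ids standing in for the "[doc id]" placeholders above.
claims = [
    {"text": "The current remote-work policy is the 2025 Remote Work Policy",
     "source_id": "policy-2025"},
    {"text": "The Q1 Project Alpha budget is EUR 65,000",
     "source_id": "q1-final-report"},
    {"text": "The original proposal planned EUR 50,000",
     "source_id": "alpha-draft"},
]

print(synthesize_with_sources(claims))
```

Failing loudly on an unsourced claim turns "lost provenance" from a silent quality problem into an error that is caught before the answer reaches the user.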
6.1 What the book's code shows
The chapter grounds this in three runnable surfaces, all implementing the same retrieve → augment → generate loop. (1) Google ADK with the built-in google_search tool — the simplest RAG: a research-assistant agent that grounds answers in live web results. (2) Google ADK with VertexAiRagMemoryService, pointed at a managed Vertex AI RAG Corpus, exposing exactly the two knobs from Section 3, SIMILARITY_TOP_K = 5 and VECTOR_DISTANCE_THRESHOLD = 0.7. (3) A full LangChain + LangGraph pipeline: load a document, CharacterTextSplitter chunks it (chunk_size 500, overlap 50), OpenAIEmbeddings embeds it into a Weaviate vector store, and a LangGraph StateGraph wires two nodes — retrieve_documents_node then generate_response_node — into the canonical retrieve-then-generate graph, with a prompt template that explicitly tells the model to answer only from the retrieved context and to say "I don't know" otherwise. That last instruction is the cheapest hallucination guard there is.
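Two ideas from the third surface, fixed-size chunks with overlap and a prompt that permits "I don't know", can be illustrated without any framework. This is a plain-Python sketch, not the LangChain `CharacterTextSplitter` (which splits on a separator before merging pieces); `chunk_text`, `GROUNDED_PROMPT`, and `build_prompt` are hypothetical names, and only the 500/50 sizes come from the text above.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end
            break
        start += step
    return chunks

# Prompt wording is an assumption modelled on the behaviour described above.
GROUNDED_PROMPT = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say \"I don't know\".\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def build_prompt(question, chunks):
    """Fill the grounded template with the retrieved chunks."""
    return GROUNDED_PROMPT.format(context="\n---\n".join(chunks), question=question)

sample = "".join(chr(97 + i % 26) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])            # [500, 500, 300]
print(pieces[0][-50:] == pieces[1][:50])   # True: neighbouring chunks overlap
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the price of storing and embedding some text twice.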
Failure modes & checklist
Failure modes
- Top-k dumping. Pouring raw top-k chunks into context with no authority/recency check — the book's "injects noise, distracts the LLM" failure.
- Stale corpus. The KB drifted from the source of truth; the agent confidently cites the 2020 blog. RAG is only as fresh as its sync job.
- Unhandled conflicts. Two sources disagree (€50k vs €65k) and the agent picks one arbitrarily or splices both into nonsense.
- Scattered context. The answer needs facts from three chunks; the retriever returns one. (GraphRAG or query decomposition is the fix.)
- Unbounded retrieval loops. Agentic rewrite-and-retry with no budget or sufficiency criterion — the book's warning that the agent becomes its own failure source.
- Lost provenance. Evidence packed without source ids, so the answer cannot be cited or verified.
Implementation checklist
- Which corpus is authoritative, and how often is it synced?
- Hybrid (BM25 + vector) or single-mode retrieval, and why?
- What are top-k and the distance threshold, and were they tuned on real queries?
- How are queries rewritten / decomposed when a gap is found?
- What metadata (date, author, source id) is preserved through to the answer?
- What counts as enough evidence — the explicit sufficiency rule?
- What is the retry budget (MAX_ROUNDS) that stops loops?
- What is the conflict-precedence policy (e.g. final report > draft, newer > older)?
Where this points next
RAG turned the agent from a closed-book reciter into an open-book reasoner that gathers, judges, and cites external evidence. But notice what the gap-detection example quietly assumed: sometimes the knowledge the agent needs is not in a document at all — it lives in another agent that owns a capability or a data source. Reaching that knowledge is not a vector search; it is a conversation with a remote peer that has its own identity, capabilities, and task lifecycle. Lesson 17 — A2A: agent-to-agent communication builds that protocol: how a remote agent advertises what it can do, how a task is dispatched and tracked across the boundary, and how messages stay identifiable and traceable. Retrieval grounded the agent in documents; A2A grounds it in a network of other agents.
Interview prompts
- What problem does RAG solve, and what are its two core steps? (§1 — a closed-book LLM is frozen, blind to private/current data, and hallucinates; RAG retrieves relevant chunks and augments the prompt before generation, grounding the answer in verifiable, citable evidence.)
- Why use semantic (vector) search instead of keyword search — and when is keyword search actually better? (§2 — embeddings put meaning in geometric space so "furry feline companion" matches "cat" despite zero word overlap; but BM25/keyword wins for exact terms like error codes and SKUs, so hybrid retrieval fuses both.)
- How is similarity between a query and a chunk actually scored? (§2.1: cosine similarity, cos θ = a·b / (‖a‖‖b‖), the cosine of the angle between the two embedding vectors, ranging from −1 to +1; chunks are ranked by it and the top-k kept.)
- You set top-k = 10 and accuracy got worse and slower. Why, and what do you do? (§3 — large top-k injects irrelevant chunks that distract the model and cost tokens/latency; tighten top-k and the distance threshold, pack by decision relevance not raw rank.)
- What does the "agentic" layer add over plain RAG? (§4 — an active gatekeeper that verifies source authority/recency, resolves conflicting sources by precedence, decomposes multi-step questions, and detects gaps to call external tools — under a retry budget and sufficiency criterion.)
- Two retrieved documents disagree on a number. How should an agentic RAG system respond? (§4 — detect the conflict and apply a precedence rule, e.g. the final financial report (€65k) over the draft (€50k), rather than averaging or picking arbitrarily; report the basis.)
- When would you reach for GraphRAG over vector RAG? (§5 — when the answer depends on relationships across documents (financial linkages, gene–disease pathways) that vector proximity misses; accept the higher build/maintenance cost and latency.)
- How do you stop an agentic retrieval loop from running forever? (§4.1 — an explicit sufficiency rule plus a MAX_ROUNDS retry budget; if gaps remain and the budget is exhausted, answer with caveats rather than looping.)