
Part VI - Knowledge, communication, and optimization

RAG and agentic retrieval - open-book reasoning

Until now the agent has reasoned from what the model already knows plus whatever we paste into the prompt. That knowledge is frozen at training time, blind to your private documents, and silently confident when it is wrong. This lesson gives the agent a library card: a way to look things up before it answers, and — the agentic twist — a way to judge what it found, notice gaps, resolve contradictions, and decide when it has read enough.

Book source
Chapter 14 — Knowledge Retrieval / RAG (知识检索); PDF outline pages 153-163. The chapter develops embeddings, semantic similarity, chunking, vector databases, BM25/hybrid retrieval, GraphRAG, and Agentic RAG, with worked examples in Google ADK (Google Search tool, Vertex AI RAG Corpus) and a full LangChain + LangGraph pipeline.
Linear position
Prerequisite: Lesson 07 (Tool use — function calling as controlled action) and Lesson 14 (Exception handling and recovery). Retrieval is just a tool call whose output is evidence, and a query that returns nothing is a failure the agent must recover from, not crash on.
New capability: External, verifiable evidence acquisition with provenance, sufficiency checks, and conflict resolution — turning a closed-book model into an open-book reasoner.
The plan
Five moves. (1) Name the disease — why a frozen, closed-book model hallucinates — and state the cure: retrieve, augment, generate. (2) Build the retrieval stack from the bottom up: chunk → embed → index → search → rerank → pack, with the embedding/semantic-distance idea made concrete with real vectors. (3) Get quantitative: a worked example of cosine similarity, a top-k vs. similarity-threshold tradeoff, and the token/latency cost of stuffing context. (4) Add the agentic reasoning layer — the four jobs the book gives the gatekeeper agent (verify sources, resolve conflicts, decompose multi-step questions, detect gaps and reach for tools). (5) Touch GraphRAG for connection-heavy questions, then thread it all through a running policy/research assistant. We close on the hand-off into agent-to-agent communication.

1 · The disease and the cure

A large language model is a closed-book exam taker. Everything it "knows" was compressed into its weights at training time. That gives it three permanent handicaps the book is blunt about: it cannot see information created after its cutoff, it cannot see your private corpus (the company wiki, the ticket history, the contract PDFs), and — most dangerous — when it does not know, it does not reliably say so. It interpolates a fluent, plausible, wrong answer. We call that a hallucination: text that is grammatically and stylistically perfect and factually fabricated.

Retrieval-Augmented Generation (RAG) is the cure, and the mental model is exactly the one in the book: stop forcing the model to recite from memory, and let it look things up first — like a human consulting a reference book or searching the web before answering. Concretely, when a query arrives it does not go straight to the model. First the system searches an external knowledge base for the most relevant passages; then it augments the prompt by pasting those passages in as context; only then does the enriched prompt reach the model, which now generates an answer grounded in evidence it can quote.

CLOSED-BOOK (plain LLM)
  query ──────────────────────────────────▶ LLM ──▶ answer
                                                    (recited from frozen weights; may hallucinate)

OPEN-BOOK (RAG)
  query ──▶ RETRIEVE relevant chunks ──▶ AUGMENT prompt ──▶ LLM ──▶ answer + citations
                     │                         │                     (grounded, verifiable, up to date)
                external KB             query + evidence
        (docs / DB / web / vectors)

The book lists the payoff precisely: RAG breaks the static-training-data ceiling (fresh and proprietary facts become reachable), it reduces hallucination by anchoring generation to verifiable data, and — the feature that makes it trustworthy in production — it can attach citations. The answer now points at its sources, so a human can check it. That last property is why, for an agent, retrieved text should always be treated as evidence with provenance, never as free-floating truth.

2 · The retrieval stack, bottom up

"Search the knowledge base" hides a pipeline. The book builds it from the ground up, and so will we. Six stages: chunk, embed, index, search, rerank/filter, pack.

chunk
Split big documents into small passages (a section, a paragraph, a few sentences). You cannot feed a 50-page manual to the model, and you would not want to: the goal is to retrieve the "Troubleshooting" passage, not the whole book. Chunking decides what unit can be retrieved, so it directly controls whether context survives.
embed
Turn each chunk into an embedding: a vector of numbers that places its meaning in a geometric space. Texts that mean similar things land near each other.
index
Store those vectors in a vector database built for fast nearest-neighbor lookup over millions of high-dimensional vectors.
search
Embed the user's query the same way, then find the chunks whose vectors are closest in meaning: semantic search, not keyword matching.
rerank / filter
Trim the raw hits by relevance, source authority, and recency; optionally re-score with a stronger model. Top-k rank is not the same as decision relevance.
pack
Concatenate the surviving passages into the prompt as context, keeping each chunk's source metadata so the answer can cite it.
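To make the chunk stage concrete, here is a minimal sketch of a fixed-size character splitter with overlap. It is a plain sliding window, not LangChain's CharacterTextSplitter; the function name `chunk_text` is an assumption, and the default sizes simply mirror the 500/50 values quoted from the chapter's LangChain example later in this lesson. Real splitters usually prefer sentence or section boundaries.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters; neighbours share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 1,200-character document made of repeating digits.
manual = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(manual, chunk_size=500, overlap=50)
print([len(c) for c in chunks])
```

The overlap means a sentence cut at a window edge still appears whole in one of the two neighbouring chunks, at the price of storing and embedding some text twice.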

2.1 Embeddings and semantic distance — the geometry of meaning

Everything rests on embeddings, so we make them concrete with the book's own toy example. Imagine a tiny 2-dimensional meaning-space (real ones have hundreds to thousands of dimensions, but 2D is enough to see the idea). The book places words like this: "cat" ≈ (2, 3), "kitten" ≈ (2.1, 3.1), and "car" ≈ (8, 1). Cat and kitten sit almost on top of each other; car is far away. That spatial closeness is semantic similarity. The magic this buys you: the query "a furry feline companion" and the document phrase "a domestic cat" share no words, yet their embeddings are near neighbours — so semantic search finds the right passage where keyword search would find nothing.

How "near" is measured matters. The standard score is cosine similarity — the cosine of the angle between two vectors, which ignores their length and looks only at direction (meaning):

cos(θ) = (a · b) / (‖a‖ · ‖b‖) ,   ranging from −1 (opposite) through 0 (unrelated) to +1 (identical meaning).

Worked number. Take the book's vectors. For a = "cat" = (2, 3) and b = "kitten" = (2.1, 3.1): the dot product is 2·2.1 + 3·3.1 = 4.2 + 9.3 = 13.5; the norms are ‖a‖ = √(4+9) = √13 ≈ 3.606 and ‖b‖ = √(4.41+9.61) = √14.02 ≈ 3.744; so cos = 13.5 / (√13 · √14.02) ≈ 13.5 / 13.5004 ≈ 0.99997, almost exactly the same direction and therefore near-identical meaning. Now a = "cat" vs. c = "car" = (8, 1): dot product 2·8 + 3·1 = 19; ‖c‖ = √65 ≈ 8.062; cos = 19 / (3.606·8.062) = 19 / 29.07 ≈ 0.654, clearly less related. A retriever ranks candidate chunks by exactly this number and keeps the top of the list.
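The worked numbers can be checked in a few lines of standard-library Python. This is only a sketch of the formula above applied to the book's toy vectors; the function name `cosine_similarity` is an assumption, and production systems would normally use a vectorised library instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


cat, kitten, car = (2, 3), (2.1, 3.1), (8, 1)
sim_kitten = cosine_similarity(cat, kitten)   # about 0.99997
sim_car = cosine_similarity(cat, car)         # about 0.654
print(round(sim_kitten, 5), round(sim_car, 3))
```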

2.2 Vector databases, and the keyword retrieval they complement

Storing and querying millions of these vectors fast is its own engineering problem, which is what a vector database solves. The book names the field directly: Pinecone, Weaviate, Chroma, Milvus, Qdrant, and notes that general stores like Redis, Elasticsearch, and Postgres (via the pgvector extension) can do it too. Under the hood, exact nearest-neighbor over millions of high-dimensional vectors is too slow, so these systems use approximate nearest-neighbor (ANN) algorithms — the book cites HNSW (a navigable small-world graph you can walk toward the query) — and lower-level libraries such as FAISS and ScaNN. ANN trades a tiny chance of missing the true closest vector for an enormous speedup.

Crucially, the book does not declare semantic search universally superior. It contrasts two retrieval styles and recommends combining them:

BM25 (keyword)
Ranks by exact word overlap: term frequency, weighted by how rare the term is across the corpus and normalized for document length. Wins when the exact term matters: error codes, product SKUs, legal clause numbers, rare proper nouns. Blind to meaning: "furry feline companion" never matches "cat".
Vector (semantic)
Ranks by embedding distance. Wins when meaning matters and wording differs. Can drift toward "topically near but literally wrong" passages.
Hybrid
Run both, fuse the scores. The book's recommendation: combine BM25's precision on exact terms with semantic search's recall on paraphrase for the most robust, accurate retrieval.
What makes or breaks it
The book is explicit: chunking strategy, metadata quality, recency, and corpus synchronization decide whether RAG grounds the answer or injects noise.
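The hybrid entry above says to fuse the two result lists but does not say how. One widely used option is reciprocal rank fusion (RRF), which needs only ranks, so BM25 scores and cosine scores never have to share a scale. The sketch below shows that swapped-in technique, not a method taken from the book; the document ids are invented and k = 60 is just a commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists, best hit first.
bm25_hits = ["err-E42-runbook", "cat-care-faq", "release-notes"]
vector_hits = ["cat-care-faq", "feline-companion-guide", "err-E42-runbook"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

A document that both retrievers rank reasonably well ("cat-care-faq") ends up ahead of one that only a single retriever ranks first, which is the behaviour a hybrid setup is after.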

3 · The quantitative core — top-k, threshold, and the cost of context

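The single most consequential RAG knob is how many chunks you keep. The book's Vertex AI RAG example exposes exactly two controls for this: SIMILARITY_TOP_K (how many results to return) and VECTOR_DISTANCE_THRESHOLD (the maximum semantic distance to accept). They encode the central tension of retrieval: set them too tight and relevant chunks are cut, so recall drops; set them too loose and noise floods the context, so precision drops and token cost rises.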

The interactive below makes this trade visible. Each bar is a candidate chunk scored by similarity to the query; raise the threshold or lower top-k to tighten, and watch precision, recall, and the packed token cost move.

Retrieval tuning — top-k and similarity threshold vs. precision, recall, and token cost
Ten candidate chunks, sorted by cosine similarity to the query. Green = truly relevant to the answer, red = noise. A chunk is packed into context only if it clears the threshold and sits within the top-k. The goal is to capture the green chunks while excluding the red — the book's "ground, don't inject noise" rule.
Readouts: chunks packed, precision, recall, context tokens. The core JS:
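// Each chunk: { sim, relevant, tokens }. chunks must already be sorted by sim, descending.
function packStats(chunks, threshold, topK) {
  const kept = chunks
    .filter(c => c.sim >= threshold)   // similarity floor; the demo's stand-in for VECTOR_DISTANCE_THRESHOLD, which caps distance instead
    .slice(0, topK);                   // SIMILARITY_TOP_K

  const totalRelevant = chunks.filter(c => c.relevant).length;
  const tp = kept.filter(c => c.relevant).length;            // relevant chunks that made it into context
  const precision = kept.length ? tp / kept.length : 0;      // how clean is context?
  const recall = totalRelevant ? tp / totalRelevant : 0;     // did we get the answer?
  const tokens = kept.reduce((s, c) => s + c.tokens, 0);     // what we pay to pack
  return { kept, precision, recall, tokens };
}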

Worked cost arithmetic. Say each chunk averages 220 tokens and your model bills (round numbers) about $3 per million input tokens. A lazy "just take top-k = 10" policy packs 10 × 220 = 2,200 evidence tokens per query, costing 2,200 × $3/10⁶ ≈ $0.0066 in retrieved context alone. If only 3 of those chunks were actually relevant, you paid for 7 × 220 = 1,540 noise tokens, which can also raise the hallucination risk by diluting the signal. Tightening to the 4 chunks above a 0.6 threshold cuts context to 4 × 220 = 880 tokens (≈ $0.0026) and gives the model a cleaner, easier prompt. Across a million queries that is the difference between roughly $6,600 and $2,600, and the smaller prompt usually answers better. This is the lesson the book repeats and the one a later lesson on resource optimization formalizes: pack context by decision relevance, not by raw top-k rank.
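The arithmetic is easy to re-run with different assumptions. The sketch below hard-codes the same round numbers used in the text (220 tokens per chunk, $3 per million input tokens); the function name `context_cost` is illustrative, and real prices vary by model.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00   # dollars; the round number assumed in the text
TOKENS_PER_CHUNK = 220                  # average chunk size assumed in the text


def context_cost(num_chunks: int) -> tuple[int, float]:
    """Return (tokens, dollars per query) for packing num_chunks chunks."""
    tokens = num_chunks * TOKENS_PER_CHUNK
    return tokens, tokens * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000


lazy_tokens, lazy_cost = context_cost(10)     # top-k = 10, no threshold
tight_tokens, tight_cost = context_cost(4)    # the 4 chunks that clear the threshold
noise_tokens = (10 - 3) * TOKENS_PER_CHUNK    # 7 of the 10 lazy chunks are noise
saving_per_million_queries = (lazy_cost - tight_cost) * 1_000_000

print(lazy_tokens, tight_tokens, noise_tokens, round(saving_per_million_queries))
```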

4 · The agentic layer — from passive pipeline to active gatekeeper

Plain RAG retrieves once, blindly, and pours the top-k into the prompt. Agentic RAG is the book's evolution of the pattern: it inserts a reasoning and decision layer so the agent becomes an active gatekeeper and knowledge refiner rather than a passive data pipe. The agent does not accept the first retrieval; it inspects quality, relevance, and completeness, and acts. The chapter gives four concrete jobs, each with its own example — these are worth memorizing because they are the pattern.

Agentic job | The book's example | What the agent does
Reflection & source verification | "What is the company's remote-work policy?" retrieves both a 2020 blog post and the 2025 official policy. | Reads document metadata, identifies the 2025 policy as the current authoritative source, discards the stale 2020 blog, and sends only the correct passage to the model.
Conflict resolution | "What is the Q1 budget for Project Alpha?" returns a draft proposal saying €50,000 and a final report saying €65,000. | Detects the contradiction, applies a precedence rule (the final financial report outranks the draft), and answers from the most reliable figure.
Multi-step decomposition | "How do our product's features and pricing compare to competitor X?" | Splits into sub-queries (own features, own pricing, competitor features, competitor pricing), retrieves each, then synthesizes a structured comparison.
Gap detection & tool use | "What was the market's immediate reaction to yesterday's product launch?" The internal KB (updated weekly) has nothing. | Recognizes the gap, reaches past the static store to call a live web-search API for fresh news and sentiment, then answers from that.

This is where the loop closes back onto earlier lessons. The "decide whether to retrieve, judge what came back, retrieve again if needed" structure is the agent control loop from Lesson 01, wired around a retrieval tool (Lesson 07). The four jobs map onto patterns you have already built: reflection (Lesson 06) is the source-verification step; planning/decomposition (Lesson 08) is the multi-step split; and gap-detection-then-tool-call is just goal monitoring (Lesson 13) plus a tool. The book is candid about the cost, though: the agentic layer adds engineering complexity, more compute, and higher latency, and the agent itself becomes a new failure source — flawed reasoning can spin a useless retrieval loop, misjudge a task, or wrongly discard a relevant document. Hence the implementation checklist below insists on a budget and a sufficiency criterion.
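The first two jobs, source verification and conflict resolution, both reduce to a precedence rule over document metadata. The sketch below is one way to encode that, assuming each hit carries a `doc_type` and a `year`; the field names, the `PRECEDENCE` ranking, the doc ids, and the placeholder texts are all hypothetical, and a production system would likely need a richer authority model.

```python
from dataclasses import dataclass

# Higher number wins. This ordering is an assumed policy for illustration only.
PRECEDENCE = {"official_policy": 3, "final_report": 3, "draft": 2, "blog": 1}


@dataclass(frozen=True)
class Hit:
    doc_id: str
    doc_type: str
    year: int
    text: str


def most_authoritative(hits: list[Hit]) -> Hit:
    """Pick one hit: highest precedence first, newest year as the tie-breaker."""
    if not hits:
        raise ValueError("no hits to choose from")
    return max(hits, key=lambda h: (PRECEDENCE.get(h.doc_type, 0), h.year))


policy_hits = [
    Hit("blog-2020", "blog", 2020, "(placeholder: 2020 blog text)"),
    Hit("policy-2025", "official_policy", 2025, "(placeholder: 2025 policy text)"),
]
budget_hits = [
    Hit("alpha-draft", "draft", 2025, "Q1 budget: EUR 50,000"),
    Hit("alpha-final", "final_report", 2025, "Q1 budget: EUR 65,000"),
]
print(most_authoritative(policy_hits).doc_id)   # policy-2025
print(most_authoritative(budget_hits).doc_id)   # alpha-final
```

Keeping the rule in a small, explicit table makes the precedence policy auditable, which matters because a wrong discard is one of the failure sources the book warns about.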

4.1 The pattern shape and a sketch

  1. Decide whether external evidence is even needed (small talk and arithmetic do not warrant a retrieval round-trip).
  2. Plan queries and retrieve candidate chunks (hybrid: BM25 + semantic).
  3. Assess relevance, authority, recency, and conflicts; discard the stale and the noise.
  4. Check sufficiency. If gaps remain and the budget allows, rewrite the queries (or call a tool) and loop; otherwise stop.
  5. Pack the compact, relevant evidence into context and synthesize an answer with citations.
In code, the loop looks roughly like this (a sketch: every helper function is a placeholder, not a real API):

MAX_ROUNDS = 3                             # retry budget; the value is illustrative
query_plan = plan_queries(goal)            # step 2: initial queries, one per sub-goal
evidence, rounds = [], 0
while rounds < MAX_ROUNDS:                 # retry budget: no unbounded loops
    rounds += 1
    for query in query_plan:
        hits = hybrid_search(query)        # BM25 + vector
        evidence += [h for h in hits
                     if authoritative(h) and recent(h)]   # source-verification job
    evidence = resolve_conflicts(evidence) # precedence rule: final report > draft
    if sufficient(evidence, goal):         # explicit stopping criterion
        break
    query_plan = rewrite_queries(gaps(evidence, goal))     # close the gaps
answer = synthesize_with_sources(evidence) # every claim carries provenance
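The same control flow can be made runnable against a toy in-memory corpus. Everything here is a stand-in chosen for illustration: the corpus contents, the word-overlap `search`, the `rewrites` table that plays the role of query rewriting, and the function name `agentic_retrieve`. It only demonstrates the two safeguards, a retry budget and an explicit sufficiency check.

```python
MAX_ROUNDS = 3  # retry budget

# Toy corpus: doc id -> text. Contents are placeholders.
CORPUS = {
    "policy-2025": "remote work policy 2025 official",
    "alpha-final": "project alpha q1 budget final report",
}


def search(query: str) -> list[str]:
    """Stand-in retriever: ids of docs that share at least one word with the query."""
    words = set(query.lower().split())
    return [doc_id for doc_id, text in CORPUS.items() if words & set(text.split())]


def agentic_retrieve(sub_goals: dict[str, str], rewrites: dict[str, str]):
    """sub_goals: name -> first query. rewrites: name -> fallback query for a gap."""
    evidence: dict[str, list[str]] = {name: [] for name in sub_goals}
    query_plan = dict(sub_goals)
    rounds = 0
    while rounds < MAX_ROUNDS and query_plan:
        rounds += 1
        for name, query in query_plan.items():
            evidence[name] = search(query)
        # Sufficiency check: a sub-goal with no hits is a gap.
        gaps = [name for name, hits in evidence.items() if not hits]
        # Retry only the gaps, and only where a rewritten query exists.
        query_plan = {name: rewrites[name] for name in gaps if name in rewrites}
    return evidence, rounds


evidence, rounds = agentic_retrieve(
    sub_goals={"policy": "wfh rules", "budget": "alpha budget"},
    rewrites={"policy": "remote work policy"},
)
print(evidence, rounds)
```

Here the first policy query misses, the rewritten query hits in round two, and the loop stops because no gaps remain. A sub-goal that never finds evidence exhausts the budget and comes back empty, which is the point where the gap-detection job would reach for a live tool.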

5 · GraphRAG — when the answer lives in the connections

Vector search finds chunks that are individually similar to the query. But some questions are answered only by relationships across documents — "how is this executive connected to that acquisition?", "which gene is implicated in this disease through which pathway?". For those, the book introduces GraphRAG: instead of (or alongside) a vector store, retrieval walks an explicit knowledge graph of entities (nodes) and relationships (edges). By traversing edges, GraphRAG can stitch together facts scattered across many documents — the exact "information spread across multiple chunks" weakness of plain vector RAG. The book's use cases are connection-heavy: complex financial analysis, linking companies to market events, and scientific discovery. Its honest drawback: building and maintaining a high-quality graph is expensive, specialized, and less flexible, and traversal can add latency over a plain vector lookup. Reach for it when deep relational insight is the whole point; otherwise vector or hybrid RAG is simpler.
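The core move in GraphRAG retrieval, walking edges to connect entities that never co-occur in one chunk, can be sketched with a breadth-first search. The graph below is invented for illustration (all entity and relation names are hypothetical), and real GraphRAG systems add entity extraction, graph storage, and ranking on top of a traversal like this.

```python
from collections import deque

# Toy knowledge graph: entity -> list of (relation, entity). All names are invented.
GRAPH = {
    "Exec A": [("board_member_of", "Company X")],
    "Company X": [("acquired", "Startup Y")],
    "Startup Y": [("founded_by", "Founder Z")],
}


def find_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of (source, relation, target) edges, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None


path = find_path(GRAPH, "Exec A", "Startup Y")
print(path)
```

Each edge in the returned path can carry the id of the document it was extracted from, so the stitched answer keeps the same provenance guarantees as chunk-level retrieval.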

6 · Running example — the policy/research assistant

Thread the whole stack through the track's running assistant, now asked: "What is our current remote-work policy, and how does our Q1 Project Alpha budget compare to the original plan?"

  1. Decide & plan. This needs proprietary, current facts → retrieve. Decompose into two sub-goals: the policy, and the budget comparison.
  2. Retrieve. Hybrid search the internal corpus. Policy query returns the 2020 blog (cos 0.71) and the 2025 official policy (cos 0.83). Budget query returns the draft proposal (€50,000) and the final report (€65,000).
  3. Verify & resolve. Metadata says the 2025 doc supersedes the 2020 blog → discard the blog. The €50k/€65k conflict is resolved by precedence: the final financial report wins → use €65,000, note the change from the €50,000 plan.
  4. Check sufficiency. Both sub-goals are covered by authoritative sources; no gap → stop, do not loop. (Had the launch-reaction question appeared, the gap-detection job would fire a live web search.)
  5. Synthesize with provenance. "Per the 2025 Remote Work Policy [doc id], …; the Q1 Project Alpha budget is €65,000 per the Q1 Final Report [doc id], up from the €50,000 in the original proposal [doc id]." Every claim is traceable.

6.1 What the book's code shows

The chapter grounds this in three runnable surfaces, all implementing the same retrieve → augment → generate loop. (1) Google ADK with the built-in google_search tool — the simplest RAG: a research-assistant agent that grounds answers in live web results. (2) Google ADK with VertexAiRagMemoryService, pointed at a managed Vertex AI RAG Corpus, exposing exactly the two knobs from Section 3, SIMILARITY_TOP_K = 5 and VECTOR_DISTANCE_THRESHOLD = 0.7. (3) A full LangChain + LangGraph pipeline: load a document, CharacterTextSplitter chunks it (chunk_size 500, overlap 50), OpenAIEmbeddings embeds it into a Weaviate vector store, and a LangGraph StateGraph wires two nodes — retrieve_documents_node then generate_response_node — into the canonical retrieve-then-generate graph, with a prompt template that explicitly tells the model to answer only from the retrieved context and to say "I don't know" otherwise. That last instruction is the cheapest hallucination guard there is.
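The "answer only from the retrieved context" guard is just prompt text, so it can be shown without any framework. The sketch below is not the book's template: the wording, the function name `build_grounded_prompt`, and the `source_id`/`text` field names are assumptions made for illustration.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Pack retrieved chunks, each tagged with its source id, under an answer-only-from-context rule."""
    context = "\n".join(f"[{c['source_id']}] {c['text']}" for c in chunks)
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed source id for every claim. "
        'If the context does not contain the answer, reply "I don\'t know".\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "What is the Q1 budget for Project Alpha?",
    [{"source_id": "alpha-final", "text": "Q1 budget for Project Alpha: EUR 65,000."}],
)
print(prompt)
```

Tagging each chunk with its source id at pack time is what lets the final answer cite documents rather than paraphrase them anonymously.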

Failure modes & checklist

Failure modes

  • Top-k dumping. Pouring raw top-k chunks into context with no authority/recency check — the book's "injects noise, distracts the LLM" failure.
  • Stale corpus. The KB drifted from the source of truth; the agent confidently cites the 2020 blog. RAG is only as fresh as its sync job.
  • Unhandled conflicts. Two sources disagree (€50k vs €65k) and the agent picks one arbitrarily or splices both into nonsense.
  • Scattered context. The answer needs facts from three chunks; the retriever returns one. (GraphRAG or query decomposition is the fix.)
  • Unbounded retrieval loops. Agentic rewrite-and-retry with no budget or sufficiency criterion — the book's warning that the agent becomes its own failure source.
  • Lost provenance. Evidence packed without source ids, so the answer cannot be cited or verified.

Implementation checklist

  • Which corpus is authoritative, and how often is it synced?
  • Hybrid (BM25 + vector) or single-mode retrieval, and why?
  • What are top-k and the distance threshold, and were they tuned on real queries?
  • How are queries rewritten / decomposed when a gap is found?
  • What metadata (date, author, source id) is preserved through to the answer?
  • What counts as enough evidence — the explicit sufficiency rule?
  • What is the retry budget (MAX_ROUNDS) that stops loops?
  • What is the conflict-precedence policy (e.g. final report > draft, newer > older)?

Where this points next

RAG turned the agent from a closed-book reciter into an open-book reasoner that gathers, judges, and cites external evidence. But notice what the gap-detection example quietly assumed: sometimes the knowledge the agent needs is not in a document at all — it lives in another agent that owns a capability or a data source. Reaching that knowledge is not a vector search; it is a conversation with a remote peer that has its own identity, capabilities, and task lifecycle. Lesson 17 — A2A: agent-to-agent communication builds that protocol: how a remote agent advertises what it can do, how a task is dispatched and tracked across the boundary, and how messages stay identifiable and traceable. Retrieval grounded the agent in documents; A2A grounds it in a network of other agents.

Takeaway
A plain LLM is a closed-book exam taker — frozen, blind to private data, and confidently wrong when it does not know. RAG makes it open-book: retrieve relevant chunks from an external store, augment the prompt, generate a grounded, citable answer. The stack is chunk → embed → index → search → rerank → pack, with meaning measured by cosine similarity over embeddings, stored in a vector DB (Pinecone/Weaviate/Chroma/Milvus/Qdrant, ANN via HNSW/FAISS), and best retrieved hybrid (BM25 + semantic). The central knob is top-k × threshold: too tight kills recall, too loose injects noise and tokens — pack by decision relevance, not raw rank. Agentic RAG adds a gatekeeper that verifies sources, resolves conflicts (€65k final > €50k draft), decomposes multi-step questions, and detects gaps to call tools — all under a retry budget and an explicit sufficiency rule. GraphRAG handles questions whose answer lives in the connections between documents. Always treat retrieved text as evidence with provenance.

Interview prompts