2026-02-24
Standalone LLM = \(y = \pi(x)\) – a text-only policy with no access to the world
Systems add tools for richer actions and observations. The LLM is a controller that selects which tool to call, with what arguments, and how to interpret the result.
Tool-selection example – which tool should the controller invoke?
| User query | Best tool | Decision criterion |
|---|---|---|
| "What was Apple's Q3 2025 revenue?" | search_corpus | Needs factual grounding from external docs |
| "What is 15% of $2.3B?" | calculator | Deterministic arithmetic -- don't trust LLM math |
| "File a support ticket for order #4521" | api_write | Side-effectful action -- requires approval gate |
| "Explain the concept of inflation" | none | Parametric knowledge sufficient -- no tool needed |
Today’s arc: single-shot RAG → adaptive retrieval → planner/executor/verifier systems
π_text(x) → tokens
π_action(x) → retrieve → generate
π_action(x, history) → tool → observe → update
π_action(x, history, tools, memory) under compounding error
The key constraint: If each stage is 90% accurate, an \(n\)-step pipeline has \(0.9^n\) reliability, so a 4-step pipeline succeeds only about 66% of the time. Everything in this lecture is about managing this decay.
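This decay is easy to compute directly; a minimal sketch, assuming each step succeeds independently with probability p (an idealized model, not a measured result):

```python
# End-to-end reliability of an n-step pipeline under independent
# per-step success probability p.
def pipeline_reliability(p: float, n: int) -> float:
    return p ** n

for p, n in [(0.9, 4), (0.9, 5), (0.95, 5)]:
    print(f"p={p}, n={n}: {pipeline_reliability(p, n):.1%}")
```

Raising per-step accuracy from 90% to 95% buys back more reliability than removing a single step, which is why the later sections focus on validation and verification at every stage.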
RAG is tool-augmented generation with a retrieval tool.
Pretraining captures a static snapshot. Three failure modes motivate external tools: hallucination, knowledge cutoff, and domain bias.
Enterprise RAG deployments report 15–40% hallucination rates without retrieval grounding.
Lewis et al. (2020): retrieval as a latent variable — the generator marginalizes over retrieved documents.
\[ P(y \mid x) = \sum_{z \in \text{top-}k} P(z \mid x) \cdot P(y \mid x, z) \]
In practice: most production systems skip full marginalization — they concatenate top-k docs into the prompt and let the LLM attend selectively.
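The difference can be sketched as follows; `p_y_given` is a hypothetical stand-in for the generator's conditional probability \(P(y \mid x, z)\), and the numbers are toy values:

```python
# Lewis et al. (2020) marginalization: P(y|x) = sum_z P(z|x) * P(y|x,z).
# docs_with_probs pairs each retrieved doc z with its retrieval prob P(z|x).
def p_answer_marginalized(y, x, docs_with_probs, p_y_given):
    return sum(p_z * p_y_given(y, x, z) for z, p_z in docs_with_probs)

# Toy example: two retrieved docs, one strongly supporting the answer.
docs = [("doc-a", 0.6), ("doc-b", 0.4)]
prob = p_answer_marginalized("y", "x", docs,
                             lambda y, x, z: 0.9 if z == "doc-a" else 0.5)
# 0.6 * 0.9 + 0.4 * 0.5 = 0.74
```

The concatenation shortcut replaces this explicit sum with a single generator call on all top-k docs at once, trading the mixture for one forward pass.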
Tool call (action):
Validation + retry: LLM generates tool call → validate against schema → if invalid, return structured error → LLM reformulates (max 2 retries, then fail gracefully). The next slide puts this pattern into practice.
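A minimal sketch of that loop, with a hand-rolled required/type check standing in for full JSON Schema validation and `call_llm` as a hypothetical model call:

```python
SCHEMA = {"query": str, "top_k": int}   # required fields and their types

def validate(args: dict) -> list[str]:
    """Return a list of structured errors; empty means the call is valid."""
    errors = []
    for field, typ in SCHEMA.items():
        if field not in args:
            errors.append(f"missing required field: {field}")
        elif not isinstance(args[field], typ):
            errors.append(f"{field} must be {typ.__name__}")
    return errors

def call_with_retries(call_llm, max_retries: int = 2):
    feedback = None
    for _ in range(max_retries + 1):
        args = call_llm(feedback=feedback)  # structured errors go back to the LLM
        feedback = validate(args)
        if not feedback:
            return args                     # valid, typed tool call
    return None                             # fail gracefully after retries
```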
Exercise (2 min)
Scenario: You’re building a tool book_restaurant(name, party_size, date, time).
- What type and range constraints belong on party_size? What's required vs. optional?
- Given the call {"name": "", "party_size": -3, "date": "yesterday"} — which fields fail validation? What should the retry behavior be?

Key takeaway: Schema-first → validate → retry → typed result. This pattern applies to every tool.
Bi-encoder architecture: encode queries and documents independently into a shared embedding space:
\[ \mathbf{q} = f_\theta(\text{query}), \quad \mathbf{d} = g_\phi(\text{document}), \quad \text{sim}(\mathbf{q}, \mathbf{d}) = \mathbf{q}^\top \mathbf{d} \]
Training uses contrastive loss — push matching \((q, d^+)\) pairs together, push non-matching \((q, d^-)\) pairs apart:
\[ \mathcal{L} = -\log \frac{e^{\text{sim}(q, d^+)}}{e^{\text{sim}(q, d^+)} + \sum_{j=1}^{n} e^{\text{sim}(q, d_j^-)}} \]
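The loss above can be evaluated directly on toy similarity scores, which also shows why hard negatives matter (the scores here are made up, not real embeddings):

```python
import math

# InfoNCE contrastive loss for one (q, d+) pair against a set of negatives.
def contrastive_loss(sim_pos: float, sim_negs: list[float]) -> float:
    denom = math.exp(sim_pos) + sum(math.exp(s) for s in sim_negs)
    return -math.log(math.exp(sim_pos) / denom)

easy = contrastive_loss(5.0, [1.0, 0.5])  # negatives far from the positive
hard = contrastive_loss(5.0, [4.8, 0.5])  # one negative close to the positive
# hard > easy: only near-miss negatives produce a meaningful training signal
```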
Hard negative mining matters more than architecture. DPR (Karpukhin et al., 2020) showed that using BM25-retrieved but irrelevant documents as hard negatives dramatically improves training — easy negatives don’t teach the model to distinguish semantic similarity from surface overlap.
| | Bi-encoder | Cross-encoder |
|---|---|---|
| Scores | \(f(q) \cdot g(d)\) independently | \(h(q, d)\) jointly (full attention) |
| Latency | ~1ms per query (precomputed docs) | ~50ms per (q,d) pair |
| Quality | Good | Significantly better |
| Use case | First-stage: retrieve top-100 | Second-stage: rerank to top-5 |
The two-stage retrieve-then-rerank pipeline gets the best of both: bi-encoder speed for recall, cross-encoder quality for precision.
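A sketch of the two-stage pipeline; `embed` and `cross_score` are hypothetical stand-ins for the bi-encoder and cross-encoder models:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_then_rerank(query, docs, doc_embeddings, embed, cross_score,
                         first_k=100, final_k=5):
    # Stage 1 (bi-encoder): cheap dot products over precomputed doc embeddings.
    q = embed(query)
    candidates = sorted(range(len(docs)),
                        key=lambda i: dot(q, doc_embeddings[i]),
                        reverse=True)[:first_k]
    # Stage 2 (cross-encoder): expensive joint scoring on candidates only.
    return sorted(candidates,
                  key=lambda i: cross_score(query, docs[i]),
                  reverse=True)[:final_k]
```

The cross-encoder never sees documents that stage 1 missed, so first-stage recall bounds end-to-end quality.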
Concept Check – Quick Poll
For each scenario, vote: is one-shot RAG enough, or is multi-step tool use required?
Key distinction: One-shot RAG suffices when a single retrieval yields a complete answer. Multi-step is needed when the query requires composition, computation, or action.
Retrieval quality is the bottleneck of grounded generation: Most hallucinations are retrieval failures masquerading as generation failures.
Practical defaults (start here, then tune):
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # max characters per chunk (pass a length_function to count tokens)
    chunk_overlap=64,    # ~12% overlap
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(documents)
# Always attach metadata: {"source": ..., "section": ..., "date": ...}
```

Bad chunking counterexample: “…liable for damages unless the party provides notice within 30 days” gets split at “unless” – chunk 1 says “liable for damages” (misleading!).
Practical implication: Place the highest-scored chunk first or last, not in the middle. Alternatively, reduce top-k to 3-5 so there is no “middle” to get lost in.
When the system has multiple tools, the controller must route each query to the right one. This is fundamentally an intent classification problem — but with a twist: the label space (available tools) changes dynamically as tools are added or removed.
| Router type | How it works | Best when |
|---|---|---|
| Rule-based | Regex/keyword → tool mapping | Few tools, clear triggers ("calculate", "search for") |
| Classifier | Trained intent model → tool ID | Many tools, fuzzy boundaries, latency-sensitive |
| LLM-based | LLM reads tool descriptions, picks best match | Open-ended queries, new tools added frequently |
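The rule-based row reduces to a regex-to-tool table; a minimal sketch (the trigger patterns are illustrative assumptions, tool names follow the examples in this lecture):

```python
import re

# Ordered regex triggers mapped to tools, with "none" as the fallback.
ROUTES = [
    (re.compile(r"\b(calculate|what is \d|% of)\b", re.I), "calculator"),
    (re.compile(r"\b(search for|find|revenue|policy)\b", re.I), "search_corpus"),
    (re.compile(r"\b(file|create|submit) .*ticket\b", re.I), "api_write"),
]

def route(query: str) -> str:
    for pattern, tool in ROUTES:
        if pattern.search(query):
            return tool
    return "none"  # wrong tool is worse than no tool: default to no tool
```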
Why LLM-based routing works: Gorilla (Patil et al., 2023) showed LLMs can select the correct API from 1,600+ options — tool descriptions are natural language, so matching queries to descriptions is a semantic similarity task the model can already perform. But LLMs sometimes hallucinate API parameters; schema validation catches this.
The router must know when not to use a tool — wrong tool is worse than no tool:
The deeper point: Tool descriptions are training data for routing. Poorly written tool schemas are equivalent to mislabeled training data — the model will misroute queries.
Standards enforce structured, consistent descriptions that LLMs can reliably parse. This is what makes tool ecosystems composable — any model can use any MCP-compliant tool without bespoke integration.
Discussion (2 min)
You have three tools: search_docs, calculator, send_email. Design a routing policy:
- Which trigger conditions route to send_email, and what approval should it require?
- Which queries need both search_docs and calculator?

Most failures occur at system boundaries: Router ↔︎ Tool, Retriever ↔︎ Generator, Generator ↔︎ Verifier.
| Failure type | Symptom | First diagnostic |
|---|---|---|
| Retrieval miss | Wrong answer, no relevant docs in context | Check Recall@k on gold query set |
| Wrong tool selection | Used calculator when should have searched | Log router decisions; check tool selection accuracy |
| Bad arguments | Tool call fails or returns garbage | Schema validation rejection rate |
| Failure type | Symptom | First diagnostic |
|---|---|---|
| Tool timeout/error | Response hangs or returns error | Monitor p95 latency + error rate per tool |
| Verifier miss | Answer looks correct but contains unsupported claims | Run NLI faithfulness check on output vs. observations |
| Unsafe side effect | Tool executes harmful action (wrong email, wrong deletion) | Audit log review; check approval gates were enforced |
Barnett et al. (2024): missing content and wrong granularity account for ~60% of retrieval failures. Recall the \(0.9^n\) constraint: a 4-stage pipeline at 90% per-stage gives 65.6% end-to-end. Component-level monitoring matters more than end-to-end accuracy alone.
Grounded generation: \(P(y \mid x, E)\) where \(E = \{d_1^*, \ldots, d_k^*\}\) are the observations from tool calls
Measuring faithfulness — NLI-based verification: Run a Natural Language Inference model on each output claim \(c_i\) against the evidence \(E\). If \(\text{NLI}(E, c_i) = \text{entailment}\), the claim is grounded. FActScore (Min et al., 2023) automates this: decompose the output into atomic facts, check each against a knowledge source, report the fraction that are supported.
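A FActScore-style check can be sketched as below; `decompose` and `nli` are hypothetical stand-ins for an atomic-fact extractor and an NLI model, passed in by the caller:

```python
# Fraction of atomic claims in `output` that are entailed by `evidence`.
def faithfulness(output: str, evidence: str, decompose, nli) -> float:
    claims = decompose(output)
    if not claims:
        return 1.0  # no claims, vacuously faithful
    supported = sum(1 for c in claims if nli(evidence, c) == "entailment")
    return supported / len(claims)
```

In practice `decompose` would be an LLM prompt and `nli` a trained entailment model; the structure of the metric stays the same.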
| Observation class | Trigger condition | Output format |
|---|---|---|
| answer | Strong agreement across sources | {"action": "answer", "text": "...", "citations": ["doc-42"]} |
| abstain | Insufficient evidence | {"action": "abstain", "reason": "no relevant docs found"} |
| conflict | Sources disagree | {"action": "conflict", "source_a": "...", "source_b": "..."} |
| needs-approval | Answer requires side-effectful action | {"action": "needs_approval", "proposed": "...", "risk": "..."} |
System prompt template (reusable):
GROUNDING POLICY:
- If retrieved documents answer the question: cite doc_ids. Action: answer.
- If no relevant documents found: say so honestly. Action: abstain.
- If sources conflict: present both sides with citations. Action: conflict.
- If answering requires a write/delete action: describe the action and
request explicit user approval. Action: needs-approval.
Never fabricate citations. Every claim must reference a retrieved doc_id.

Exercise (2 min)
Scenario: Query = “What is the capital of Australia?” Retrieved docs:
| doc_id | score | content |
|---|---|---|
| doc-42 | 0.82 | “Canberra is the capital of Australia…” |
| doc-17 | 0.71 | “Sydney is the largest city in Australia…” |
| doc-88 | 0.45 | “Australia is a country in the Southern Hemisphere…” |
The LLM responds: {"action": "answer", "text": "The capital is Canberra [doc-42, doc-99]", "citations": ["doc-42", "doc-99"]}
The response cites doc-99, but only doc-42, doc-17, and doc-88 were retrieved. What went wrong?

ReAct (Yao et al., 2023) interleaves chain-of-thought reasoning with tool calls. The key contribution is not “LLM + tools” — it’s that reasoning traces between actions prevent hallucinated reasoning chains.
Thought: I need Apple's Q3 2025 revenue and Q2 2025 revenue to compare.
Action: search_corpus(query="Apple Q3 2025 revenue")
Observation: "Apple reported $85.8B in Q3 2025..."
Thought: Got Q3. Now I need Q2. ← reasoning anchored in observation
Action: search_corpus(query="Apple Q2 2025 revenue")
Observation: "Apple reported $81.4B in Q2 2025..."
Thought: I have both. Q3 - Q2 = $4.4B. ← synthesis before final answer
Action: calculator(expr="85.8 - 81.4")
Observation: 4.4
Answer: "Apple's revenue increased by $4.4B from Q2 to Q3 2025."

Self-RAG (Asai et al., 2023): the LLM decides when to retrieve:
[Retrieve], [IsRel], [IsSup], [IsUse] — trained via distillation from GPT-4 judgments. The model learns to self-assess at inference time without an external verifier.

Discussion (2 min)
When should the agent stop tool use and answer? Propose:
These paradigms differ mainly in who controls tool use and how verification is handled:
| | RAG | ReAct | Toolformer | AutoGPT / BabyAGI |
|---|---|---|---|---|
| Who selects tools | Hardcoded (always retrieval) | LLM per-step via prompt | LLM learned via self-supervision | LLM + autonomous planner |
| Loop control | Single-shot, no loop | LLM-driven thought-action loop | Inline during generation | External orchestrator + LLM |
| Planning horizon | None (one step) | Reactive (next step only) | None (token-level) | Full plan upfront, then execute |
| Human oversight | N/A (read-only) | Optional | None | None (by design) |
| Key paper | Lewis et al. 2020 | Yao et al. 2023 | Schick et al. 2023 | Significant Gravitas 2023 |
The autonomy spectrum: RAG (no autonomy) → ReAct (step-by-step autonomy) → AutoGPT (full autonomy). More autonomy = more capability but exponentially more failure modes. All paradigms are attempts to manage compounding error while increasing autonomy.
AutoGPT / BabyAGI (2023) — the first widely-deployed fully autonomous agents:
Why it mostly fails in practice
Community benchmarking: AutoGPT completed <30% of multi-step tasks end-to-end, despite each individual step being ~85% accurate — compounding error in action.
AutoGPT is what happens when you have planning + execution but no verification. The P/E/V pattern we’ll see next adds the missing piece.
Key distinction: Approaches 1–2 are inference-time solutions (no training changes). Approaches 3–4 require training-time investment but produce models that intrinsically know when to use tools.
Self-RAG is a hybrid: reflection tokens are trained at training-time, but the retrieval decision happens at inference-time.
This spectrum shows how autonomy can be shifted from inference-time control to training-time internalization.
RAG augments LLMs with retrieved text. Now we generalize: every tool follows the same action → observation → response pattern.
Policy matrix – map each tool tier to its scope and approval:
| Tier | Scope | Examples | Approval | Audit |
|---|---|---|---|---|
| read_only | No state changes | search, get_weather | Auto-approve | Log call + result |
| write_limited | Bounded mutations | update_profile, add_to_cart | User confirmation | Log + diff of changes |
| side_effect | External, irreversible | send_email, execute_payment | Human-in-the-loop | Full trace + approval record |
| privileged | System-level access | delete_record, modify_permissions | Multi-party approval | Full trace + compliance review |
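A sketch of the approval gate implied by the matrix; tier assignments follow the table, and `require_approval` is a hypothetical human-in-the-loop hook:

```python
# Tool -> trust tier, mirroring the policy matrix above.
TIER = {
    "search": "read_only",
    "get_weather": "read_only",
    "update_profile": "write_limited",
    "send_email": "side_effect",
    "delete_record": "privileged",
}

AUTO_APPROVED = {"read_only"}

def gated_call(tool: str, args: dict, execute, require_approval):
    tier = TIER.get(tool, "privileged")  # unknown tools get the strictest tier
    if tier not in AUTO_APPROVED and not require_approval(tool, args, tier):
        return {"status": "blocked", "tier": tier}
    return {"status": "ok", "result": execute(tool, args)}
```

The gate sits outside the LLM, so a prompt-injected model can request `send_email` but cannot execute it without the approval callback returning true.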
Defense in depth: The trust tier system means even if the LLM is tricked into wanting to call send_email, the approval gate blocks it. Layers of defense:
The paradigm comparison reveals a pattern: more autonomous systems (AutoGPT) fail because they lack verification. Less autonomous systems (RAG) are reliable but limited. P/E/V is the synthesis: it keeps the planning ambition of AutoGPT, the step-by-step grounding of ReAct, and adds the missing verification layer.
Why separate planning from execution? Shen et al. (2023) (HuggingGPT) showed that LLMs are effective task planners but poor task executors when doing both at once. Planning is a search problem over action sequences; execution is a control problem over tool interfaces. Mixing them overloads a single generation pass. P/E/V exists to break the \(0.9^n\) curse.
Task: "Book the cheapest SEA→JFK flight on March 5 under $400"
Planner: Step 1: search_flights(from=SEA, to=JFK, date=2026-03-05)
Step 2: filter + sort by price
Step 3: if cheapest < $400 → book_flight (needs approval)
Executor: Step 1 → [flight_a: $350, flight_b: $420, flight_c: $380]
Step 2 → cheapest = flight_a ($350) ✓
Verifier: ✓ Route matches (SEA→JFK) ✓ Date matches (March 5)
✓ Price < $400 ($350) ⚠ book_flight → APPROVAL GATE
→ System: "I found flight_a SEA→JFK on Mar 5 for $350. Shall I book it?"

If the verifier catches a mismatch (wrong route, wrong date), the executor does NOT proceed — it re-plans or asks the user.
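The trace above can be sketched as a loop; `plan`, `execute`, and `verify` are hypothetical stand-ins for the three components, with `verify` returning a list of constraint violations (empty means the step checks out):

```python
def pev_loop(task, plan, execute, verify, max_replans: int = 2):
    for _ in range(max_replans + 1):
        steps = plan(task)
        history, ok = [], True
        for step in steps:
            obs = execute(step)
            violations = verify(step, obs, task)
            if violations:
                # Feed violations back into the next planning pass.
                task = {"original": task, "violations": violations}
                ok = False
                break
            history.append((step, obs))
        if ok:
            return history  # every step verified
    return None  # give up and escalate to the user
```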
Compounding error — the fundamental reliability limit
If each tool call has 90% accuracy, a 5-step plan succeeds only \(0.9^5 \approx 59\%\) of the time. At 95% per-step, 5 steps gives 77%. This exponential decay is why agentic systems plateau. The verifier breaks this cascade by catching errors before they propagate.
Cobbe et al. (2021): training a verifier on math solutions was more effective than training a better generator. Generating 100 candidates and picking the best beats generating 1 solution from a 10x larger model.
Exercise (2 min)
Using the SEA→JFK flight booking task from the P/E/V trace above, what happens when each component fails?
Scenario A: The router selects calculator instead of search_flights. What catches this? Where does the error surface?
Scenario B: search_flights times out. Where does the error appear in the agent’s history? What should the planner do on the next step?
Scenario C: The verifier is disabled. The system books a $500 flight — but the user’s budget was $400. What went wrong? Which component should have caught this?
| Metric | Definition | Where measured |
|---|---|---|
| Task success rate | % of tasks completed correctly end-to-end | Final output vs. ground truth |
| Tool selection accuracy | % of queries routed to the correct tool | Router output vs. labeled tool |
| Argument validity rate | % of tool calls that pass schema validation | Before tool execution |
| Faithfulness | % of output claims entailed by tool observations (NLI) | After generation |
| Verifier catch rate | % of errors caught by verifier before output | Verifier stage |
| Cost-per-success | $ spent per successfully completed task (API calls + compute) | Billing logs |
| Latency p95 | 95th percentile wall-clock time from query to response | End-to-end |
When something goes wrong, debug using step-level spans – structured log events for each component:
[TRACE] task_id=abc-123 query="What is our electronics return policy?"
├─ [SPAN] router tool=search_corpus confidence=0.92 ✓
├─ [SPAN] tool_call args={query: "electronics return policy", top_k: 5}
│ └─ [EVENT] schema_valid=true latency_ms=45
├─ [SPAN] retrieval top_1_doc=doc-42 score=0.87
│ └─ [EVENT] snippet="Electronics may be returned within 15 days..."
├─ [SPAN] generator model=gpt-4 tokens_in=1200 tokens_out=85
│ └─ [EVENT] output="Our return policy allows electronics returns within 15 days."
├─ [SPAN] verifier faithfulness=0.95 citations_valid=true ✓
└─ [RESULT] action=answer latency_total_ms=320 cost=$0.004

Where to instrument:
Start at step 1 when debugging: most failures are routing or retrieval errors.
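A minimal span logger in that spirit (field names are illustrative, not a real tracing API):

```python
import json
import time
import uuid

def make_trace(query: str) -> dict:
    return {"task_id": str(uuid.uuid4())[:8], "query": query, "spans": []}

def span(trace: dict, component: str, **fields) -> dict:
    # One structured event per component: router, tool_call, retrieval, ...
    event = {"component": component, "ts_ms": int(time.time() * 1000), **fields}
    trace["spans"].append(event)
    print(json.dumps(event))
    return event

# Usage:
# trace = make_trace("What is our electronics return policy?")
# span(trace, "router", tool="search_corpus", confidence=0.92)
# span(trace, "retrieval", top_1_doc="doc-42", score=0.87)
```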
Tip
Increasing autonomy increases compounding error.
Reliable tool-using systems require:
Typed tool contracts — JSON Schema-first interfaces with strict validation and retry. Every tool call is schema-validated before execution.
Routing policy — Rule-based, classifier, or LLM-based routing with confidence thresholds. Wrong tool is worse than no tool – always have a fallback path.
Verification layer — Planner/executor/verifier architecture. The verifier catches errors before they propagate. Approval gates for side-effectful tools.
Observability and evals — Trace every step (router, tool call, verifier). Monitor task success rate, tool selection accuracy, verifier catch rate, cost-per-success.
Design checklist:
Some Papers & Standards