2026-02-17
Content derived from: J&M Ch. 7 (Large Language Models)
Today’s Lecture:
By the end of this lecture, you will be able to:
Throughout this lecture, we’ll develop prompts for a concrete scenario: a medical triage assistant that helps patients describe symptoms and routes them to appropriate care.
State Change: Learning Without Training
Finetuning adapts model weights. In-context learning adapts model behavior through the prompt alone, keeping all parameters frozen.
The model processes a prompt \(p\) and input \(x\) to produce output \(y\) with frozen weights \(\theta\):
\[ y = \mathrm{LLM}_\theta(p, x) \]
| Prompt Type | What the Model Sees | Example |
|---|---|---|
| Zero-shot | Task description only | Translate to French: 'cat' |
| Few-shot | k input-output demonstrations | cat → chat, dog → chien, house → ? |
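The two prompt types above can be sketched as simple string builders. A minimal sketch; the translation task and demonstrations mirror the table, but the exact formatting is an illustrative assumption.

```python
# Zero-shot vs. few-shot prompt construction (formatting is illustrative).

def zero_shot_prompt(word: str) -> str:
    """Task description only: the model must infer the mapping from the instruction."""
    return f"Translate to French: '{word}'"

def few_shot_prompt(word: str, demos: list[tuple[str, str]]) -> str:
    """k input-output demonstrations followed by the query, all in one prompt."""
    lines = [f"{src} -> {tgt}" for src, tgt in demos]
    lines.append(f"{word} -> ?")
    return "\n".join(lines)

demos = [("cat", "chat"), ("dog", "chien")]
print(zero_shot_prompt("house"))        # Translate to French: 'house'
print(few_shot_prompt("house", demos))  # cat -> chat / dog -> chien / house -> ?
```

Note that the weights \(\theta\) never change; only the string fed to the model differs between the two strategies.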
The output distribution is sensitive to prompt structure:
\[ P(y \mid x, p_1) \neq P(y \mid x, p_2) \quad \text{if} \quad p_1 \neq p_2 \]
Prompt sensitivity is real: Zhao et al. (2021) showed that simply reordering few-shot examples can swing GPT-3 accuracy from near-chance to near-SOTA on the same benchmark — a 30+ percentage point difference from permutation alone.
Think-Pair-Share: Choosing a Prompting Strategy
For our triage assistant, would you use zero-shot or few-shot prompting? Consider:
Discuss with a neighbor for 2 minutes, then share your recommendation.
State Change: Beyond Direct Answers
Standard prompting asks for a final answer. Chain-of-thought prompting elicits intermediate reasoning steps, dramatically improving performance on complex tasks.
CoT biases the model toward generating a reasoning trace rather than a single direct answer:
Standard vs. chain-of-thought prompting from Wei et al. (2022)
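In the spirit of the Wei et al. (2022) comparison, the two prompt styles can be contrasted directly. The question and the worked demonstration below are hypothetical examples, not drawn from the paper.

```python
# Standard vs. chain-of-thought prompting (question and demo are hypothetical).

QUESTION = "A clinic sees 12 patients per hour for 3 hours. How many patients total?"

# Standard prompting: ask for the final answer directly.
standard = f"Q: {QUESTION}\nA:"

# Chain-of-thought prompting: a demonstration with explicit intermediate
# reasoning steps biases the model toward emitting its own reasoning trace.
cot = (
    "Q: A bus carries 8 passengers per trip and makes 4 trips. "
    "How many passengers in total?\n"
    "A: Each trip carries 8 passengers. 4 trips carry 4 * 8 = 32 passengers. "
    "The answer is 32.\n"
    f"Q: {QUESTION}\n"
    "A:"
)
```

The only difference is the demonstration's reasoning trace; the task instruction and query are identical.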
GPT-4 and the Bar Exam
GPT-4 scored in the 90th percentile on the Bar Exam (vs. GPT-3.5’s 10th percentile). Takeaway: Benchmarks can overstate reasoning ability — multiple-choice format may not reflect true open-ended legal reasoning.
| Strategy | Compute Cost | Robustness | Best For |
|---|---|---|---|
| Basic CoT | 1× | Low | Simple multi-step problems |
| Self-Consistency | N× | High | When you need reliable answers (triage!) |
| Tree-of-Thought | N×M× | Highest | Creative/open-ended exploration |
| Least-to-Most | K× | Medium | Hierarchical decomposition |
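The N× cost of self-consistency buys robustness through aggregation: sample N chains at nonzero temperature, extract each chain's final answer, and take a majority vote. A minimal sketch of the aggregation step, with hypothetical sampled answers standing in for real model outputs.

```python
# Self-consistency aggregation: majority vote over final answers from
# N independently sampled reasoning chains (answers below are hypothetical).
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Aggregate by voting on final answers, not on full reasoning chains."""
    return Counter(answers).most_common(1)[0][0]

# Suppose 5 sampled chains ended with these final answers:
sampled_answers = ["36", "42", "36", "36", "38"]
print(majority_vote(sampled_answers))  # → 36
```

This is why self-consistency helps: individual chains can derail, but if correct answers are sampled more often than any single wrong answer, the vote recovers them.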
Concept Check
Why does self-consistency improve over a single chain-of-thought? Under what conditions might it fail? Think about the relationship between answer diversity and aggregation quality.
Discussion (2 min)
Consider a multi-hop science question: “What element has the highest electronegativity, and what compounds does it commonly form?” Would you use basic CoT, self-consistency, or least-to-most prompting? Why?
State Change: Controlling Output Format
Reasoning strategies improve answer quality. Structured prompting goes further by constraining the format of model outputs, enabling integration with downstream systems.
By specifying output schemas, we reduce output entropy and enable robust downstream parsing:
\[ f_{\text{LLM}}(x; P) \in \mathcal{S} \]
where \(\mathcal{S}\) is the set of valid outputs (e.g., all well-formed JSON objects).
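For the triage assistant, constraining outputs to \(\mathcal{S}\) might look like the sketch below: the prompt specifies a JSON schema, and a validator rejects any reply outside the valid set before it reaches downstream systems. The schema fields (`urgency`, `department`, `red_flags`) are illustrative assumptions, not a standard.

```python
# Structured prompting sketch: a schema instruction plus a validator that
# rejects any reply outside the valid output set S. Field names are
# illustrative assumptions for the triage scenario.
import json

SCHEMA_INSTRUCTIONS = (
    "Respond ONLY with a JSON object of the form:\n"
    '{"urgency": "low" | "medium" | "high", "department": "<string>", '
    '"red_flags": ["<string>", ...]}'
)

def parse_triage_reply(reply: str) -> dict:
    """Accept only well-formed JSON with the expected keys and values."""
    obj = json.loads(reply)  # raises ValueError if not well-formed JSON
    required = {"urgency", "department", "red_flags"}
    if not required <= obj.keys():
        raise ValueError(f"missing keys: {required - obj.keys()}")
    if obj["urgency"] not in {"low", "medium", "high"}:
        raise ValueError("invalid urgency level")
    return obj

reply = '{"urgency": "high", "department": "cardiology", "red_flags": ["chest pain"]}'
print(parse_triage_reply(reply)["urgency"])  # → high
```

Validation failures can trigger a retry with the error message appended to the prompt, a common pattern for keeping outputs inside \(\mathcal{S}\).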
In retrieval-augmented generation, the prompt is built by prepending retrieved evidence to the query:
\[ \text{Prompt}(q) = \text{concat}\left(\mathcal{R}(q),\ q\right) \]
where \(\mathcal{R}(q)\) is a retriever that returns documents relevant to the query \(q\).
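The concat operation above can be sketched directly. The retriever here is a placeholder returning canned passages (an assumption; a real system would rank documents by similarity), but the prompt assembly mirrors the equation.

```python
# Minimal retrieval-augmented prompt assembly: Prompt(q) = concat(R(q), q).
# `retrieve` is a stand-in retriever with canned passages, not a real API.

def retrieve(query: str, k: int = 3) -> list[str]:
    """Placeholder R(q): return up to k passages relevant to the query."""
    corpus = [
        "Influenza typically causes fever, cough, and body aches.",
        "Urgent cases include chest pain and difficulty breathing.",
    ]
    # A real retriever would rank by embedding similarity; here we fake it.
    return corpus[:k]

def build_prompt(query: str) -> str:
    """Prepend retrieved evidence to the query, then ask for an answer."""
    evidence = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Context:\n{evidence}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What are common flu symptoms?"))
```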
Concept Check
A biomedical QA system retrieves 20 PubMed abstracts for each query, but the LLM’s context window is 4,096 tokens. What strategies could you use to handle this? What are the trade-offs?
State Change: From Answering to Acting
Prompting strategies so far produce text. Agentic workflows enable LLMs to reason, plan, and invoke external tools to complete tasks in the real world.
LLMs alternate between generating intermediate thoughts and invoking external tools:
\[ \pi: (x, h) \rightarrow (r, a) \]
where \(r\) is a reasoning step, \(a\) is an action (API/tool call), and \(h\) is the interaction history.
ReAct framework comparison from Yao et al. (2023): reasoning-only vs. action-only vs. ReAct
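A ReAct-style policy \(\pi\) can be sketched as a loop that alternates thoughts and tool calls until a final answer is produced. The `llm_step` policy and the tool registry below are stand-ins (assumptions), not real APIs; a real system would call the model at each step.

```python
# Sketch of a ReAct-style loop pi: (x, h) -> (r, a). `llm_step` and TOOLS
# are hard-coded placeholders standing in for a real LLM and real tools.

def llm_step(task: str, history: list[str]) -> tuple[str, str, str]:
    """Placeholder policy: return (thought r, action name, action input)."""
    if not history:
        return ("I should check the guideline for this symptom.", "lookup", "chest pain")
    return ("I have enough information to answer.", "finish", "Route to emergency care.")

TOOLS = {"lookup": lambda q: f"Guideline: '{q}' is a red-flag symptom."}

def react_loop(task: str, max_steps: int = 5) -> str:
    history: list[str] = []  # h: the interaction history
    for _ in range(max_steps):
        thought, action, arg = llm_step(task, history)
        history.append(f"Thought: {thought}")
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)  # invoke the external tool
        history.append(f"Action: {action}[{arg}]\nObservation: {observation}")
    return "Step budget exhausted."

print(react_loop("Patient reports chest pain."))  # → Route to emergency care.
```

Note the `max_steps` budget: without it, a confused policy can loop indefinitely, one of the failure modes agentic systems must guard against.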
For our triage assistant, a simple agent loop might:
For a pipeline of \(n\) steps where each step has error probability \(p_i\):
\[ P_{\text{failure}} = 1 - \prod_{i=1}^n (1 - p_i) \]
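The formula can be checked with a quick computation; the per-step error rates below are illustrative, chosen to show how quickly even small errors compound.

```python
# Worked instance of the compounding-error formula for an n-step pipeline.
import math

def pipeline_failure(error_rates: list[float]) -> float:
    """P_failure = 1 - prod_i (1 - p_i)."""
    return 1.0 - math.prod(1.0 - p for p in error_rates)

# Three steps, each with a 5% error rate:
p = pipeline_failure([0.05, 0.05, 0.05])
print(round(p, 4))  # → 0.1426, i.e. 1 - 0.95**3
```

Even with each step 95% reliable, the three-step pipeline fails over 14% of the time, and the failure probability only grows with depth.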
Concept Check
If each step in a 5-step agentic pipeline has a 10% error rate, what is the overall failure probability? What does this imply about designing agentic systems?
In-context learning enables task adaptation through prompts alone, with no parameter updates — the prompt serves as an inductive bias over frozen weights
Prompt design matters: wording, formatting, and example order significantly affect model behavior; principled prompt engineering is essential
Chain-of-thought prompting elicits step-by-step reasoning, yielding large accuracy gains on arithmetic, logic, and multi-hop tasks
Advanced strategies (self-consistency, tree-of-thought, least-to-most) trade additional compute for more reliable reasoning
Structured prompting constrains outputs to machine-readable formats, enabling integration with production systems
Agentic workflows extend LLMs from text generators to planners and actors, but face compounding errors, hallucination, and prompt injection risks
3 prompting heuristics you can apply tomorrow:
Finetuning & Adaptation (Feb 19)
When prompting hits its limits — insufficient domain knowledge, inconsistent behavior, or the need for persistent style changes — we turn to training-time adaptation.