2026-03-03
Key References: Liang et al. (2023) — HELM; Zheng et al. (2023) — Chatbot Arena; Sainz et al. (2023) — Contamination
By the end of this lecture, you will be able to:
Your team built a news summarization system. Your PM asks: “Is it ready to ship?”
Here’s what it produces for the article “Talks in Geneva ended without agreement on climate funding targets”:
Vote: Ship / Don’t ship / Need more information?
Running example: Evaluating a news summarization system — how do you score “Talks in Geneva ended without agreement on climate funding targets”?
For example, BLEU-optimized systems produce disfluent text, and RLHF models learn to exploit reward models rather than genuinely satisfy users.
Quick question: How would you game ROUGE for summarization?
Warning
Models can “game” any single metric — multi-dimensional evaluation is essential. The evaluation challenge motivates everything else in this lecture.
| Metric | Task | Formula (intuition) | Key Limitation |
|---|---|---|---|
| Perplexity | Language modeling | \(2^{-\frac{1}{N}\sum \log_2 p(w_i)}\) | Rewards fluency, ignores factuality |
| BLEU | Translation, summarization | n-gram precision vs. reference | Ignores semantics; penalizes valid paraphrases |
| ROUGE-L | Summarization | Longest common subsequence | Rewards surface overlap, misses abstraction |
| BERTScore | Open-ended generation | Embedding similarity | Better with paraphrases; still imperfect |
| Human Judgment | Any | Likert scales, pairwise comparison | Expensive, slow, annotator disagreement |
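The gaming problem in the table is easy to demonstrate. Below is a minimal sketch using plain unigram precision (a simplification of BLEU-1, with no clipping subtleties or brevity penalty): a negation-flipped candidate scores far higher than a faithful paraphrase of the running-example reference.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference (clipped)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / total

reference  = "talks in geneva ended without agreement on climate funding targets"
flipped    = "talks in geneva ended with agreement on climate funding targets"
paraphrase = "geneva negotiations concluded with no deal on climate finance goals"

print(unigram_precision(flipped, reference))     # 0.9 despite the opposite meaning
print(unigram_precision(paraphrase, reference))  # 0.3 despite the correct meaning
```

Full BLEU adds higher-order n-grams and a brevity penalty, but the ranking failure is the same: surface overlap cannot distinguish "with" from "without".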
Reference: “Talks in Geneva ended without agreement on climate funding targets.”
Vote now: Rank these best → worst. Which would you ship?
| Candidate | BLEU-4 | ROUGE-L | BERTScore | Actual Quality |
|---|---|---|---|---|
| A (good paraphrase) | Low | Low | High | Best |
| B (factually wrong!) | Very High | Very High | High | Worst |
| C (extractive/incomplete) | Medium | Medium | Medium | Incomplete |
Warning
BLEU and ROUGE rank the factually wrong summary (B) highest because they only measure surface overlap. BERTScore handles paraphrases better but still can’t catch the negation flip. Which failure mode happened here? Surface overlap gaming + negation blindness.
Activity: Design a better metric (60 sec)
Challenge: Propose a metric or test that would catch the negation flip in Candidate B (“without” → “with” agreement).
Think about: What would your metric actually compute? What input does it need beyond the candidate text?
Your proposals likely fall into one of these categories — and each has a catch:
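As one illustration, a factuality-oriented proposal can be mocked up as a toy negation-polarity check. A real system would use an NLI entailment model; the cue list and the "word right after a negation cue" heuristic below are purely illustrative assumptions.

```python
# Toy sketch: compare which content words are negated in reference vs. candidate.
# Real factuality checks use NLI/entailment models; this is only the intuition.
NEGATION_CUES = {"without", "no", "not", "never", "failed"}

def negation_signature(text: str) -> set:
    """Words appearing immediately after a negation cue (hypothetical heuristic)."""
    words = text.lower().split()
    return {words[i + 1] for i, w in enumerate(words)
            if w in NEGATION_CUES and i + 1 < len(words)}

ref  = "talks in geneva ended without agreement on climate funding targets"
cand = "talks in geneva ended with agreement on climate funding targets"

# The flip is caught: 'agreement' is negated in the reference but not the candidate.
print(negation_signature(ref) - negation_signature(cand))  # {'agreement'}
```

The catch, as with every category of proposal: the check needs extra input (here, an inventory of negation cues; in general, a trained entailment model and the source article), and it has its own failure modes.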
| Benchmark | What It Tests | Design Innovation |
|---|---|---|
| MMLU | 57-subject knowledge (STEM, humanities, social sciences) | Multiple-choice; easy to score |
| HELM | Multi-metric holistic evaluation | Evaluates accuracy + calibration + fairness + toxicity simultaneously |
| BIG-Bench | 204 diverse tasks from 400+ researchers | Crowdsourced; tests emergent abilities |
| Chatbot Arena | Open-ended human preference | Head-to-head blind comparisons; Elo ratings |
Well-designed benchmarks need: breadth across tasks, held-out test sets, regular refreshing, and resistance to gaming.
The problem: LLM pretraining corpora are massive (trillions of tokens from the internet). Benchmark datasets are also on the internet. \(\Rightarrow\) Test data leaks into training data.
| Detection Method | How It Works |
|---|---|
| n-gram overlap | Search for exact test set sequences in training data |
| Canary strings | Plant unique tokens in benchmarks; test if models generate them |
| Membership inference | Test if model assigns suspiciously high probability to test examples |
| Temporal splits | Only use test data created after the model's training cutoff |
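The first row of the table can be sketched directly, assuming you can stream the training corpus. The n-gram size is a tunable assumption (GPT-3's decontamination report used 13-grams; smaller n gives more false positives).

```python
# Sketch: n-gram overlap contamination check over a streamable training corpus.
def ngrams(text: str, n: int) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(test_example: str, training_docs, n: int = 8) -> bool:
    """Flag the test example if any of its n-grams appears in a training document."""
    test_grams = ngrams(test_example, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)
```

In practice this runs against a suffix-array or Bloom-filter index of the corpus rather than a rescan per example, but the decision rule is the same.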
Activity: The contamination verdict (2 min)
Scenario: Model X scores 92% on MMLU (state-of-the-art!) but drops to 78% on questions published after its training cutoff.
Step 1 — Vote (30 sec): What do you conclude?
Step 2 — Pair justification (60 sec): Defend your choice to a neighbor.
Step 3 — Debrief: What additional evidence would change your answer?
The same hypothesis-testing logic applies to canary strings — if the model generates a planted token sequence verbatim, that's an extreme-probability event under the null hypothesis (H₀) that the benchmark never appeared in its training data.
A model’s perplexity on four benchmark examples (lower = more “familiar”):
| Example | Perplexity |
|---|---|
| "What is the capital of France?" | 3.2 |
| "Calculate ∫₀¹ x² sin(πx) dx" | 45.7 |
| "According to Hendrycks (2021), Table 3, row 5..." | 1.8 |
| "Name three causes of the 1997 Asian financial crisis" | 12.4 |
Vote: Which example(s) look contaminated? Why?
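A toy version of the membership-inference idea: flag items whose perplexity sits far below the benchmark's median. The 0.25 ratio threshold is an assumption for illustration; real audits compare against paraphrased or held-out controls rather than a fixed cutoff.

```python
# Toy membership-inference heuristic over the perplexity table above.
from statistics import median

def flag_suspicious(perplexities: dict, ratio: float = 0.25) -> list:
    """Flag examples whose perplexity is below ratio * median (assumed threshold)."""
    med = median(perplexities.values())
    return [ex for ex, ppl in perplexities.items() if ppl < ratio * med]

ppls = {"capital of France": 3.2, "integral": 45.7,
        "Hendrycks Table 3, row 5": 1.8, "1997 Asian crisis": 12.4}
print(flag_suspicious(ppls))  # ['Hendrycks Table 3, row 5']
```

Note what the flagged example looks like: a verbatim citation of a benchmark paper's own table, exactly the kind of string a model only "knows" if the benchmark leaked into training.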
Two annotators rate 100 summaries as “Good” or “Bad”:
| | B: Good | B: Bad |
|---|---|---|
| A: Good | 85 | 5 |
| A: Bad | 5 | 5 |
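Working the numbers: raw agreement is 90/100 = 0.90, but both annotators say “Good” 90% of the time, so chance agreement is already 0.9·0.9 + 0.1·0.1 = 0.82, giving κ = (0.90 − 0.82)/(1 − 0.82) ≈ 0.44. A quick check:

```python
# Cohen's kappa for the 2x2 agreement table above.
def cohens_kappa(counts):
    # counts[i][j] = number of items annotator A labeled i and annotator B labeled j
    total = sum(sum(row) for row in counts)
    p_o = sum(counts[i][i] for i in range(len(counts))) / total  # observed agreement
    p_a = [sum(row) / total for row in counts]                   # A's label marginals
    p_b = [sum(col) / total for col in zip(*counts)]             # B's label marginals
    p_e = sum(a * b for a, b in zip(p_a, p_b))                   # chance agreement
    return (p_o - p_e) / (1 - p_e)

table = [[85, 5],   # A: Good → B: Good / B: Bad
         [5, 5]]    # A: Bad  → B: Good / B: Bad
print(round(cohens_kappa(table), 3))  # 0.444 — only "moderate" despite 90% raw agreement
```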
Warning
κ corrects for base-rate imbalance. High raw agreement can mask the fact that annotators mostly agree by default, not by judgment.
Inter-annotator agreement measures protocol quality — low agreement means the task is ambiguous, not that annotators are bad:
The idea: Replace human annotators with an LLM prompted to evaluate outputs.
Known biases in LLM judges:
Prompt to judge model: “Which response better answers the question ‘What causes tides?’”
| Trial | Order Shown | Judge's Verdict |
|---|---|---|
| 1 | Response A first, B second | "A is better" |
| 2 | Response B first, A second | "B is better" (same response, now shown first!) |
The judge flipped its preference purely based on presentation order — the same response wins when shown first.
Warning
Mitigation: Run every comparison twice (AB and BA order). Only count as a win if the same response wins both times; otherwise mark as a tie.
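The mitigation can be sketched as a wrapper around any judge call. The `judge` callable and its `"first"`/`"second"` return convention are hypothetical stand-ins for your actual LLM-judge API.

```python
# Sketch of the AB/BA mitigation: count a win only if it survives both orders.
def position_debiased_verdict(judge, resp_a: str, resp_b: str) -> str:
    """judge(first, second) -> "first" or "second" (hypothetical interface)."""
    ab = judge(resp_a, resp_b)   # A shown first
    ba = judge(resp_b, resp_a)   # B shown first
    if ab == "first" and ba == "second":
        return "A wins"
    if ab == "second" and ba == "first":
        return "B wins"
    return "tie"  # verdict flipped with order → position bias, count as a tie

# A judge that always prefers whichever response is shown first gets neutralized:
biased_judge = lambda first, second: "first"
print(position_debiased_verdict(biased_judge, "resp A", "resp B"))  # tie
```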
How it works:
Concept Check
Let’s see Arena in action!
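Under the hood, Arena-style leaderboards aggregate pairwise human votes into ratings. A minimal Elo update looks like this (K = 32 is an assumption for illustration; Chatbot Arena has since moved to a Bradley–Terry fit, but the pairwise intuition is the same):

```python
# Minimal Elo update for one head-to-head comparison.
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """Standard Elo: winner gains k * (1 - expected win probability)."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

print(elo_update(1000.0, 1000.0))  # (1016.0, 984.0): equal ratings shift by k/2
```

Upsets move ratings more: a low-rated model beating a high-rated one gains nearly the full K, while an expected win gains almost nothing.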
Red-teaming = systematically attempting to elicit failures, unsafe outputs, or unexpected behaviors.
Red-teaming is not just “trying to break the model” — it follows a structured methodology:
Two prompt templates for the same summarization task on the same test set:
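A prompt-sensitivity run can be sketched as follows. Both templates, and the `generate`/`score` callables, are hypothetical stand-ins for your system and your chosen metric.

```python
# Sketch: score the same test set under two prompt templates and compare means.
from statistics import mean

TEMPLATES = {
    "terse":    "Summarize: {article}",
    "detailed": "Summarize the following news article in one sentence, "
                "preserving all key facts:\n{article}",
}

def prompt_sensitivity(test_set, generate, score):
    """generate(prompt) -> summary; score(summary, reference) -> float (both assumed)."""
    results = {}
    for name, template in TEMPLATES.items():
        scores = [score(generate(template.format(article=article)), reference)
                  for article, reference in test_set]
        results[name] = mean(scores)
    return results  # a large gap between templates signals a fragile system
```

If the two means differ by more than your metric's noise floor, report the spread, not just the best template's score.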
Activity: Design your evaluation strategy (3 min)
Scenario: You’re shipping the news summarization system from today’s running example.
Step 1 — Vote (30 sec): Which evaluation approach would you prioritize first?
Step 2 — Pair justification (60 sec): Turn to a neighbor. Defend your choice.
Step 3 — Debrief: What did each choice trade off? (Cost vs. reliability vs. speed)
| If you're shipping... | Then do this |
|---|---|
| Any LLM system | Use multiple metrics — never trust a single number. Report confidence intervals. |
| A generation task | BERTScore + human pairwise eval. BLEU/ROUGE only as sanity checks. |
| Against a benchmark | Check for contamination (temporal splits, membership inference). Treat scores as upper bounds. |
| LLM-as-a-judge eval | Randomize order, run AB + BA, use multiple judges, calibrate against human labels. |
| To production users | Red-team first (safety + capability + robustness). Fix seeds, log API versions, run prompt sensitivity. |
Tip
Evaluation is not a one-time gate — it’s a continuous practice. The field is evolving rapidly; contamination detection and LLM-as-a-judge are active research areas.
Everything we covered today evaluates a single model turn. But LLM agents act over multiple steps — browsing, coding, calling APIs, revising. This breaks our evaluation assumptions:
| Single-Turn Eval | Agentic Eval |
|---|---|
| Input → Output → Score | Goal → Trajectory (plan, tool calls, revisions) → Outcome |
| One correct answer (or small set) | Many valid trajectories to the same goal |
| Errors are local (one bad generation) | Errors compound — one bad step derails the whole plan |
| Static benchmark dataset | Live environments (web, code execution, databases) |