2026-03-05
Key References: Bender et al. (2021) — Stochastic Parrots; Bai et al. (2022) — Constitutional AI; Weidinger et al. (2022) — Risk Taxonomy
By the end of this lecture, you will be able to:
Every topic in today’s lecture maps to one of these layers:
| Topic | Layer | Update cost |
|---|---|---|
| Hallucination mitigation | Runtime | Code deploy |
| Bias & fairness | All layers | Varies |
| Alignment (RLHF, CAI) | Training | Retraining |
| Guardrails & filtering | Runtime | Code deploy |
| Monitoring & logging | System | Config change |
| Governance & regulation | System | Config change |
Safety is not a single technique. It is a property of the entire system.
Root cause: LLMs optimize next-token likelihood: \(\arg\max P_\theta(\text{text} \mid \text{prompt})\). This objective rewards fluent continuation, not factual correctness — compounded by training data artifacts, high-temperature decoding, and knowledge gaps.
Discussion: Which defense from the mitigation ladder would catch this?
Concept Check
A medical chatbot confidently states a drug interaction that doesn’t exist. Intrinsic or extrinsic hallucination? Which rung of the ladder would you require?
Let’s try this live
Ask: “What paper introduced attention? Cite the authors and year exactly.”
What changes between runs? Check the log.
These sources produce two categories of harm:
Key benchmarks: WinoBias (gender in coreference), CrowS-Pairs (stereotypes), BBQ (QA bias)
Auditing for bias: Run counterfactual pairs — swap demographic attributes on identical inputs and measure output differentials. Systematic differences reveal bias — but counterfactual pairs only catch one dimension. Bias can also stem from labeling, reward modeling, deployment context, and how users interpret outputs. No single audit method covers all sources.
Implication: Safety is not a feature you add — it is a systems engineering discipline.
State Change: Defense in Depth
Safe deployment requires layered mechanisms: content filtering catches obvious harms, alignment shapes model values, and adversarial robustness handles adaptive attackers.
Key insight: Filtering rejects bad outputs after generation; alignment prevents them from being generated in the first place.
| Attack | Technique | Example |
|---|---|---|
| Prompt Injection | Override system instructions | "Ignore previous instructions..." |
| Jailbreaking | Obfuscated or role-play bypass | "DAN" (Do Anything Now) |
| Multi-Turn Erosion | Gradually erode safety over conversation | Bing Chat "Sydney" incident |
Defense-in-depth — assume any single layer can be breached:
Let’s try this live
What’s going on here?
Alignment shapes behavior. Guardrails enforce rules.
def llm_pipeline(prompt):
# Input guardrails
if detect_prompt_injection(prompt):
return "Request rejected"
if contains_pii(prompt):
return "Sensitive information detected"
# Generation
response = llm.generate(prompt)
# Output guardrails
if toxicity_classifier(response) > threshold:
return "Response filtered"
return responseGuardrails are deterministic checks around probabilistic generation.
Discussion (2 min)
Can we ever fully “solve” adversarial robustness in open-ended language models, or is it fundamentally an arms race? What does this imply for deployment strategy?
Live challenge (3 minutes)
Click the “Part 3: Adversarial” preset. Injection detection and topic restriction are ON.
Your goal: Get the model to do something it shouldn’t. Suggest prompts and I’ll type them in.
Watch the log after each attempt.
State Change: From Research to Production
Building a safe model is necessary but insufficient. Deploying it responsibly requires engineering for cost, latency, monitoring, and governance.
Latency budget for a RAG-augmented request:
Optimization order for LLM systems:
Streaming masks perceived latency: users see tokens as they’re generated
Concept Check
You need to deploy a 70B-parameter LLM on a single A100 GPU (80GB VRAM). The model is ~280GB in FP32. What combination of techniques makes this feasible?
Core principle: AI governance follows risk-based design — higher-risk systems require stronger oversight.
| Tool | Purpose |
|---|---|
| Model Cards | Document behavior, limitations, intended uses, demographic performance |
| Transparency Reports | Ongoing disclosure of incidents, drift, emerging risks |
| Regulatory Compliance | EU AI Act risk categories, NIST AI RMF |
| Environmental Disclosure | Carbon footprint and energy use per training/serving run |
Important
Safety is part of infrastructure, not just modeling.
NYT v. OpenAI (2023): ChatGPT reproduced Pulitzer Prize-winning articles near-verbatim when prompted with opening lines — raising the question of whether models memorize copyrighted content.
Hallucination is a fundamental LLM failure mode — distinguish intrinsic vs. extrinsic; mitigate with a defense ladder from abstention to human-in-the-loop
Bias enters at every pipeline stage — data, labeling, reward modeling, deployment, and interpretation. Measurement requires multiple methods (counterfactual pairs, benchmarks, user studies); mitigation spans training, inference, and system design
Safety requires defense-in-depth — Constitutional AI shapes values; content filtering, aligned LLMs, output filters, and monitoring each address different threat layers
Adversarial robustness is an ongoing arms race — no single defense suffices; continuous red-teaming is essential
Efficient deployment uses quantization, distillation, and batching to meet cost/latency requirements without sacrificing safety
Governance follows risk-tiered regulation — model cards, transparency reports, and regulatory compliance are infrastructure, not paperwork
Congratulations on (almost) completing CSE 447/517!
This quarter we covered the full arc of modern NLP:
Key message: Building capable NLP systems is inseparable from building safe, fair, and accountable ones. The technical and ethical dimensions are deeply intertwined.
Next week: We’ll have guest speakers talking about FlexOlmo (Kevin from Ai2, MoE, Unlearning, Responsible data use), Music + NLP (Praveer your TA!, Diffusion models, Start up progress in the space)
Thank you for a great quarter!