2026-02-26
Key References: Elhage et al. (2021) — Transformer Circuits; Olsson et al. (2022) — Induction Heads; Bricken et al. (2023) — Monosemanticity; Templeton et al. (2024) — Scaling Monosemanticity; nostalgebraist (2020) — Logit Lens
By the end of this lecture, you will be able to:
Three levels of interpretability questions, from surface to mechanism:
This lecture builds bottom-up: start from mechanisms (circuits), develop scaling tools (probes, SAEs), then ask what practice demands.
From tokens to circuits
We start at the simplest possible transformer computation — no attention, no MLPs — and build toward multi-layer circuits. Each step adds exactly one new mechanism.
The simplest “circuit” — skip all attention heads and MLPs, go directly from input to output:
| Input → | the | cat | sat | on |
|---|---|---|---|---|
| Top prediction | cat | is | on | the |
Source: Elhage et al., “A Mathematical Framework for Transformer Circuits” (2021)
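The zero-layer "circuit" above can be sketched in a few lines: with no attention or MLPs, next-token logits depend only on the current token through the bigram matrix \(W_U^T W_E\). A minimal NumPy sketch with toy, randomly initialized weights (all sizes and names here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 4                      # toy dimensions (assumed)
W_E = rng.normal(size=(d_model, vocab))    # embedding: token -> residual stream
W_U = rng.normal(size=(d_model, vocab))    # unembedding: residual stream -> logits

# Zero-layer transformer: logits for the next token depend only on the
# current token, via the bigram matrix W_U^T W_E.
bigram_logits = W_U.T @ W_E                # (vocab, vocab); column j = logits after token j
next_token = bigram_logits.argmax(axis=0)  # top-1 next-token prediction per input token
```

With trained weights, reading off `argmax` per column recovers exactly the kind of bigram predictions shown in the table above.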
Each attention head decomposes into two independent circuits:
The induction head circuit — the mechanistic basis of in-context learning:
Pair Activity (2 min)
Given this (simplified) bigram table from \(W_U^T W_E\):
| Input ↓ / Output → | the | cat | sat | on |
|---|---|---|---|---|
| the | 0.1 | 2.4 | 0.8 | 0.3 |
| cat | 1.1 | 0.2 | 2.7 | 0.6 |
| sat | 0.5 | 0.3 | 0.1 | 3.1 |
| on | 2.9 | 0.7 | 0.4 | 0.2 |
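The activity's table can be checked programmatically: take the row-wise argmax of the bigram scores. A short sketch (the scores are copied from the table above; the token order is the table's):

```python
import numpy as np

tokens = ["the", "cat", "sat", "on"]
# Bigram scores from W_U^T W_E (rows = input token, columns = output token)
B = np.array([
    [0.1, 2.4, 0.8, 0.3],   # the
    [1.1, 0.2, 2.7, 0.6],   # cat
    [0.5, 0.3, 0.1, 3.1],   # sat
    [2.9, 0.7, 0.4, 0.2],   # on
])
top = {tokens[i]: tokens[j] for i, j in enumerate(B.argmax(axis=1))}
print(top)  # {'the': 'cat', 'cat': 'sat', 'sat': 'on', 'on': 'the'}
```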
When one head’s output becomes another head’s input, three types of composition arise:
Example: The induction circuit is K-composition — the previous-token head writes to the residual stream, and the induction head reads it via its key, changing which positions match the current query.
As models grow, the number of possible interaction paths explodes:
| Model | Layers | Heads/Layer | Total Heads | 2-Head Paths | All Paths |
|---|---|---|---|---|---|
| GPT-2 Small | 12 | 12 | 144 | ~10K | ~10^6 |
| GPT-2 XL | 48 | 25 | 1,200 | ~700K | ~10^26 |
| Llama 70B | 80 | 64 | 5,120 | ~13M | ~10^100+ |
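The 2-head-path column follows from simple combinatorics: counting unordered pairs of heads, \(\binom{n}{2}\) for \(n\) total heads reproduces the ~10K / ~700K / ~13M figures. A quick check:

```python
from math import comb

# (layers, heads per layer) from the table above
models = {"GPT-2 Small": (12, 12), "GPT-2 XL": (48, 25), "Llama 70B": (80, 64)}
paths = {}
for name, (layers, heads) in models.items():
    total = layers * heads
    paths[name] = comb(total, 2)   # unordered pairs of heads
    print(f"{name}: {total} heads, {paths[name]:,} two-head paths")
```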
From circuits to scalable methods
Since we can’t trace every circuit by hand, we need tools that summarize internal computation. Each tool trades off fidelity for scalability.
Idea (nostalgebraist, 2020): At every layer, project the residual stream through the unembedding matrix to see what the model would predict if it stopped here.
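The logit lens amounts to applying the unembedding to each layer's residual stream. A minimal sketch with toy random weights (the function names and the optional final layer-norm argument are illustrative, not a specific library's API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def logit_lens(residual_by_layer, W_U, ln_f=None):
    """Project each layer's residual stream through the unembedding.

    residual_by_layer: list of (d_model,) vectors, one per layer.
    W_U: (d_model, vocab) unembedding matrix.
    ln_f: optional final layer-norm applied before unembedding.
    """
    preds = []
    for h in residual_by_layer:
        if ln_f is not None:
            h = ln_f(h)
        probs = softmax(h @ W_U)
        preds.append(int(probs.argmax()))   # "what would the model predict here?"
    return preds

# Toy demo: 3 layers, vocab of 4
rng = np.random.default_rng(1)
W_U = rng.normal(size=(8, 4))
layers = [rng.normal(size=8) for _ in range(3)]
preds = logit_lens(layers, W_U)
```

In a real model the early-layer predictions are often generic and sharpen toward the final answer in later layers, which is what the activity below asks you to sketch.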
Setup: Train a linear classifier on frozen embeddings: \(y_i = \text{softmax}(Wh_i + b)\)
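The probe setup above can be implemented directly: fit \(\text{softmax}(Wh_i + b)\) by gradient descent on frozen activations. A self-contained NumPy sketch on synthetic "activations" (the two-cluster data is fabricated for illustration):

```python
import numpy as np

def train_probe(H, y, n_classes, lr=0.1, steps=200):
    """Linear probe: softmax(W h + b) trained with cross-entropy on frozen H."""
    n, d = H.shape
    W, b = np.zeros((n_classes, d)), np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot labels
    for _ in range(steps):
        logits = H @ W.T + b
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits); P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n                           # d(loss)/d(logits)
        W -= lr * (G.T @ H)
        b -= lr * G.sum(axis=0)
    return W, b

# Toy demo: two linearly separable clusters standing in for frozen embeddings
rng = np.random.default_rng(0)
H = np.vstack([rng.normal(-2, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
W, b = train_probe(H, y, n_classes=2)
acc = ((H @ W.T + b).argmax(axis=1) == y).mean()
```

High probe accuracy shows the concept is linearly decodable from the activations; it does not by itself show the model *uses* that direction.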
Individual + Compare (3 min)
For the input “The Eiffel Tower is in”, sketch what you think the logit lens top-1 prediction would be at each layer:
| Layer | Embedding | Layer 2 | Layer 6 | Layer 10 | Layer 12 | Final |
|---|---|---|---|---|---|---|
| Top-1 | ? | ? | ? | ? | ? | ? |
| Confidence | low | ? | ? | ? | ? | high |
Superposition problem: Models encode more features than they have dimensions, causing features to overlap.
SAE objective — reconstruct activations with sparse features:
\[\mathcal{L} = \| h - \hat{h} \|_2^2 + \lambda \| z \|_1\]
where \(\hat{h} = W_{dec} \cdot \text{ReLU}(W_{enc} \cdot h + b_{enc}) + b_{dec}\) and \(z\) are the encoder activations.
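The SAE objective translates directly into code. A minimal sketch with random toy weights (dimensions and initialization are illustrative; real SAEs are trained over many activations):

```python
import numpy as np

def sae_loss(h, W_enc, b_enc, W_dec, b_dec, lam=1e-3):
    """L = ||h - h_hat||_2^2 + lam * ||z||_1, h_hat = W_dec z + b_dec."""
    z = np.maximum(0.0, W_enc @ h + b_enc)    # sparse codes: ReLU(W_enc h + b_enc)
    h_hat = W_dec @ z + b_dec                  # reconstruction
    recon = np.sum((h - h_hat) ** 2)           # squared L2 reconstruction error
    sparsity = lam * np.sum(np.abs(z))         # L1 penalty on feature activations
    return recon + sparsity, z

rng = np.random.default_rng(0)
d, m = 8, 32                                   # residual dim, dictionary size (m >> d)
W_enc = 0.1 * rng.normal(size=(m, d))
W_dec = 0.1 * rng.normal(size=(d, m))
b_enc, b_dec = np.zeros(m), np.zeros(d)
loss, z = sae_loss(rng.normal(size=d), W_enc, b_enc, W_dec, b_dec)
```

The overcomplete dictionary (`m > d`) is what lets the SAE pull superposed features apart into (ideally) monosemantic directions.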
Activation patching: Replace one activation with a value from a different context; measure the output change.
Path patching extends this to trace specific information routes:
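Activation patching can be sketched on a toy two-layer model: cache an activation from a "clean" run, splice it into a "corrupted" run, and measure the output change. Everything below is a fabricated minimal example, not a real model:

```python
import numpy as np

def run_model(x, W1, W2, patch=None):
    """Toy 2-layer model; optionally replace the hidden activation (patching)."""
    h = np.tanh(W1 @ x)          # intermediate activation at the patching site
    if patch is not None:
        h = patch                # splice in the cached activation from another run
    return W2 @ h

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(6, 4)), rng.normal(size=(2, 6))
x_clean, x_corrupt = rng.normal(size=4), rng.normal(size=4)

h_clean = np.tanh(W1 @ x_clean)                     # cache from the clean run
out_corrupt = run_model(x_corrupt, W1, W2)
out_patched = run_model(x_corrupt, W1, W2, patch=h_clean)
effect = np.linalg.norm(out_patched - out_corrupt)  # how much this site matters
```

A large `effect` implicates that activation site in the behavior; path patching refines this by patching only along specific routes between components.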
From understanding to deployment
Mechanistic knowledge is only useful if it changes what you do. This section asks: when does circuit-level understanding actually help in practice?
Core idea: Find a direction in activation space that corresponds to a concept, then add or subtract it at inference time.
\[v_{\text{steer}} = \text{mean}(h_{+}) - \text{mean}(h_{-})\]
where \(h_+\) are activations from positive examples and \(h_-\) from negative examples.
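The difference-of-means steering vector above is a few lines of NumPy. In this synthetic sketch the "concept direction" is planted by construction so we can verify the recovered vector aligns with it (all data here is fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)             # planted ground-truth direction

# Toy "activations": positive/negative examples = +-concept plus noise
h_pos = concept + 0.1 * rng.normal(size=(100, d))
h_neg = -concept + 0.1 * rng.normal(size=(100, d))

v_steer = h_pos.mean(axis=0) - h_neg.mean(axis=0)   # mean(h+) - mean(h-)
v_steer /= np.linalg.norm(v_steer)

# At inference, nudge the residual stream along the concept direction:
h = rng.normal(size=d)
h_steered = h + 4.0 * v_steer                  # 4.0 = assumed steering strength
sim = float(v_steer @ concept)                 # alignment with planted direction
```

The steering strength and the layer at which the vector is added are hyperparameters in practice; too large a coefficient degrades fluency.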
Examples from representation engineering (Zou et al., 2023):
| | Traditional Red Teaming | Mechanistic Red Teaming |
|---|---|---|
| Approach | Craft adversarial inputs, observe outputs | Identify vulnerable circuits, predict failure modes |
| Evidence | Behavioral (found an exploit) | Mechanistic (understand why it fails) |
| Scalability | Scales with human effort / automation | Currently limited to smaller models / specific circuits |
| Fixes | Patch the input (filter, RLHF) | Patch the circuit (ablation, steering, editing) |
| Maturity | Production-ready | Research stage |
Pair Debate (3 min)
For each scenario, argue: is behavioral testing sufficient, or do you need mechanistic understanding?
For each: What could go wrong? What level of evidence (from the hierarchy) do you need?
A realistic 4-step workflow for incorporating interpretability into a deployed LLM system:
Where are we, really?
Interpretability has made remarkable progress. It also has fundamental unsolved problems. This section is about being honest about both.
Rapid Vote — Whole Class (2 min)
For each claim, vote: Works (solid evidence), Open (active research, unclear), or Wishful (no good evidence yet).
(Show of hands for each. Instructor reveals suggested answers after all votes.)
References & Resources
March 3: Evaluation and Benchmarking