Part 2: Self-Attention Mechanism
Self-attention projects each token into query, key, and value vectors
For input embedding \(\mathbf{x}_i\):
\[
\mathbf{q}_i = \mathbf{x}_i W^Q, \quad \mathbf{k}_i = \mathbf{x}_i W^K, \quad \mathbf{v}_i = \mathbf{x}_i W^V
\]
\(W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}\) are learned projection matrices
Q, K, V are all computed from the same input—just different projections
This is a common source of confusion for students
The names come from database retrieval analogies
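The projection step can be sketched in NumPy with toy dimensions (random matrices stand in for the learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4                # toy sizes: 5 tokens, model dim 8, head dim 4

X = rng.normal(size=(n, d))        # input embeddings, one row per token
W_Q = rng.normal(size=(d, d_k))    # random stand-ins for the learned projections
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_k))

# All three projections come from the same input X
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)   # (5, 4) (5, 4) (5, 4)
```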
The QKV intuition: retrieval from a soft dictionary
Query (Q): "What am I looking for?" (the current token's information needs)
Key (K): "What do I contain?" (what information this token offers)
Value (V): "What do I provide?" (the actual information to be retrieved)
Analogy: Like a search engine where:
Query = your search terms
Key = document titles/metadata
Value = document contents
The query “searches” all keys to find relevant information
High query-key similarity means “this value is relevant to my query”
Unlike hard retrieval, attention computes a soft weighted average
Attention scores measure relevance via dot products
\[
\text{score}(i, j) = \mathbf{q}_i \cdot \mathbf{k}_j = \mathbf{q}_i \mathbf{k}_j^\top
\]
Computing attention for "sat" in "The cat sat on the mat":

| Token j | q_sat · k_j | Interpretation          |
|---------|-------------|-------------------------|
| The     | 0.8         | Low relevance           |
| cat     | 4.2         | High: subject of "sat"  |
| on      | 1.5         | Medium                  |
| the     | 0.6         | Low relevance           |
| mat     | 3.8         | High: location of "sat" |
Dot product measures vector similarity (recall from vector semantics)
Larger dot product = more aligned representations = more relevance
“sat” attends strongly to its subject (“cat”) and location (“mat”)
Scaling prevents softmax saturation
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]
Why divide by \(\sqrt{d_k}\)?

Without scaling:
scores = [8.5, 42, 3.2, 42]
softmax ≈ [0, 0.5, 0, 0.5]
Extreme values → near one-hot → vanishing gradients

With scaling (\(d_k = 64\), so divide by 8):
scores/8 = [1.1, 5.3, 0.4, 5.3]
softmax ≈ [0.01, 0.49, 0.01, 0.49]
Moderate values → smooth distribution → healthy gradients
Dot products grow with dimension d_k (variance ≈ d_k for random vectors)
Scaling keeps variance ≈ 1 regardless of dimension
This is the “scaled” in “scaled dot-product attention”
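The effect of the \(\sqrt{d_k}\) divisor can be checked numerically with the scores above (a minimal sketch; the softmax helper is our own):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d_k = 64
raw = np.array([8.5, 42.0, 3.2, 42.0])

unscaled = softmax(raw)                  # near one-hot, split over the two 42s
scaled = softmax(raw / np.sqrt(d_k))     # divide by sqrt(64) = 8
print(unscaled.round(2))                 # ≈ [0, 0.5, 0, 0.5]
print(scaled.round(2))                   # ≈ [0.01, 0.49, 0, 0.49]
```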
The output is a weighted sum of value vectors
\[
\text{output}_i = \sum_{j=1}^n \alpha_{ij} \mathbf{v}_j
\]
where \(\alpha_{ij} = \dfrac{\exp(\mathbf{q}_i \cdot \mathbf{k}_j / \sqrt{d_k})}{\sum_{j'=1}^{n} \exp(\mathbf{q}_i \cdot \mathbf{k}_{j'} / \sqrt{d_k})}\), i.e., the softmax over \(j\) of the scaled scores
Output for "sat" = weighted combination of all values
High attention weight → that position contributes more
Context is incorporated through this weighted aggregation
The output for each position is a mixture of all value vectors
This is computed for all positions in parallel
The attention weights determine the “mixture recipe”
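A minimal single-head implementation of the full computation, in NumPy with toy shapes:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)    # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over j: the alpha_ij
    return weights @ V                              # output_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)   # (5, 4): all positions computed in one matrix product
```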
Multi-head attention runs several attention operations in parallel
\[
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\]
\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
\]
Each head has its own \(W^Q, W^K, W^V\) matrices
Heads learn to specialize in different relationship types
Typical: 8-16 heads with \(d_k = d_{\text{model}} / h\)
Different heads empirically learn different functions
Some track syntax, some track semantics, some track position
This is more expressive than one large attention mechanism
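A sketch of multi-head attention with toy sizes (h = 2 heads; the softmax helper and parameter layout are illustrative choices, not a library API):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, head_params, W_O):
    """head_params: one (W_Q, W_K, W_V) triple per head; concat heads, project with W_O."""
    heads = []
    for W_Q, W_K, W_V in head_params:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # per-head attention weights
        heads.append(A @ V)                           # each head output: (n, d_k)
    return np.concatenate(heads, axis=-1) @ W_O       # (n, h*d_k) @ (h*d_k, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
d_k = d_model // h                                    # d_k = d_model / h
head_params = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
               for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
X = rng.normal(size=(n, d_model))
out = multi_head_attention(X, head_params, W_O)
print(out.shape)   # (5, 8): same model dimension in and out
```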
Different heads capture different linguistic relationships
Sentence: "The lawyer who the witness saw left"
Head A: Subject-verb
"left" attends strongly to "lawyer"
(captures who performed the action)
Head B: Relative clause
"saw" attends to "witness" and "lawyer"
(tracks nested clause structure)
Research finding: Attention heads in trained models often align with linguistic structure (“What Does BERT Look At? An Analysis of BERT’s Attention” Clark et al., 2019).
This specialization emerges from training, not explicit design
Some heads attend to adjacent tokens, others to distant ones
Probing studies reveal interpretable patterns in many heads
The feed-forward network processes each position independently
\[
\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
\]
or with GELU (used in GPT-2 and later):
\[
\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2
\]
No interaction between positions: applied identically to each token
Expands to 4× dimension, then projects back
Attention handles token interactions; FFN handles per-token processing
The 4x expansion is a common hyperparameter choice
FFN contains most of the model’s parameters
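A toy NumPy version of the position-wise FFN (random weights, illustrative sizes):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to 4d with ReLU, then project back to d."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # same weights at every position

rng = np.random.default_rng(0)
d = 8
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)   # 4x expansion
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)       # project back to d
x = rng.normal(size=(5, d))                             # 5 token representations
out = ffn(x, W1, b1, W2, b2)
print(out.shape)   # (5, 8): applied per token, dimension preserved
# Weight count: 2 * 4 * d^2 per block, vs. 4 * d^2 for the attention projections
```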
Residual connections and layer normalization stabilize training
Residual connection: \[
\text{output} = \text{sublayer}(x) + x
\]
Layer normalization: \[
\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
\]
Why Residuals?
Gradient flows directly through addition
Enables training very deep networks
Each layer learns a "delta"
Why LayerNorm?
Normalizes activations per token
Stabilizes training dynamics
γ, β are learned parameters
Without residuals, gradients vanish in deep networks
LayerNorm is applied after (Post-LN) or before (Pre-LN) sublayers
Modern models mostly use Pre-LN for stability
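A minimal Pre-LN sublayer wrapper in NumPy (the toy sublayer is a stand-in for attention or the FFN; γ and β would be learned in practice):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)       # per-token mean
    sigma = x.std(axis=-1, keepdims=True)     # per-token std
    return gamma * (x - mu) / (sigma + eps) + beta

def pre_ln_sublayer(x, sublayer, gamma, beta):
    """Pre-LN: normalize first, apply the sublayer, then add the residual."""
    return x + sublayer(layer_norm(x, gamma, beta))

d = 8
x = np.arange(5 * d, dtype=float).reshape(5, d)
gamma, beta = np.ones(d), np.zeros(d)
out = pre_ln_sublayer(x, lambda h: 0.1 * h, gamma, beta)  # toy sublayer
print(out.shape)   # (5, 8); gradients always have the direct "+ x" path
```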
Transformers maintain uniform dimensionality throughout
Embed [n × d] → Block 1 [n × d] → Block 2 [n × d] → ... → Block L [n × d] → Output [n × d]
Every token representation is \(d\)-dimensional throughout
No reshaping between layers—just stack and go
Simplifies architecture and enables easy layer scaling
This is different from CNNs where dimensions often change
Common values: d=768 (BERT-base), d=1024 (GPT-2-medium), d=4096 (LLaMA-7B)
The uniformity is key to the architecture’s simplicity
Part 4: Causal Masking and Positional Encoding
Causal masking prevents attending to future tokens
For autoregressive generation (GPT, LLaMA, Claude):
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)V
\]
where \(M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}\)
| Query ↓ Key → | The | cat | sat | on |
|---------------|-----|-----|-----|----|
| The           | ✓   | -∞  | -∞  | -∞ |
| cat           | ✓   | ✓   | -∞  | -∞ |
| sat           | ✓   | ✓   | ✓   | -∞ |
| on            | ✓   | ✓   | ✓   | ✓  |
Adding -∞ before softmax gives attention weight of 0
The mask is a lower-triangular matrix
This ensures training matches inference (where future doesn’t exist)
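The mask can be built and checked in a few lines of NumPy:

```python
import numpy as np

n = 4
# M[i, j] = 0 if j <= i, -inf if j > i (strictly-upper triangle blocked)
M = np.triu(np.full((n, n), -np.inf), k=1)

scores = np.ones((n, n))                       # pretend all raw scores are equal
masked = scores + M
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)    # softmax turns -inf into exactly 0
print(weights.round(2))
# Row i spreads its weight uniformly over positions 0..i; future columns get 0
```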
Causal masking enables the language modeling objective
Autoregressive factorization:
\[
P(x_1, \ldots, x_n) = \prod_{t=1}^n P(x_t \mid x_1, \ldots, x_{t-1})
\]
Predicting Each Token
P(The) → predict first token
P(cat | The) → see "The"
P(sat | The cat) → see "The cat"
At each position, predict the next token using only past context
Training and inference use the same masking: no distribution shift
This is why GPT-style models are called “decoder-only”
BERT uses bidirectional attention (no mask) but can’t generate
The masking choice fundamentally determines what the model can do
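A toy illustration of the training objective: the probability rows below are made up for illustration, but the negative log-likelihood computation is the standard language-modeling loss:

```python
import numpy as np

# Toy vocabulary and the sequence "the cat sat" as token ids
vocab = ["the", "cat", "sat", "on"]
seq = [0, 1, 2]

# Hypothetical model outputs: one probability row per position, each predicting
# the NEXT token from past context only (numbers are invented for illustration)
probs = np.array([
    [0.1, 0.7, 0.1, 0.1],   # P(x_2 | x_1): "cat" likely after "the"
    [0.1, 0.1, 0.6, 0.2],   # P(x_3 | x_1, x_2): "sat" likely after "the cat"
])

targets = seq[1:]   # the true next tokens: [1, 2]
nll = -np.mean(np.log(probs[np.arange(len(targets)), targets]))
print(round(nll, 3))   # ≈ 0.434: mean negative log-likelihood (the LM loss)
```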
Encoder-only vs. decoder-only architectures
Encoder-Only (BERT)
See all tokens at once
Good for classification, NER
Cannot generate text
Decoder-Only (GPT)
See only past tokens
Good for generation, chat
Dominant for modern LLMs
Encoder-Decoder (T5)
Encoder sees all input
Decoder generates output
Good for translation, summarization
The original transformer was encoder-decoder for translation
BERT showed encoder-only is great for understanding tasks
GPT showed decoder-only can do everything with enough scale
Type 1: Bidirectional self-attention in the encoder
Used in: Encoder layers of transformer, BERT
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]
Q, K, V all come from the same input
Encoder input X → Q = XW^Q, K = XW^K, V = XW^V
No masking: Every position can attend to every other position
Captures full bidirectional context for understanding tasks
This is the simplest form of attention
Perfect for encoding input where you have the full sequence
BERT uses only this type of attention
Bidirectional attention: full connectivity
| Query ↓ Key → | The | cat | sat | down |
|---------------|-----|-----|-----|------|
| The           | ✓   | ✓   | ✓   | ✓    |
| cat           | ✓   | ✓   | ✓   | ✓    |
| sat           | ✓   | ✓   | ✓   | ✓    |
| down          | ✓   | ✓   | ✓   | ✓    |
All n² attention weights are computed
Each token’s output incorporates information from the entire sequence
Ideal for encoding where full context is available
This is what makes BERT good at understanding tasks
But you can’t generate with bidirectional attention—you’d be cheating
The encoder “reads” the entire input before producing any output
Type 2: Masked self-attention in the decoder
Used in: Decoder layers, GPT, LLaMA, Claude
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)V
\]
Q, K, V all come from decoder input (with causal mask)
Decoder input Y → Q = YW^Q, K = YW^K, V = YW^V, plus causal mask M
Lower-triangular mask: Position i can only attend to positions ≤ i
Enables autoregressive generation
This is the attention pattern we discussed earlier with the mask
Critical for language modeling: predict next token from past only
Modern LLMs use only this type (they’re “decoder-only”)
Masked attention: lower-triangular connectivity
| Query ↓ Key → | Le | chat | est | assis |
|---------------|----|------|-----|-------|
| Le            | ✓  | ✗    | ✗   | ✗     |
| chat          | ✓  | ✓    | ✗   | ✗     |
| est           | ✓  | ✓    | ✓   | ✗     |
| assis         | ✓  | ✓    | ✓   | ✓     |
The decoder is generating a French translation
Each output token only sees previously generated tokens
Prevents “cheating” by looking at future outputs during training
The mask is applied before softmax by adding -∞ to blocked positions
During inference, future tokens literally don’t exist yet
Training with mask matches inference conditions
Type 3: Cross-attention connects decoder to encoder
Used in: Encoder-decoder models (original transformer, T5, BART)
Q from decoder, K and V from encoder
Decoder state Y → Q = YW^Q; encoder output Z → K = ZW^K, V = ZW^V
\[
\text{CrossAttention}(Y, Z) = \text{softmax}\left(\frac{(YW^Q)(ZW^K)^\top}{\sqrt{d_k}}\right)(ZW^V)
\]
This is the key mechanism that connects encoder and decoder
The decoder “asks questions” about the encoder’s representation
Query comes from what we’re generating; K,V from what we’re translating
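A sketch of cross-attention with toy shapes (note the rectangular m × n score matrix; weights and sizes are illustrative):

```python
import numpy as np

def cross_attention(Y, Z, W_Q, W_K, W_V):
    """Q from decoder states Y; K and V from encoder outputs Z. No mask needed."""
    Q, K, V = Y @ W_Q, Z @ W_K, Z @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (m, n): decoder x encoder
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)    # each decoder row sums to 1
    return weights @ V                             # (m, d_k)

rng = np.random.default_rng(0)
m, n, d, d_k = 4, 6, 8, 4       # 4 decoder tokens, 6 encoder tokens
Y = rng.normal(size=(m, d))     # decoder states
Z = rng.normal(size=(n, d))     # encoder outputs
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
out = cross_attention(Y, Z, W_Q, W_K, W_V)
print(out.shape)   # (4, 4): one output per decoder position
```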
Cross-attention: decoder queries encoder
English (Encoder): "The cat sat down"
French (Decoder): "Le chat est assis"
| Decoder ↓ Encoder → | The  | cat  | sat  | down |
|---------------------|------|------|------|------|
| Le                  | 0.85 | 0.10 | 0.03 | 0.02 |
| chat                | 0.08 | 0.82 | 0.05 | 0.05 |
| est                 | 0.05 | 0.15 | 0.70 | 0.10 |
| assis               | 0.02 | 0.08 | 0.50 | 0.40 |
Matrix is [decoder length × encoder length]—not square!
“assis” attends to both “sat” and “down” (semantic alignment)
Cross-attention learns soft word alignment during training
This is essentially a learned attention-based alignment model
No explicit masking needed—decoder can see all encoder positions
Comparing the three attention types
|                  | Encoder Self-Attn | Masked Self-Attn   | Cross-Attention |
|------------------|-------------------|--------------------|-----------------|
| Query source     | Encoder input     | Decoder input      | Decoder state   |
| Key/Value source | Encoder input     | Decoder input      | Encoder output  |
| Masking          | None              | Causal (lower-tri) | None            |
| Matrix shape     | [n × n]           | [m × m]            | [m × n]         |
| Purpose          | Understand input  | Generate output    | Connect both    |

where n = encoder sequence length, m = decoder sequence length
The three attention types form the complete encoder-decoder architecture
All use the same scaled dot-product attention formula
Only difference is where Q, K, V come from and whether masking is applied
Modern architectures often simplify
Encoder-Only (BERT)
Uses: Bidirectional self-attention only
Great for understanding/classification
Cannot generate text autoregressively
Examples: BERT, RoBERTa, ELECTRA
Decoder-Only (GPT)
Uses: Masked self-attention only
Great for generation and chat
Scales very well with compute
Examples: GPT, LLaMA, Claude
Key insight: Decoder-only models have become dominant because they’re simpler and scale better, while still being able to “understand” via in-context learning.
The original encoder-decoder was designed for translation
It turns out decoder-only can do translation too, given enough scale
Simplicity of decoder-only architecture enables easier scaling
Summary: Day 1 Key Takeaways
Attention replaces recurrence: Direct connections between all positions enable parallelism and long-range dependencies
QKV mechanism: Query asks, key answers, value provides—scaled dot-product measures relevance
Multi-head attention: Multiple heads capture diverse relationship types in parallel
Transformer blocks: Attention + FFN with residuals and layer norm, stacked L times
Causal masking: Prevents attending to future tokens, enabling autoregressive generation
Position encodings: Break permutation invariance to preserve sequence order
These concepts are foundational for understanding all modern LLMs
Tomorrow: How these architectures are trained at scale
Day 3: What capabilities emerge and how to evaluate them
Coming Up Next
LLM Training and Alignment
Self-supervised pretraining at scale
Instruction tuning (SFT)
Preference alignment with RLHF
Scaling laws and compute-optimal training
Reading:
J&M Chapter 8 (Transformers)
“Attention Is All You Need” (Vaswani et al., 2017)