Recap: Attention Essentials
Recall: Scaled dot-product attention
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]
- Q, K, V are linear projections of the input: \(Q = XW^Q\), \(K = XW^K\), \(V = XW^V\)
- Scaling by \(\sqrt{d_k}\) prevents softmax saturation
- Output for each position = weighted sum of all value vectors
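The formula above can be sketched as a minimal single-head implementation in NumPy (the toy shapes and random input here are illustrative, not from any particular model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: Q, K of shape (T, d_k), V of shape (T, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (T, T) similarity matrix
    # Row-wise softmax (max-subtracted for numerical stability)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                 # each output row = weighted sum of value rows

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                  # 4 tokens, d_model = 8
W_Q, W_K, W_V = (rng.standard_normal((8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```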
Recall: Why scale by \(\sqrt{d_k}\)? Preventing softmax saturation
For random vectors \(\mathbf{q}, \mathbf{k} \in \mathbb{R}^{d_k}\) with entries ~ \(\mathcal{N}(0,1)\): \(\text{Var}(\mathbf{q} \cdot \mathbf{k}) = d_k\)
Without scaling (\(d_k = 64\))
Dot products have std dev ≈ 8
scores = [12, 8, 5, 1]
softmax ≈ [0.981, 0.018, 0.001, 0.000]
Near one-hot → vanishing gradients for non-max tokens
With scaling: scores / √64 = scores / 8
Dot products rescaled to std dev ≈ 1
scaled = [1.5, 1.0, 0.625, 0.125]
softmax ≈ [0.439, 0.266, 0.183, 0.111]
Smooth distribution → healthy gradients for all tokens
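The two columns above can be reproduced directly with a standard max-subtracted softmax:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())      # subtract max for numerical stability
    return e / e.sum()

scores = np.array([12.0, 8.0, 5.0, 1.0])   # unscaled dot products, std dev ~ 8
print(softmax(scores))                      # near one-hot
print(softmax(scores / np.sqrt(64)))        # smooth distribution
```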
Why \(\sqrt{d_k}\)? The variance argument
Assume \(q_i, k_i \sim \text{i.i.d.}\) with mean 0, variance 1. The dot product is:
\[
q \cdot k = \sum_{i=1}^{d_k} q_i \, k_i
\]
Each term \(q_i k_i\) has: \(\;\mathbb{E}[q_i k_i] = 0\), \(\;\text{Var}(q_i k_i) = 1\)
By independence, the variance of the sum is:
\[
\text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = d_k
\]
So dot products grow as \(O(d_k)\). Dividing by \(\sqrt{d_k}\) restores unit variance:
\[
\text{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1
\]
Recall: Multi-head attention and causal masking
Multi-head: Run \(h\) parallel attention operations with \(d_k = d_{model}/h\)
\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
\]
Causal masking: Prevent attending to future tokens
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)V
\]
\[
\text{where } M_{ij} = \begin{cases} 0 & j \leq i \\ -\infty & j > i \end{cases}
\]
Multi-head: each head learns different relationships
Causal mask: lower-triangular; token i sees only positions ≤ i
Transformer block: attention + FFN + residuals + LayerNorm
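The mask definition above can be sketched in NumPy; with all-zero scores, the masked softmax becomes uniform over the visible prefix, making the lower-triangular structure easy to see:

```python
import numpy as np

def causal_mask(T):
    """M[i, j] = 0 for j <= i, -inf for j > i (added to scores before softmax)."""
    return np.where(np.tril(np.ones((T, T), dtype=bool)), 0.0, -np.inf)

scores = np.zeros((4, 4)) + causal_mask(4)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = w / w.sum(axis=-1, keepdims=True)
print(weights)   # row i is uniform over positions 0..i, zero above the diagonal
```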
Part 3: Pretraining at Scale
The pretraining objective: predict the next token
Causal Language Modeling (CLM) loss:
\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})
\]
Training Example: "The quick brown fox jumps"
| Context | Target | Loss contribution |
|---|---|---|
| &lt;bos&gt; | The | −log P(The \| &lt;bos&gt;) |
| The | quick | −log P(quick \| The) |
| The quick | brown | −log P(brown \| The quick) |
| The quick brown | fox | −log P(fox \| The quick brown) |
| The quick brown fox | jumps | −log P(jumps \| The quick brown fox) |
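The per-position terms in the table sum to the CLM loss. A minimal NumPy sketch, with random logits standing in for a real model's output (the loss is averaged over positions rather than summed, as is standard in practice):

```python
import numpy as np

def clm_loss(logits, tokens):
    """CLM loss: logits has shape (T, V); tokens has shape (T+1,).
    Position t's logits predict token t+1, so targets are tokens shifted by one."""
    targets = tokens[1:]
    z = logits - logits.max(axis=-1, keepdims=True)              # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.standard_normal((5, 10))      # T = 5 positions, vocab of 10 (toy)
tokens = np.array([0, 3, 1, 4, 1, 5])      # id 0 stands in for <bos>
print(clm_loss(logits, tokens))            # mean of -log P(x_t | x_<t) over the 5 targets
```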
Self-supervision: the data labels itself
Traditional supervised learning: image → human labels "cat"; requires expensive manual annotation.
Self-supervised (LLMs): "The cat" → next word is "sat"; labels come from the text itself, free and unlimited.
Key insight: The structure of language provides free supervision at massive scale
Next-token prediction implicitly learns many skills
To predict the next token well, the model must learn:
Grammar & Syntax
"She runs" not "She run"
World Knowledge
"Paris is the capital of France"
Reasoning Patterns
"If A then B. A is true. Therefore B"
Style & Tone
Formal vs. casual, technical vs. simple
The simple objective captures complex structure
Figure: GPT-2 paper
Scaling laws describe predictable improvement with resources
Empirical finding (Kaplan et al., 2020; Hoffmann et al., 2022):
\[
L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty
\]
N (parameters): more parameters → lower loss
D (data, tokens): more data → lower loss
C (compute, FLOPs): more FLOPs → lower loss
Loss decreases as a power law in each resource
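The power-law form can be explored numerically. The constants below are illustrative placeholders on the rough order of published fits, not exact values; the qualitative behavior, predictable diminishing returns in each resource, is the point:

```python
# Illustrative constants only -- not fitted values from any paper.
N_c, D_c = 8.8e13, 5.4e13
alpha_N, alpha_D = 0.076, 0.095
L_inf = 1.69

def loss(N, D):
    """Loss as a power law in parameters N and tokens D, plus an irreducible floor."""
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D + L_inf

for N in [1e9, 1e10, 1e11]:                # scaling N at fixed data
    print(f"N={N:.0e}: L={loss(N, 2e11):.3f}")
```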
Figure: scaling law curves (Kaplan et al., 2020)
Chinchilla scaling: balance parameters and data
Key insight (Hoffmann et al., 2022):
For a fixed compute budget, there’s an optimal ratio:
\[
N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}
\]
Rule of thumb: Train on ~20 tokens per parameter
Undertrained: GPT-3, 175B params, 300B tokens (~1.7 tokens/param)
Compute-optimal: Chinchilla, 70B params, 1.4T tokens (20 tokens/param)
Inference-optimal: LLaMA 2, 7B params, 2T tokens (~285 tokens/param)
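Combining the 20-tokens-per-parameter rule with the common estimate \(C \approx 6ND\) training FLOPs pins down the compute-optimal split for any budget. A sketch (the \(C \approx 6ND\) approximation is an assumption of this example, not the papers' fitting procedure):

```python
def chinchilla_split(C):
    """Split a FLOPs budget C so that D = 20 * N (tokens = 20 x params),
    using the common training-compute estimate C ~= 6 * N * D."""
    # C = 6 * N * (20 * N) = 120 * N^2  =>  N = sqrt(C / 120)
    N = (C / 120) ** 0.5
    D = 20 * N
    return N, D

for C in [1e21, 1e23, 5.76e23]:
    N, D = chinchilla_split(C)
    print(f"C={C:.2e} FLOPs -> N ~ {N:.2e} params, D ~ {D:.2e} tokens")
# The last budget (~5.8e23 FLOPs) recovers roughly 70B params / 1.4T tokens
```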