2026-01-20
Content derived from: J&M Ch. 3
\[P(w_1, w_2, \ldots, w_n)\]
\[P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})\]
Example: \(P(\text{the cat sat}) = P(\text{the}) \times P(\text{cat} \mid \text{the}) \times P(\text{sat} \mid \text{the cat})\)
But: Each conditional still depends on unbounded history!
Key insight: Approximate by limiting history (Markov assumption)
\[ P(\mathbf{w}) \approx \prod_{n=1}^T P(w_n \mid w_{n-(N-1)}, \ldots, w_{n-1}) \]
Sentence: The cat sat on the mat
Unigram: The cat sat on the ___ → P(mat)
Bigram: The cat sat on the ___ → P(mat | the)
Trigram: The cat sat on the ___ → P(mat | on, the)
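A minimal Python sketch of how each model order restricts the history it conditions on (the helper name `ngram_context` is made up for illustration):

```python
def ngram_context(tokens, i, n):
    """Context an order-n model conditions on when predicting tokens[i]:
    only the previous n-1 words (the Markov assumption)."""
    return tuple(tokens[max(0, i - (n - 1)):i])

sentence = "The cat sat on the mat".split()
i = sentence.index("mat")
print(ngram_context(sentence, i, 1))  # ()             -> P(mat)
print(ngram_context(sentence, i, 2))  # ('the',)       -> P(mat | the)
print(ngram_context(sentence, i, 3))  # ('on', 'the')  -> P(mat | on, the)
```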
Concept Check
If you have a vocabulary of 10,000 words, how many possible bigrams exist? How many trigrams? What does this imply for data requirements?
\[ P(w_n \mid w_1, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-k}, \ldots, w_{n-1}) \]
"The keys to the cabinet in the corner of the room are on the table."
(The agreement between "keys" and "are" spans many intervening words, which a short n-gram window cannot capture.)
\[ P_{\text{MLE}}(w_n \mid w_{n-1}, \ldots, w_1) = \frac{C(w_1, \ldots, w_{n-1}, w_n)}{C(w_1, \ldots, w_{n-1})} \]
| Bigram | Count |
|---|---|
| I want | 2 |
| want to | 2 |
| to eat | 1 |
| to sleep | 1 |
| to go | 1 |
| I need | 1 |
| need to | 1 |
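The counts above can be reproduced and turned into MLE probabilities with a few lines of Python; the three-sentence corpus below is an assumption chosen to be consistent with the table, not part of the original data:

```python
from collections import Counter

corpus = [
    "I want to eat".split(),
    "I want to sleep".split(),
    "I need to go".split(),
]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    bigram_counts.update(zip(sent, sent[1:]))   # C(w_{n-1}, w_n)
    context_counts.update(sent[:-1])            # C(w_{n-1}): count a word only when it has a next word

def p_mle(w, prev):
    """P_MLE(w | prev) = C(prev, w) / C(prev); unseen contexts are not handled here."""
    return bigram_counts[(prev, w)] / context_counts[prev]

print(bigram_counts[("I", "want")])   # 2
print(p_mle("to", "want"))            # 2/2 = 1.0
print(p_mle("eat", "to"))             # 1/3 ≈ 0.33
```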
Concept Check
If your training corpus never contains “quantum computing,” what probability will an MLE bigram model assign to “quantum computing” in test data? What’s wrong with this?
\[ P_{\text{Laplace}}(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i) + \alpha}{C(w_{i-1}) + \alpha |V|} \]
\[ c^*(r) = \frac{(r+1) N_{r+1}}{N_r} \]
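A small sketch of the Good-Turing re-estimate, where \(N_r\) is the number of n-gram types observed exactly \(r\) times (the function name and input format are assumptions):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """c*(r) = (r + 1) * N_{r+1} / N_r.  Returns adjusted counts for each r
    with a nonzero N_{r+1}; in practice large r's are usually left unadjusted."""
    N = Counter(ngram_counts.values())   # N_r: number of types seen exactly r times
    return {r: (r + 1) * N[r + 1] / N[r] for r in N if N[r + 1] > 0}

# Example: if 100 types were seen once and 30 were seen twice, c*(1) = 2 * 30 / 100 = 0.6
```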
\[ P_{\text{KN}}(w_i|w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - D, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) P_{\text{continuation}}(w_i) \]
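A minimal interpolated Kneser-Ney sketch for bigrams, assuming the corpus is given as lists of tokens and using a fixed discount \(D = 0.75\); padding, unknown words, and higher orders are omitted:

```python
from collections import Counter

def train_kneser_ney_bigram(corpus, D=0.75):
    bigram_counts = Counter()
    context_counts = Counter()
    for sent in corpus:
        for prev, w in zip(sent, sent[1:]):
            bigram_counts[(prev, w)] += 1
            context_counts[prev] += 1

    # Continuation probability: fraction of distinct bigram types ending in w.
    continuation_types = Counter(w for (_, w) in bigram_counts)
    total_bigram_types = len(bigram_counts)

    # Number of distinct word types observed after each context (used in lambda).
    followers = Counter(prev for (prev, _) in bigram_counts)

    def p(w, prev):
        p_cont = continuation_types[w] / total_bigram_types
        c_prev = context_counts[prev]
        if c_prev == 0:                        # unseen context: fall back to continuation prob.
            return p_cont
        discounted = max(bigram_counts[(prev, w)] - D, 0) / c_prev
        lam = D * followers[prev] / c_prev     # probability mass reserved for the lower order
        return discounted + lam * p_cont

    return p
```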
\[P_{\text{interp}}(w_n | w_{n-2}, w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n | w_{n-1}) + \lambda_3 P_3(w_n | w_{n-2}, w_{n-1})\]
\[\hat{\lambda} = \arg\max_{\lambda} \prod_{w \in \text{dev}} P_{\text{interp}}(w | \text{context})\]
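A sketch of interpolation plus a coarse grid search for the \(\lambda\)s on held-out data (EM is the standard way to fit them); the probability functions and the dev-set format are assumptions:

```python
import math

def interp_prob(w, context, p_uni, p_bi, p_tri, lambdas):
    """lambda1*P1(w) + lambda2*P2(w | w_{n-1}) + lambda3*P3(w | w_{n-2}, w_{n-1})."""
    l1, l2, l3 = lambdas
    w2, w1 = context
    return l1 * p_uni(w) + l2 * p_bi(w, w1) + l3 * p_tri(w, w2, w1)

def tune_lambdas(dev_events, p_uni, p_bi, p_tri, steps=10):
    """Pick the lambdas on the simplex that maximize dev-set log-likelihood.
    dev_events is a list of ((w_{n-2}, w_{n-1}), w_n) pairs from held-out text."""
    best, best_ll = None, -math.inf
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            lambdas = (i / steps, j / steps, (steps - i - j) / steps)
            ll = sum(math.log(interp_prob(w, ctx, p_uni, p_bi, p_tri, lambdas) + 1e-12)
                     for ctx, w in dev_events)
            if ll > best_ll:
                best, best_ll = lambdas, ll
    return best
```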
| Approach | Strategy | When to use lower-order |
|---|---|---|
| Interpolation | Always mix all orders | Every prediction |
| Backoff | Use highest order available | Only when higher-order count = 0 |
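A sketch of the backoff strategy in the table (strategy only: real backoff schemes such as Katz also discount and renormalize so the result remains a proper distribution); the count dictionaries are assumed inputs:

```python
def backoff_prob(w, context, tri, bi, uni, total_tokens):
    """Use the highest-order estimate whose count is nonzero, otherwise back off."""
    w2, w1 = context
    if tri.get((w2, w1, w), 0) > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]      # trigram estimate
    if bi.get((w1, w), 0) > 0:
        return bi[(w1, w)] / uni[w1]                # bigram estimate
    return uni.get(w, 0) / total_tokens             # unigram estimate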
Concept Check
If λ₁ = 0.1, λ₂ = 0.3, λ₃ = 0.6, and you encounter a context never seen in training, which model contributes most to the prediction? What if the trigram has a reliable estimate?
| \(C(w_{n-1}, w_n)\) | the | cat | sat | dog |
|---|---|---|---|---|
| the | 0 | 42 | 3 | 18 |
| cat | 1 | 0 | 25 | 0 |
| sat | 5 | 0 | 0 | 0 |
| dog | 0 | 0 | 12 | 0 |
| \(C(w_{n-1}, w_n) + 1\) | the | cat | sat | dog |
|---|---|---|---|---|
| the | 1 | 43 | 4 | 19 |
| cat | 2 | 1 | 26 | 1 |
| sat | 6 | 1 | 1 | 1 |
| dog | 1 | 1 | 13 | 1 |
| Add 1 | the | cat | sat | dog |
|---|---|---|---|---|
| the | +1 | +1 | +1 | +1 |
| cat | +1 | +1 | +1 | +1 |
| sat | +1 | +1 | +1 | +1 |
| dog | +1 | +1 | +1 | +1 |
\(P_{\text{MLE}}(w \mid \text{the})\), from the unsmoothed counts:

| the | cat | sat | dog |
|---|---|---|---|
| 0 | .67 | .05 | .29 |

\(P_{\text{Laplace}}(w \mid \text{the})\), from the add-one counts:

| the | cat | sat | dog |
|---|---|---|---|
| .01 | .64 | .06 | .28 |
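A quick check of the two probability rows above from the count row for "the" (V = 4 in this toy vocabulary):

```python
counts = {"the": 0, "cat": 42, "sat": 3, "dog": 18}    # C(the, w) from the first table
V = len(counts)
total = sum(counts.values())                            # C(the) = 63

mle     = {w: c / total for w, c in counts.items()}
laplace = {w: (c + 1) / (total + V) for w, c in counts.items()}

print({w: round(p, 2) for w, p in mle.items()})      # {'the': 0.0, 'cat': 0.67, 'sat': 0.05, 'dog': 0.29}
print({w: round(p, 2) for w, p in laplace.items()})  # {'the': 0.01, 'cat': 0.64, 'sat': 0.06, 'dog': 0.28}
```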
| Structure | Memory efficiency | Lookup | Best For |
|---|---|---|---|
| Trie | High (prefix sharing) | \(O(n)\), where \(n\) is the n-gram length | Prefix queries, smoothing |
| Hash Table | Depends on load factor | \(O(1)\) avg | Flat access, fixed n |
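A minimal trie sketch: each path spells out an n-gram word by word, prefixes are shared, and incrementing counts along the path gives context counts for free (illustrative only, not memory-optimized):

```python
class NGramTrie:
    def __init__(self):
        self.children = {}
        self.count = 0

    def add(self, ngram):
        """Increment the count at every node on the path, so prefixes
        (contexts) accumulate counts as well as full n-grams."""
        node = self
        for w in ngram:
            node = node.children.setdefault(w, NGramTrie())
            node.count += 1

    def count_of(self, ngram):
        node = self
        for w in ngram:
            node = node.children.get(w)
            if node is None:
                return 0
        return node.count

trie = NGramTrie()
trie.add(("the", "cat", "sat"))
trie.add(("the", "cat", "slept"))
print(trie.count_of(("the", "cat", "sat")))  # 1
print(trie.count_of(("the", "cat")))         # 2  (shared prefix)
print(trie.count_of(("the", "dog")))         # 0
```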
Concept Check
When would you prefer a trie over a hash table for n-gram storage? When would hash tables be better?
Concept Check
Your phone suggests “you” after “thank.” Why might it not suggest “quantum”? What does this reveal about n-gram predictions?
\[ P(y) = \prod_{t=1}^{T} P(y_t \mid y_{t-n+1}, \ldots, y_{t-1}) \]
\[ \text{Score} = \lambda_1 \log P_{\text{align}}(y \mid x) + \lambda_2 \log P_{\text{LM}}(y) \]
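A sketch of that log-linear combination (the weights and the two probability functions are hypothetical placeholders):

```python
import math

def decoder_score(y, x, p_align, p_lm, lam_align=1.0, lam_lm=1.0):
    """Score = lam_align * log P_align(y | x) + lam_lm * log P_LM(y)."""
    return lam_align * math.log(p_align(y, x)) + lam_lm * math.log(p_lm(y))
```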
\[ \mathrm{Perplexity}(M) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P_M(w_i \mid w_{1:i-1})} \]
\[P(W) = P(w_1, w_2, \ldots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1})\]
\[\log P(W) = \sum_{i=1}^{N} \log P(w_i \mid w_{1:i-1})\]
\[\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{1:i-1})\]
\[H(W) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{1:i-1})\]
\[\text{PP}(W) = 2^{H(W)} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{1:i-1})}\]
\[\text{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{1:i-1})}}\]
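Perplexity computed from per-token log probabilities (a sketch; the base-2 log-probability list is an assumed input format):

```python
import math

def perplexity(log2_probs):
    """PP(W) = 2 ** (-(1/N) * sum_i log2 P(w_i | w_{1:i-1}))."""
    N = len(log2_probs)
    return 2 ** (-sum(log2_probs) / N)

# A model that assigns every token probability 1/50 has perplexity 50.
print(perplexity([math.log2(1 / 50)] * 100))   # ≈ 50.0
```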
Intrinsic Evaluation: score the language model directly on held-out text (e.g., perplexity).

Extrinsic Evaluation: measure how much the model improves a downstream system (e.g., a decoder that uses the LM as in the scoring function above).
| Model | Penn Treebank Perplexity |
|---|---|
| Trigram | ~140 |
| LSTM | ~80 |
| Transformer | ~20 |
Concept Check
A model has perplexity 50 on news text but perplexity 200 on social media. What might explain this? Is the model “bad”?