2026-01-27
Content derived from: J&M Ch. 5
Important
\[ \text{contexts}(w_1) \approx \text{contexts}(w_2) \implies \text{meaning}(w_1) \approx \text{meaning}(w_2) \]
\[ \text{sim}(w_1, w_2) = \cos(\theta) = \frac{\mathbf{v}_{w_1} \cdot \mathbf{v}_{w_2}}{\|\mathbf{v}_{w_1}\| \|\mathbf{v}_{w_2}\|} \]
\[\vec{v}_{\text{grape}} - \vec{v}_{\text{vine}} + \vec{v}_{\text{tree}} \approx \vec{v}_{\text{apple}}\]
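A minimal sketch of both formulas above, using made-up 3-d toy vectors chosen only to illustrate the arithmetic (real embeddings are 100-300 dimensional):

```python
import numpy as np

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical toy vectors (assumed values, not from any trained model)
vec = {
    "grape": np.array([0.9, 0.1, 0.3]),
    "vine":  np.array([0.8, 0.2, 0.1]),
    "tree":  np.array([0.1, 0.9, 0.1]),
    "apple": np.array([0.3, 0.8, 0.4]),
}

print(cosine(vec["grape"], vec["vine"]))   # high similarity (~0.97 for these toys)

# Analogy: grape - vine + tree ≈ apple (the usual "3CosAdd" parallelogram query,
# excluding the three query words from the candidate set)
query = vec["grape"] - vec["vine"] + vec["tree"]
best = max((w for w in vec if w not in {"grape", "vine", "tree"}),
           key=lambda w: cosine(query, vec[w]))
print(best)   # "apple" on this toy vocabulary
```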
Why this matters:
\[\text{TF-IDF}(w, d) = \text{TF}(w, d) \times \log\frac{|D|}{\text{DF}(w)}\]
Limitation: TF-IDF captures document-level topicality, not fine-grained word similarity
| Word | Doc 1 (wine review) | Doc 2 (botany text) | Doc 3 (recipe) |
|---|---|---|---|
| grape | 15 | 8 | 3 |
| wine | 25 | 2 | 5 |
| tree | 0 | 18 | 1 |
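A sketch applying the TF-IDF formula above to the toy term-document counts in this table. Raw counts as TF and a base-10 log for IDF are one common convention (J&M also describe a log-scaled TF variant):

```python
import math

counts = {                      # word -> counts in [Doc 1, Doc 2, Doc 3]
    "grape": [15, 8, 3],
    "wine":  [25, 2, 5],
    "tree":  [0, 18, 1],
}
num_docs = 3

for word, row in counts.items():
    df = sum(1 for c in row if c > 0)        # document frequency
    idf = math.log10(num_docs / df)
    tfidf = [tf * idf for tf in row]
    print(word, [round(x, 2) for x in tfidf])

# "grape" and "wine" occur in every document, so their IDF = 0 and their
# TF-IDF weights vanish: IDF only rewards terms that discriminate between
# documents, which is exactly the document-level topicality noted above.
```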
| Window | Context words | Captures |
|---|---|---|
| ±1 | fresh, juice | Syntactic neighbors |
| ±5 | The, fresh, juice, tastes, great | Topical neighbors |
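A sketch of context extraction at different window sizes, using a made-up example sentence consistent with the table above:

```python
def context(tokens, i, window):
    # Context of tokens[i]: up to `window` tokens on each side, excluding the target
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

tokens = ["The", "fresh", "grape", "juice", "tastes", "great"]
target = tokens.index("grape")

print(context(tokens, target, 1))  # ['fresh', 'juice']                         -> syntactic neighbors
print(context(tokens, target, 5))  # ['The', 'fresh', 'juice', 'tastes', 'great'] -> topical neighbors
```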
Pointwise Mutual Information: How much more often do words co-occur than expected?
\[\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w) \cdot P(c)}\]
Worked example
Corpus: 1000 word pairs
\(P(\text{grape}, \text{wine}) = 8/1000 = 0.008\)
\(P(\text{grape}) \times P(\text{wine}) = 0.02 \times 0.05 = 0.001\)
\(\text{PMI} = \log_2(0.008/0.001) = \log_2(8) = 3\) bits
Interpretation: 8× more likely than chance → strong association!
Problem with PMI: If \(P(w,c) = 0\), then \(\text{PMI} = -\infty\)
Solution: Positive PMI (PPMI) — clip negative values to zero
\[\text{PPMI}(w, c) = \max(0, \text{PMI}(w, c))\]
Why clip at zero? Negative PMI (a pair co-occurring *less* often than chance) is very hard to estimate reliably without an enormous corpus, so PPMI simply discards that evidence.
PPMI Matrix
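A minimal PPMI-matrix sketch over a small word-context count matrix (toy counts and context words, chosen only for illustration):

```python
import numpy as np

words    = ["grape", "wine", "tree"]
contexts = ["drink", "fruit", "leaf"]
C = np.array([[ 2., 6., 1.],      # co-occurrence counts count(w, c)
              [10., 2., 0.],
              [ 0., 3., 7.]])

total = C.sum()
P_wc = C / total                          # joint P(w, c)
P_w  = P_wc.sum(axis=1, keepdims=True)    # marginal P(w)
P_c  = P_wc.sum(axis=0, keepdims=True)    # marginal P(c)

with np.errstate(divide="ignore"):        # log2(0) = -inf for zero counts...
    pmi = np.log2(P_wc / (P_w * P_c))
ppmi = np.maximum(pmi, 0)                 # ...which PPMI clips to 0

print(np.round(ppmi, 2))
```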
Key insight: Mikolov et al. (2013) introduced a family of methods, not one algorithm
Skip-gram core idea: given the target word, predict its context words
CBOW core idea: given the context words, predict the target word
| Aspect | CBOW | Skip-gram (SGNS) |
|---|---|---|
| Input | Context words | Target word |
| Output | Target word | Context words |
| Speed | Faster (one prediction per window) | Slower (one prediction per context word) |
| Rare words | Worse | Better |
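A minimal sketch of one SGNS update, under simplifying assumptions (toy sizes, uniform negative sampling, no subsampling; the original Mikolov et al. pipeline uses a unigram^0.75 noise distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                        # vocab size, embedding dim, negatives per pair
W  = rng.normal(scale=0.1, size=(V, d))      # target-word embeddings
Wp = rng.normal(scale=0.1, size=(V, d))      # context-word embeddings ("W prime")

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context, lr=0.025):
    # Positive pair: push sigma(w . c) toward 1; sampled negatives: toward 0
    negatives = rng.integers(0, V, size=k)
    w = W[target]
    grad_w = np.zeros(d)
    for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(w @ Wp[c]) - label       # gradient of the logistic loss wrt the dot product
        grad_w += g * Wp[c]                  # accumulate before touching Wp[c]
        Wp[c]  -= lr * g * w                 # update context embedding
    W[target] -= lr * grad_w                 # update target embedding once per step

sgns_step(target=42, context=17)             # one (target, context) pair drawn from the corpus
```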
Levy & Goldberg (2014): A Landmark Discovery
\[\text{SGNS implicitly factorizes:} \quad \mathbf{W} \cdot \mathbf{W'}^T \approx \text{PMI}(w,c) - \log k\]
What this means: prediction-based SGNS and count-based PMI matrix factorization are two routes to essentially the same objective (see the comparison below).
| Method | What it captures | Matrix factorized | Training |
|---|---|---|---|
| TF-IDF | Document-level topicality | Weighted term-doc | Direct computation |
| PMI/PPMI + SVD | Word co-occurrence strength | PPMI matrix | Count → SVD |
| Word2Vec (SGNS) | Shifted PMI (implicitly) | \(\text{PMI} - \log k\) | SGD prediction |
| GloVe | Log co-occurrence (explicit) | \(\log X\) weighted | SGD regression |
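A sketch of the count-based route in the "PMI/PPMI + SVD" row above: shift and re-clip the PPMI matrix (equivalent to \(\max(\text{PMI} - \log_2 k, 0)\) for a non-negative shift), then take a truncated SVD to get dense vectors. `ppmi` is assumed to be the word-context PPMI matrix from the earlier sketch.

```python
import numpy as np

def svd_embeddings(ppmi, dim, shift=0.0):
    # Shifted PPMI, with shift = log2(k) mirroring SGNS's k negative samples
    M = np.maximum(ppmi - shift, 0.0)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    # Keeping U * sqrt(S) rather than U * S is a common, empirically
    # better-behaved choice (Levy & Goldberg, 2015).
    return U[:, :dim] * np.sqrt(S[:dim])

# e.g. 2-d vectors from the 3x3 toy PPMI matrix built earlier:
# embeddings = svd_embeddings(ppmi, dim=2, shift=np.log2(5))
```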
GloVe’s insight: If SGNS implicitly factorizes PMI, why not do it explicitly?
- Word2Vec (SGNS): implicit factorization, learned by SGD on a prediction task
- GloVe: explicit factorization, learned by SGD regression on weighted log co-occurrence counts
Result: Similar embeddings, different training dynamics
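A sketch of GloVe's explicit objective: weighted least squares on log co-occurrence counts (Pennington et al., 2014). Toy, non-vectorized version; real GloVe trains the vectors and biases with AdaGrad.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Down-weights rare co-occurrences, caps the weight of very frequent ones
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, W, Wc, b, bc):
    # X: co-occurrence counts; W, Wc: word / context vectors; b, bc: biases
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):             # only observed pairs contribute
        err = W[i] @ Wc[j] + b[i] + bc[j] - np.log(X[i, j])
        loss += glove_weight(X[i, j]) * err ** 2
    return loss
```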
| | Static (Word2Vec, GloVe) | Contextualized (BERT) |
|---|---|---|
| Mapping | \(f: V \to \mathbb{R}^d\) | \(f: (w, C) \to \mathbb{R}^d\) |
| “The bank was steep” | → [0.2, -0.5, …] | → [0.3, 0.1, …] |
| “The bank was closed” | → [0.2, -0.5, …] | → [-0.2, 0.4, …] |
| | Same vector! | Different vectors! |
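A sketch of the contrast above, assuming the HuggingFace `transformers` package and the `bert-base-uncased` checkpoint are available (the vector values in the table are illustrative; actual numbers will differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
bank_id = tokenizer.convert_tokens_to_ids("bank")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    idx = inputs["input_ids"][0].tolist().index(bank_id)       # position of "bank"
    return hidden[idx]

v1 = bank_vector("The bank was steep")
v2 = bank_vector("The bank was closed")
print(torch.cosine_similarity(v1, v2, dim=0))   # < 1: "bank" gets different vectors
```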
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
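A minimal numpy sketch of the scaled dot-product attention formula above (random toy matrices, no masking or multiple heads):

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted average of the values

Q = np.random.randn(3, 4); K = np.random.randn(5, 4); V = np.random.randn(5, 4)
print(attention(Q, K, V).shape)   # (3, 4)
```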
Concept Check
Why might word analogy tasks (grape - vine + tree = apple) work BETTER with static embeddings than contextualized ones?
| Word Pair | WordSim-353 (relatedness) | SimLex-999 (similarity) |
|---|---|---|
| car - gasoline | HIGH | LOW |
| coffee - cup | HIGH | LOW |
| car - automobile | HIGH | HIGH |
| Analogy query | Typical result |
|---|---|
| doctor - man + woman = ? | Often returns "nurse" (bias!) |
| Paris - France + Japan = ? | Sometimes "Tokyo", often noise |
| bigger - big + small = ? | Rarely returns "smaller" |
| Task | Without pretrained | With Word2Vec | With BERT |
|---|---|---|---|
| Sentiment | 78% | 84% | 93% |
| NER | 81% | 88% | 95% |
| Question Answering | 65% | 72% | 89% |
\[s(X, Y, A, B) = \frac{1}{|X|} \sum_{x \in X} s(x, A, B) - \frac{1}{|Y|} \sum_{y \in Y} s(y, A, B)\]
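A sketch of the WEAT-style score above (Caliskan et al., 2017): X, Y are target word sets (e.g. occupations), A, B are attribute sets (e.g. male/female terms), and \(s(x, A, B)\) is the difference in mean cosine similarity. `vec` is an assumed dict mapping words to embedding vectors.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def s_word(x, A, B, vec):
    # s(x, A, B): how much closer x sits to attribute set A than to B
    return (np.mean([cosine(vec[x], vec[a]) for a in A])
            - np.mean([cosine(vec[x], vec[b]) for b in B]))

def weat_score(X, Y, A, B, vec):
    # s(X, Y, A, B): mean association of X with (A vs. B), minus the same for Y
    return (np.mean([s_word(x, A, B, vec) for x in X])
            - np.mean([s_word(y, A, B, vec) for y in Y]))
```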
Discussion
If occupation words like “engineer” are closer to “man” than “woman” in embedding space:
Key takeaway: Vector semantics transforms meaning into geometry—powerful but imperfect.
CSE 447/517 26wi - NLP