Noah Smith will give a guest lecture on February 3rd; please attend in person!
A1 will be released Thursday; plan ahead.
Sources
Content derived from: J&M Ch. 4
Part 1: Foundations of Text Classification
Text classification assigns predefined categories to text using supervised learning. (1/5)
Let \(x\) denote an input text (e.g., document, sentence), and \(y\) a discrete label.
The classification function \(f_\theta(x): \mathcal{X} \rightarrow \mathcal{Y}\) is learned from labeled data \(\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^N\).
Text classification assigns predefined categories to text using supervised learning. (2/5)
Documents are commonly represented with TF-IDF features: \[\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),\] where \(\mathrm{tf}(t, d) = \frac{c(t, d)}{\sum_{t'} c(t', d)}\) and \(\mathrm{idf}(t) = \log \frac{N}{n_t}\), with \(N\) total documents and \(n_t\) the number of documents containing \(t\).
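A minimal Python sketch of these definitions on a toy corpus; note that library implementations (e.g., scikit-learn's TfidfVectorizer) use smoothed variants of idf, so the values here follow the formula above rather than any particular library default.

```python
from collections import Counter
import math

# Toy corpus: each document is a list of tokens (assumed pre-tokenized).
docs = [
    ["the", "car", "was", "fast"],
    ["the", "movie", "was", "great"],
    ["a", "great", "fast", "car"],
]

N = len(docs)
# n_t: number of documents containing term t
doc_freq = Counter(t for d in docs for t in set(d))

def tfidf(term, doc):
    counts = Counter(doc)
    tf = counts[term] / sum(counts.values())   # tf(t, d) = c(t, d) / sum_t' c(t', d)
    idf = math.log(N / doc_freq[term])         # idf(t) = log(N / n_t)
    return tf * idf

print(tfidf("car", docs[0]))    # appears in 2 of 3 docs -> moderate weight
print(tfidf("movie", docs[1]))  # appears in 1 of 3 docs -> higher idf, higher weight
```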
Discriminative models directly model P(y|x), focusing on decision boundaries between classes. (1/4)
Discriminative models directly estimate conditional probability \(P(y|x)\), emphasizing decision boundaries.
The model focuses on learning the mapping from features \(x\) to labels \(y\), rather than modeling \(P(x)\) or \(P(x, y)\).
Contrasts with generative models, which require explicit modeling of the joint distribution \(P(x, y)\) or the marginal \(P(x)\).
Inductive bias is centered on maximizing separation between classes in feature space.
Discriminative models directly model P(y|x), focusing on decision boundaries between classes. (2/4)
Logistic regression leverages feature vectors \(x\) and weight parameters \(w\) to model \(P(y=1|x)\) via the sigmoid activation.
The model computes:
\[
P(y=1|x) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}
\]
Logistic regression computes a weighted sum (logit) and applies a sigmoid.
%%{init: {'theme': 'base', 'themeVariables': {'lineColor': '#2f2f2f', 'textColor': '#111111', 'primaryBorderColor': '#2f2f2f', 'fontSize': '18px'}}}%%
flowchart LR
X["$$x = [x_1, x_2, x_3]$$"]
W["$$w = [w_1, w_2, w_3]$$"]
B["$$b$$"]
Z["$$z = w^\\top x + b$$"]
S["$$\\hat{y} = \\sigma(z)$$"]
P["$$P(y=1\\mid x)=\\hat{y}$$"]
X --> Z
W --> Z
B --> Z
Z --> S --> P
classDef input fill:#e9f2ff,stroke:#1f4e79,stroke-width:1.2px,color:#0b1f33;
classDef compute fill:#f3f7ff,stroke:#1f4e79,stroke-width:1.2px,color:#0b1f33;
classDef output fill:#fff1e6,stroke:#8a3b12,stroke-width:1.2px,color:#3b1a06;
class X,W,B input;
class Z,S compute;
class P output;
Discriminative models directly model P(y|x), focusing on decision boundaries between classes. (3/4)
Training involves optimizing the weights \(w\) and bias \(b\) to minimize the cross-entropy loss:
\[
\mathcal{L}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right], \qquad \hat{y} = \sigma(w \cdot x + b)
\]
This is exactly the binary cross-entropy (logistic) loss.
Iris dataset: binary classification
Classic Iris measurements (sepal/petal lengths) with logistic regression classifying setosa vs. versicolor using a 2D decision boundary.
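A small scikit-learn sketch of this setup; it assumes the bundled Iris data, keeping only setosa vs. versicolor and the two length features so the decision boundary is 2D.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Keep only setosa (0) vs. versicolor (1) and two features:
# sepal length (col 0) and petal length (col 2).
X, y = load_iris(return_X_y=True)
mask = y < 2
X2, y2 = X[mask][:, [0, 2]], y[mask]

X_tr, X_te, y_tr, y_te = train_test_split(X2, y2, test_size=0.3, random_state=0)

clf = LogisticRegression()   # minimizes cross-entropy (log loss) over w, b
clf.fit(X_tr, y_tr)
print("weights:", clf.coef_, "bias:", clf.intercept_)
print("test accuracy:", clf.score(X_te, y_te))
```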
Discriminative models directly model P(y|x), focusing on decision boundaries between classes. (4/4)
Discriminative approaches enable robust text classification by allowing targeted feature engineering and direct optimization for accuracy.
Feature engineering can encode linguistic, lexical, or syntactic cues (e.g., word presence, n-grams, TF-IDF scores).
Empirical performance improves as features are tailored to the structure and nuances of text data.
Example: In sentiment classification, features such as polarity lexicon counts or phrase patterns can be incorporated to improve \(P(y|x)\) estimation.
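An illustrative sketch of lexicon-count features; the word lists below are a tiny hypothetical lexicon, not a real sentiment resource.

```python
# Minimal sketch of lexicon-count features for sentiment classification.
# The word sets below are a tiny hypothetical lexicon, for illustration only.
POSITIVE = {"great", "excellent", "love", "enjoyable"}
NEGATIVE = {"terrible", "boring", "hate", "awful"}

def sentiment_features(tokens):
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    # Feature vector: [positive-word count, negative-word count, document length]
    return [pos, neg, len(tokens)]

print(sentiment_features("i love this great but slightly boring film".split()))
# -> [2, 1, 8]
```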
Binary logistic regression models the probability of a binary outcome using the sigmoid function. (1/7)
For input \(\mathbf{x} \in \mathbb{R}^d\), the model defines the probability of class \(y \in \{0,1\}\) as:
\[
P(y=1|\mathbf{x}; \mathbf{w}, b) = \sigma(\mathbf{w}^\top \mathbf{x} + b)
\]
where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid activation.
Binary logistic regression models the probability of a binary outcome using the sigmoid function. (2/7)
Intuition: The sigmoid maps real-valued scores to \([0,1]\), enabling probabilistic interpretation for binary classification.
Applications:
Text sentiment classification (positive/negative)
Spam detection (spam/not spam)
Medical diagnosis (disease/no disease)
Binary logistic regression models the probability of a binary outcome using the sigmoid function. (3/7)
The model uses cross-entropy loss and optimizes parameters via (stochastic) gradient descent.
The cross-entropy loss for a single data point is:
\[
\mathcal{L}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right], \qquad \hat{y} = \sigma(\mathbf{w}^\top \mathbf{x} + b)
\]
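A minimal NumPy sketch of this loss together with one SGD update; the gradient used below is the standard result \(\nabla_{\mathbf{w}} \mathcal{L} = (\hat{y} - y)\,\mathbf{x}\) and \(\partial \mathcal{L}/\partial b = \hat{y} - y\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_hat, y):
    # L(y_hat, y) = -[ y log y_hat + (1 - y) log(1 - y_hat) ]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def sgd_step(w, b, x, y, lr=0.1):
    # One stochastic gradient descent update on a single example.
    y_hat = sigmoid(w @ x + b)
    grad = y_hat - y                     # dL/dz; dL/dw = grad * x, dL/db = grad
    return w - lr * grad * x, b - lr * grad

w, b = np.zeros(3), 0.0
x, y = np.array([1.0, 0.5, -0.2]), 1
print("loss before:", bce_loss(sigmoid(w @ x + b), y))
w, b = sgd_step(w, b, x, y)
print("loss after: ", bce_loss(sigmoid(w @ x + b), y))
```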
Binary logistic regression models the probability of a binary outcome using the sigmoid function. (5/7)
Binary logistic regression models the probability of a binary outcome using the sigmoid function. (6/7)
Stochastic gradient descent (SGD) updates parameters using individual examples, which keeps training efficient on large datasets.
This enables effective classification and sets the stage for regularization to prevent overfitting.
Logistic regression provides probabilistic outputs, interpretable coefficients, and a convex loss surface, facilitating robust training.
Overfitting can occur, especially with high-dimensional data; regularization (e.g., L1, L2 penalties) mitigates this by constraining parameter magnitudes.
Binary logistic regression models the probability of a binary outcome using the sigmoid function. (7/7)
Next:
We will examine regularization strategies and their effect on generalization in logistic regression.
Regularization penalties prevent overfitting by constraining parameter magnitudes. (1/3)
Regularization adds a penalty term to the loss function to discourage large parameter values.
MAP estimation with a Laplace prior on the weights yields the L1 penalty \(\lambda \|\mathbf{w}\|_1\); a Gaussian prior yields the L2 penalty \(\lambda \|\mathbf{w}\|_2^2\).
Regularization penalties prevent overfitting by constraining parameter magnitudes. (3/3)
L1 regularization promotes sparsity by setting many weights to zero, enabling feature selection.
L2 regularization shrinks weights uniformly, improving generalization without feature selection.
Practical guidance:
Use L2 (Ridge) for dense feature spaces or when all features may be informative.
Use L1 (Lasso) when feature selection is desired or the feature space is sparse.
Elastic Net combines L1 and L2 for balanced regularization.
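A scikit-learn sketch on synthetic data illustrating the L1/L2 sparsity contrast above; note the solver choice matters (liblinear or saga is needed for the L1 penalty), and the data here is generated only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first 3 of 20 features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] - X[:, 2] > 0).astype(int)

# C is the inverse regularization strength (C = 1 / lambda).
l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print("L2 nonzero weights:", np.sum(l2.coef_ != 0))   # typically all 20 (shrunk, not zeroed)
print("L1 nonzero weights:", np.sum(l1.coef_ != 0))   # typically far fewer (sparsity)
```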
Multiclass logistic regression can be done via one-vs-rest or softmax approaches. (1/3)
Multiclass logistic regression can be performed with either the one-vs-rest or the softmax approach.
In the one-vs-rest (OvR) strategy, \(K\) binary classifiers are trained, one per class, each distinguishing one class from all others.
For class \(k\), the classifier computes \(P(y = k \mid \mathbf{x}) = \sigma(\mathbf{w}_k^\top \mathbf{x} + b_k)\)
The predicted class is \(\arg\max_k P(y = k \mid \mathbf{x})\).
The softmax approach generalizes logistic regression to multiple classes by modeling all classes jointly.
Multiclass logistic regression can be done via one-vs-rest or softmax approaches. (2/3)
For \(K\) classes, the probability of class \(k\) is: \[
P(y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^\top \mathbf{x} + b_j)}
\]
The predicted class is again \(\arg\max_k P(y = k \mid \mathbf{x})\).
Both approaches use the cross-entropy loss, but the softmax formulation yields a single, vector-valued gradient, while OvR involves \(K\) separate binary losses.
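A small NumPy sketch contrasting the two formulations on the same scores \(z_k\); the parameters here are random placeholders, not trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 4
W, b = rng.normal(size=(K, d)), rng.normal(size=K)   # placeholder parameters, not trained
x = rng.normal(size=d)

z = W @ x + b                        # per-class scores z_k = w_k^T x + b_k

# One-vs-rest: K independent sigmoids; the scores need not sum to 1.
p_ovr = 1.0 / (1.0 + np.exp(-z))

# Softmax: shared normalization; the probabilities sum to 1.
e = np.exp(z - z.max())
p_soft = e / e.sum()

print("OvR:    ", p_ovr, " sum =", p_ovr.sum())
print("softmax:", p_soft, " sum =", p_soft.sum())
print("predicted class (both):", int(np.argmax(z)))  # both argmaxes coincide with argmax of z
```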
Multiclass logistic regression can be done via one-vs-rest or softmax approaches. (3/3)
Applications:
Text classification with more than two categories (e.g., topic or sentiment classification).
Part-of-speech tagging, where each word must be assigned to one of many possible tags.
Multiclass logistic regression: OvR vs. softmax (diagram)
%%{init: {'theme': 'base', 'themeVariables': {'lineColor': '#2f2f2f', 'textColor': '#111111', 'primaryBorderColor': '#2f2f2f', 'fontSize': '18px'}}}%%
flowchart LR
X["$$x \\in \\mathbb{R}^3$$"]
subgraph A["OvR (K=3 sigmoids)"]
direction LR
O1["$$z_k = w_k^\\top x + b_k,\\ k\\in\\{1,2,3\\}$$"] --> O2["$$\\hat{p}_k = \\sigma(z_k)$$"] --> O3["$$\\hat{y} = \\arg\\max_k \\hat{p}_k$$"]
end
subgraph B["Softmax (shared norm.)"]
direction LR
S1["$$z_k = w_k^\\top x + b_k,\\ k\\in\\{1,2,3\\}$$"] --> S2["$$\\hat{p}_k = \\dfrac{e^{z_k}}{\\sum_{j=1}^{3} e^{z_j}}$$"] --> S3["$$\\hat{y} = \\arg\\max_k \\hat{p}_k$$"]
end
X --> O1
X --> S1
classDef ovr fill:#e9f2ff,stroke:#1f4e79,stroke-width:1.2px,color:#0b1f33;
classDef soft fill:#fff1e6,stroke:#8a3b12,stroke-width:1.2px,color:#3b1a06;
class X,O1,O2,O3 ovr;
class S1,S2,S3 soft;
style A fill:#f3f7ff,stroke:#1f4e79,color:#0b1f33;
style B fill:#fff5ed,stroke:#8a3b12,color:#3b1a06;
Takeaways:
OvR is simpler to train with binary solvers and allows per-class thresholds.
Softmax provides a single, normalized probability distribution across classes.
Part 3: Statistical and Experimental Considerations
Statistical significance testing is essential for validating NLP experiment results. (1/8)
Example: a logistic regression classifier trained on a toy spam/ham dataset.
We evaluate predictions with a confusion matrix before running significance tests.
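A minimal sketch of this setup; the spam/ham texts below are made up for illustration, and the confusion matrix is computed on the training data just to show the format.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Tiny made-up spam/ham corpus, for illustration only.
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash click here", "lunch with the team",
         "claim your free prize", "project update attached"]
labels = [1, 0, 1, 0, 1, 0]   # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Rows = true labels, columns = predicted labels (evaluated on the training data).
print(confusion_matrix(labels, clf.predict(X)))
```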
Statistical significance testing is essential for validating NLP experiment results. (2/8)
Statistical hypothesis testing quantifies whether observed performance differences are likely due to chance.
Null hypothesis \(H_0\): No difference between systems’ true performance.
\(p\)-value: Probability of observing results at least as extreme as those measured, assuming \(H_0\) is true.
Statistical significance testing is essential for validating NLP experiment results. (3/8)
In NLP, model evaluation metrics (e.g., accuracy, F1) are subject to sampling noise.
Random train/test splits and annotation errors introduce variance.
Without significance testing, small metric improvements may be spurious.
Example:
Comparing two classifiers with 80.2% vs. 80.7% accuracy on a test set of size \(N\).
Is the 0.5% difference meaningful, or within random variation?
Statistical significance testing is essential for validating NLP experiment results. (4/8)
Methods like bootstrap confidence intervals and tests across datasets assess result reliability.
The bootstrap estimates confidence intervals by repeatedly resampling the test set:
\[
\text{For } b = 1, \dots, B: \quad \text{sample } N \text{ test examples with replacement to create } D_b, \text{ then compute the metric } m_b \text{ on } D_b.
\]
The 95% confidence interval is then given by the 2.5th and 97.5th percentiles of \(\{m_1, \dots, m_B\}\).
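A percentile-bootstrap sketch in NumPy; the labels and predictions below are toy arrays standing in for real test-set outputs.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a test-set metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # resample the test set with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Toy example: accuracy of illustrative predictions on an illustrative test set.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0] * 20)
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0] * 20)
accuracy = lambda t, p: np.mean(t == p)
print(bootstrap_ci(y_true, y_pred, accuracy))
```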
Statistical significance testing is essential for validating NLP experiment results. (5/8)
Statistical significance testing is essential for validating NLP experiment results. (6/8)
Significance testing across datasets (e.g., paired \(t\)-test, approximate randomization) accounts for correlation and variance:
Paired \(t\)-test: Compare metric differences per example across systems.
Randomization: shuffle (swap) the paired system outputs to simulate the null hypothesis (see the sketch after this list).
Application:
Dror et al. (2017) recommend testing across multiple datasets for robustness.
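A sketch of the paired approximate randomization test on per-example correctness vectors; the data is toy, and under \(H_0\) the two systems' outputs are treated as exchangeable.

```python
import numpy as np

def paired_randomization_test(correct_a, correct_b, R=10000, seed=0):
    """Approximate randomization test for the accuracy difference of two systems
    evaluated on the same test examples (1 = correct, 0 = incorrect)."""
    rng = np.random.default_rng(seed)
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    observed = abs(correct_a.mean() - correct_b.mean())
    count = 0
    for _ in range(R):
        swap = rng.random(len(correct_a)) < 0.5      # under H0, paired outputs are exchangeable
        a = np.where(swap, correct_b, correct_a)
        b = np.where(swap, correct_a, correct_b)
        if abs(a.mean() - b.mean()) >= observed:
            count += 1
    return (count + 1) / (R + 1)                     # two-sided p-value

# Toy per-example correctness vectors for systems A and B.
a = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1] * 30)
b = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 1] * 30)
print("p-value:", paired_randomization_test(a, b))
```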
Statistical significance testing is essential for validating NLP experiment results. (7/8)
Proper significance reporting ensures replicability and trust in classification experiments.
Reporting standards include:
Declaring test set size, number of runs, and test statistic used.
Reporting confidence intervals, not just point estimates.
Statistical significance testing is essential for validating NLP experiment results. (8/8)
The replicability crisis in NLP highlights the necessity of statistical rigor.
Example reporting statement:
“System A outperforms System B on F1 (\(p = 0.03\), 95% CI: [0.02, 0.08]) across 10 datasets.”
Bootstrap comparison: spam/ham text models
Model A: Bag-of-words counts + L2 logistic regression.
Model B: TF-IDF unigrams + L2 logistic regression.
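A sketch of this comparison; the corpus below is a made-up placeholder for the toy spam/ham data, and both models use scikit-learn's default L2-regularized logistic regression.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder corpus standing in for the toy spam/ham data (1 = spam, 0 = ham).
texts = ["win a free prize now", "meeting at noon tomorrow", "free cash click here",
         "lunch with the team", "claim your free prize", "project update attached",
         "cheap meds free offer", "agenda for the review", "you won a free phone",
         "notes from the standup"] * 10
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(np.array(texts), labels,
                                          test_size=0.3, random_state=0)

def fit_predict(vectorizer):
    V = vectorizer.fit(X_tr)
    clf = LogisticRegression().fit(V.transform(X_tr), y_tr)   # L2 penalty by default
    return clf.predict(V.transform(X_te))

pred_a = fit_predict(CountVectorizer())   # Model A: bag-of-words counts
pred_b = fit_predict(TfidfVectorizer())   # Model B: TF-IDF unigrams

# Bootstrap the accuracy difference (A - B) over resampled test examples.
rng = np.random.default_rng(0)
diffs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_te), size=len(y_te))
    diffs.append(np.mean(pred_a[idx] == y_te[idx]) - np.mean(pred_b[idx] == y_te[idx]))
print("95% CI for accuracy difference:", np.percentile(diffs, [2.5, 97.5]))
```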
Part 4: Case Study: 20 Newsgroups Classification
20 Newsgroups: classification task
Predict the discussion group label from the post text.
Usenet posts from 20 topical forums (sports, politics, tech, religion).
20 categories, balanced enough that accuracy is meaningful.
We strip headers/footers/quotes to focus on content.
Dataset overview
Train/test splits come from scikit-learn’s fetch_20newsgroups.
Each example is a short, noisy, user-generated post.
Train size: 11314
Test size: 7532
Classes: 20
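A minimal scikit-learn sketch reproducing this setup; fetch_20newsgroups downloads the data on first use, and accuracy will vary with preprocessing choices.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Load the canonical train/test splits, stripping headers/footers/quotes
# so the classifier focuses on the post content.
remove = ("headers", "footers", "quotes")
train = fetch_20newsgroups(subset="train", remove=remove)
test = fetch_20newsgroups(subset="test", remove=remove)

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train.data)
X_te = vec.transform(test.data)

clf = LogisticRegression(max_iter=1000).fit(X_tr, train.target)
print("train size:", X_tr.shape[0], "test size:", X_te.shape[0])
print("test accuracy:", clf.score(X_te, test.target))
```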
Example posts (truncated)
Look for topical keywords that hint at the group label.
[rec.autos] I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a...
[comp.sys.mac.hardware] --
[comp.graphics] Hello, I am looking to add voice input capability to a user interface I am developing on an HP730 (UNIX) workstation. I would greatly appreciate information anyone would care to...
Bigram example: phrase cues
Bigrams capture short phrases (e.g., “space shuttle”, “power supply”).
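A small sketch of unigram + bigram extraction with scikit-learn's CountVectorizer (get_feature_names_out requires scikit-learn >= 1.0).

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words and adjacent word pairs as features.
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(["the space shuttle launched",
                       "replaced the power supply in my mac"])

# Show which bigram features were extracted (features containing a space).
print([f for f in vec.get_feature_names_out() if " " in f])
```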