
2026-01-29
Content derived from: J&M Ch. 6
\[ \mathbf{w} \leftarrow \mathbf{w} + \eta (y - \hat{y}) \mathbf{x} \]
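A minimal NumPy sketch of one such update, assuming a perceptron-style unit with a step activation (with a sigmoid prediction the same rule is the logistic-regression gradient step); all values here are made up:

```python
import numpy as np

# Hypothetical toy example: one training step of the update rule above.
rng = np.random.default_rng(0)
w = rng.normal(size=3)          # weights (bias folded in via a constant input)
x = np.array([1.0, 0.5, -1.2])  # input with a leading 1 for the bias term
y = 1.0                         # gold label
eta = 0.1                       # learning rate

y_hat = float(w @ x > 0)        # step activation: predict 0 or 1
w = w + eta * (y - y_hat) * x   # w <- w + eta * (y - y_hat) * x
```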

Solution: Add a hidden layer to create a non-linear decision boundary.
\[ \frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}} \cdot \frac{\partial \mathbf{a}}{\partial \theta} \]
Applications span:
\[ h = f(\mathbf{w}^\top \mathbf{x} + b) \]
\[ \hat{y} = f^{[L]}(f^{[L-1]}(\cdots f^{[1]}(\mathbf{x})\cdots)) \]
\[ \mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b} \]
\[ \mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta) \]
Neural networks use dot products to measure relevance between vectors.
\[ (\mathbf{Q}\mathbf{K}^\top)_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j \]
One matrix multiply computes all 12 similarities in parallel!
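The count of 12 presumably comes from a small running example; the sketch below, assuming 3 queries and 4 keys of dimension 8, reproduces it with a single matrix product:

```python
import numpy as np

# Hypothetical shapes: 3 queries and 4 keys of dimension d=8,
# so Q @ K.T holds all 3*4 = 12 pairwise dot products at once.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # one query vector per row
K = rng.normal(size=(4, 8))   # one key vector per row

scores = Q @ K.T              # scores[i, j] == Q[i] . K[j]
assert np.allclose(scores[1, 2], Q[1] @ K[2])
print(scores.shape)           # (3, 4)
```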
After computing \(\mathbf{z}\), we apply an activation function elementwise:
\[ \mathbf{h} = \phi(\mathbf{z}) \]

Why ReLU dominates: its gradient is a constant 1 for positive inputs, which helps prevent vanishing gradients in deep networks.
\[ \operatorname{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^d \exp(z_j)} \]
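A minimal NumPy sketch of these two activations; the max-shift inside softmax is a standard numerical-stability trick, not part of the definition:

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: max(0, z); gradient is 1 for positive inputs."""
    return np.maximum(0.0, z)

def softmax(z):
    """Numerically stable softmax: subtracting max(z) leaves the result unchanged."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])
print(relu(z))           # [2. 1. 0.]
print(softmax(z).sum())  # 1.0
```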
For layer \(i\):
\[ \mathbf{x}^{[i]} = f^{[i]}\left( \mathbf{W}^{[i]} \mathbf{x}^{[i-1]} + \mathbf{b}^{[i]} \right) \]
| Location | Common Choice | Purpose |
|---|---|---|
| Hidden layers | ReLU, GELU | Introduce nonlinearity, sparse activation |
| Binary output | Sigmoid | Probability in [0,1] |
| Multi-class output | Softmax | Probability distribution |
| Regression output | None (linear) | Unbounded real values |
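Putting the pieces together, a sketch of a full forward pass with made-up layer sizes, using ReLU in the hidden layer and softmax at the output as in the table above:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical layer sizes: 4 inputs -> 5 hidden units -> 3 classes.
rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(5, 4)) * 0.1, np.zeros(5)),  # W[1], b[1]
    (rng.normal(size=(3, 5)) * 0.1, np.zeros(3)),  # W[2], b[2]
]

def forward(x, params):
    """x[i] = f[i](W[i] x[i-1] + b[i]): ReLU in hidden layers, softmax at the output."""
    h = x
    for i, (W, b) in enumerate(params):
        z = W @ h + b
        h = softmax(z) if i == len(params) - 1 else relu(z)
    return h

x = rng.normal(size=4)
y_hat = forward(x, params)
print(y_hat, y_hat.sum())  # class probabilities summing to 1
```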
Theorem: A feedforward network with one hidden layer, a suitable nonlinear activation, and sufficiently many neurons can approximate any continuous function on a compact domain to arbitrary precision.
\[ \forall \epsilon > 0, \exists \hat{f}: \sup_{x \in K} |f(x) - \hat{f}(x)| < \epsilon \]
But this doesn’t mean shallow networks are always practical: the required hidden layer can be enormously wide, and deeper networks often represent the same function with far fewer parameters.
\[ \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \ell\big(f_\theta(\mathbf{x}_i), y_i\big) \]
Cross-entropy loss for classification:
\[ \ell_{\text{CE}}(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{k=1}^K y_k \log \hat{y}_k \]
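A sketch of this loss for a single example with hypothetical predictions; averaging over a batch gives the empirical loss \(\mathcal{L}(\theta)\) above:

```python
import numpy as np

def cross_entropy(y_hat, y):
    """-sum_k y_k log y_hat_k for one example (y is a one-hot vector)."""
    return -np.sum(y * np.log(y_hat + 1e-12))  # small constant avoids log(0)

# Hypothetical 3-class example: the gold class is index 1.
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])
print(cross_entropy(y_hat, y))  # -log(0.7) ~ 0.357
```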
| Task | Input \(\mathbf{x}\) | Output \(y\) |
|---|---|---|
| POS tagging | Word + context | POS tag (noun, verb, etc.) |
| Sentiment | Sentence/document | Sentiment class |
| NER | Word + context | Entity type or O |
| Text classification | Document | Topic/category |
\[ \frac{\partial L}{\partial W^{[l]}} = \delta^{[l]} (a^{[l-1]})^T \]
where \(\delta^{[l]} = \frac{\partial L}{\partial z^{[l]}}\) is the error signal at layer \(l\).
The recursive error signal computation:
\[ \delta^{[l]} = \left( W^{[l+1]}\right)^T \delta^{[l+1]} \odot \sigma' (z^{[l]}) \]
Weight update rule:
\[ W^{[l]} \leftarrow W^{[l]} - \eta \frac{\partial L}{\partial W^{[l]}} \]
\[ b^{[l]} \leftarrow b^{[l]} - \eta \frac{\partial L}{\partial b^{[l]}} \]
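A worked sketch of one backprop step for a hypothetical two-layer network, assuming a ReLU hidden layer and a softmax output trained with cross-entropy (which gives the standard shortcut \(\delta^{[2]} = \hat{\mathbf{y}} - \mathbf{y}\)); sizes and data are made up:

```python
import numpy as np

# Hypothetical 2-layer network (4 -> 5 -> 3) trained on one example,
# following the delta recursion and update rule above.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)) * 0.1, np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)) * 0.1, np.zeros(3)
x = rng.normal(size=4)
y = np.array([0.0, 1.0, 0.0])   # one-hot gold label
eta = 0.1

# Forward pass, keeping intermediate values for backprop.
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)                              # ReLU
z2 = W2 @ a1 + b2
y_hat = np.exp(z2 - z2.max()); y_hat /= y_hat.sum()   # softmax

# Backward pass: error signals (deltas) layer by layer.
delta2 = y_hat - y                    # dL/dz2 for softmax + cross-entropy
delta1 = (W2.T @ delta2) * (z1 > 0)   # (W[l+1]^T delta[l+1]) elementwise ReLU'(z1)

# Gradients (outer products with the layer inputs) and gradient-descent updates.
W2 -= eta * np.outer(delta2, a1); b2 -= eta * delta2
W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1
```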

Gradient descent follows the steepest downhill direction; step size η determines how far we move each update.
Common misconception: Backprop = training algorithm
Reality: backpropagation only computes gradients; a separate optimizer (SGD, Adam, etc.) uses those gradients to update the parameters.
Stochastic Gradient Descent:
\[ \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t; \mathcal{B}_t) \]
Adam update equations:
\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta_t) \]
\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta_t))^2 \]
\[ \theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{v_t} + \epsilon} \]

Adam combines momentum’s smoothing with per-parameter learning-rate scaling, which usually gives faster and more robust convergence than plain SGD.
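A sketch of one Adam step in NumPy. Note that the equations above omit the bias-correction terms \(\hat{m}_t = m_t/(1-\beta_1^t)\) and \(\hat{v}_t = v_t/(1-\beta_2^t)\) from the original algorithm; the sketch includes them:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias correction (not shown in the slide equations)
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2*theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 201):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, eta=0.05)
print(theta)  # near 0
```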
Dropout: “Improving neural networks by preventing co-adaptation of feature detectors” (Hinton et al., 2012) randomly deactivates hidden units during training so that feature detectors cannot co-adapt.
The same recursion, applied across many layers, multiplies the error signal by \((W^{[l+1]})^T\) and \(\sigma'(z^{[l]})\) at every step, so it can shrink (vanish) or blow up (explode) exponentially with depth:
\[ \delta^{[l]} = (W^{[l+1]})^T \delta^{[l+1]} \odot \sigma'(z^{[l]}) \]
Solutions: careful weight initialization, residual (skip) connections, and normalization layers.
Xavier/Glorot initialization:
\[ \text{Var}[W_{ij}] = \frac{2}{n_{in} + n_{out}} \]
He initialization (for ReLU):
\[ \text{Var}[W_{ij}] = \frac{2}{n_{in}} \]
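A sketch of both schemes, drawing weights from a normal distribution with the stated variances (layer sizes are hypothetical):

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Glorot/Xavier: Var[W_ij] = 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_init(n_in, n_out, rng):
    """He: Var[W_ij] = 2 / n_in, suited to ReLU layers."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = he_init(512, 512, rng)
print(W.var())  # empirically close to 2/512 ~ 0.0039
```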
\[ \mathbf{h}^{[l+1]} = f(\mathbf{h}^{[l]}) + \mathbf{h}^{[l]} \]
Why it helps: Gradient flows directly through the skip connection:
\[ \frac{\partial \mathbf{h}^{[l+1]}}{\partial \mathbf{h}^{[l]}} = \frac{\partial f}{\partial \mathbf{h}^{[l]}} + \mathbf{I} \]
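A minimal sketch of a residual block, assuming a single ReLU layer as \(f\) and equal input/output widths (required for the addition):

```python
import numpy as np

def residual_block(h, W, b):
    """h[l+1] = f(h[l]) + h[l]: the skip connection adds the input back unchanged,
    so the gradient always has a direct identity path through the block."""
    return np.maximum(0.0, W @ h + b) + h

# Hypothetical width d=8 for both input and output.
rng = np.random.default_rng(0)
d = 8
W, b = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
h = rng.normal(size=d)
print(residual_block(h, W, b).shape)  # (8,)
```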

\[ \text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sigma + \epsilon} + \beta \]

| | BatchNorm | LayerNorm | RMSNorm |
|---|---|---|---|
| Normalizes | Batch dim | Feature dim | Feature dim |
| Train/Test | Different | Same | Same |
| Used in | CNNs | Transformers | LLaMA, Gemma |
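A sketch of the two feature-dimension normalizations for a single vector (BatchNorm, which needs batch statistics, is not shown); \(\gamma\) and \(\beta\) are set to identity values here, and the RMSNorm form is an addition based on its standard definition:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature dimension, then scale and shift."""
    return gamma * (x - x.mean()) / (x.std() + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm drops the mean subtraction and the shift: rescale by the RMS only."""
    rms = np.sqrt(np.mean(x**2) + eps)
    return gamma * x / rms

x = np.array([2.0, -1.0, 0.5, 3.0])
gamma, beta = np.ones(4), np.zeros(4)
print(layer_norm(x, gamma, beta))  # approximately zero mean, unit variance
print(rms_norm(x, gamma))
```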
\[ \mathbf{h}_t = f(\mathbf{W}_{xh} \mathbf{x}_t + \mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{b}_h) \]
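A sketch of this recurrence unrolled over a hypothetical 10-step sequence, with tanh as \(f\) and made-up dimensions:

```python
import numpy as np

# Hypothetical sizes: 4-dimensional inputs, 6-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(6, 4)) * 0.1
W_hh = rng.normal(size=(6, 6)) * 0.1
b_h = np.zeros(6)

xs = rng.normal(size=(10, 4))   # a sequence of 10 input vectors
h = np.zeros(6)                 # initial hidden state

# Each h_t depends on h_{t-1}, so this loop cannot be parallelized over time.
for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
print(h.shape)  # (6,)
```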
Implications: each \(\mathbf{h}_t\) depends on \(\mathbf{h}_{t-1}\), so computation is inherently sequential (no parallelism across time steps), and information from early tokens must survive many state updates to influence later ones.
Autoregressive (left-to-right): Each position only sees the past
\[ P(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} P(x_t | x_1, \ldots, x_{t-1}) \]
Bidirectional: Each position sees the entire sequence
\[ \text{output}_t = \sum_{j=1}^{t} \alpha_{tj} \cdot \mathbf{v}_j \]
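A sketch of causal (masked) attention for one head, with made-up shapes; the \(1/\sqrt{d}\) scaling of the raw dot products is an addition beyond the formula above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical single-head attention over a length-5 sequence with d=8.
rng = np.random.default_rng(0)
n, d = 5, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

scores = Q @ K.T / np.sqrt(d)                       # scaled dot-product similarities
mask = np.triu(np.ones((n, n)), k=1).astype(bool)   # positions j > t
scores[mask] = -np.inf                              # position t may only attend to j <= t
alpha = softmax(scores)                             # attention weights alpha_tj
output = alpha @ V                                  # output_t = sum_j alpha_tj * v_j
print(output.shape)                                 # (5, 8)
```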

Key question: How do we inject position information into parallel architectures?
Add a position-dependent vector to each token embedding:
\[ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) \]
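A sketch that builds the full table of sinusoidal encodings for hypothetical sizes:

```python
import numpy as np

def sinusoidal_positions(n_positions, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(n_positions)[:, None]      # (n_positions, 1)
    two_i = np.arange(0, d, 2)[None, :]        # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_positions(50, 16)
print(pe.shape)  # (50, 16): one d-dimensional vector per position
# Added to the token embeddings: x_pos = embedding[token] + pe[pos]
```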

Key properties: Each position gets a unique encoding; relative positions computable via linear transformation; generalizes to longer sequences.
| Task | Architecture | Output |
|---|---|---|
| Text classification | Feedforward / CNN / Transformer | Class probabilities (softmax) |
| Language modeling | RNN / Transformer | Next token probabilities |
| Sequence labeling | BiLSTM / Transformer | Tag per token |
| Machine translation | Encoder-decoder | Target sequence |
Applications: Sentiment analysis, spam detection, topic classification
\[ P(w_t | w_1, \ldots, w_{t-1}) = \text{softmax}(\mathbf{W} \mathbf{h}_{t-1} + \mathbf{b}) \]
Bengio et al. (2003) neural language model: word embeddings → hidden layer → softmax over vocabulary
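A sketch of that embeddings-to-hidden-to-softmax pipeline for one prediction; the vocabulary size, dimensions, context length, and word ids are entirely made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: vocabulary of 1000 words, 32-dim embeddings,
# a 2-word context, and a 64-dim hidden layer (Bengio et al. 2003 style).
rng = np.random.default_rng(0)
V, d_emb, context, d_h = 1000, 32, 2, 64
E = rng.normal(size=(V, d_emb)) * 0.1                 # embedding matrix
W_h = rng.normal(size=(d_h, context * d_emb)) * 0.1
b_h = np.zeros(d_h)
W_o = rng.normal(size=(V, d_h)) * 0.1
b_o = np.zeros(V)

history = [12, 407]                                   # previous word ids (made up)
x = np.concatenate([E[w] for w in history])           # look up and concatenate embeddings
h = np.tanh(W_h @ x + b_h)                            # hidden layer
p_next = softmax(W_o @ h + b_o)                       # distribution over the vocabulary
print(p_next.shape, p_next.sum())                     # (1000,) 1.0
```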
Textbooks: Jurafsky & Martin, Speech and Language Processing (Ch. 6)
Coming up next: Transformers and attention mechanisms
Resources: