
2026-01-29
Content derived from: J&M Ch. 6
\[ \mathbf{w} \leftarrow \mathbf{w} + \eta (y - \hat{y}) \mathbf{x} \]
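A minimal NumPy sketch of one such update, assuming a perceptron-style unit with a step activation (with a sigmoid prediction the same rule is the logistic-regression gradient step); all values here are made up:

```python
import numpy as np

# Hypothetical toy example: one training step of the update rule above.
rng = np.random.default_rng(0)
w = rng.normal(size=3)          # weights (bias folded in via a constant input)
x = np.array([1.0, 0.5, -1.2])  # input with a leading 1 for the bias term
y = 1.0                         # gold label
eta = 0.1                       # learning rate

y_hat = float(w @ x > 0)        # step activation: predict 0 or 1
w = w + eta * (y - y_hat) * x   # w <- w + eta * (y - y_hat) * x
```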

Solution: Add a hidden layer to create a non-linear decision boundary.
\[ \frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}} \cdot \frac{\partial \mathbf{a}}{\partial \theta} \]
Applications span:
\[ h = f(\mathbf{w}^\top \mathbf{x} + b) \]
\[ \hat{y} = f^{[L]}(f^{[L-1]}(\cdots f^{[1]}(\mathbf{x})\cdots)) \]
\[ \mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b} \]
\[ \mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta) \]
Neural networks use dot products to measure relevance between vectors.
\[ (\mathbf{Q}\mathbf{K}^\top)_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j \]
One matrix multiply computes all 12 similarities in parallel!
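The count of 12 presumably comes from a small running example; the sketch below, assuming 3 queries and 4 keys of dimension 8, reproduces it with a single matrix product:

```python
import numpy as np

# Hypothetical shapes: 3 queries and 4 keys of dimension d=8,
# so Q @ K.T holds all 3*4 = 12 pairwise dot products at once.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # one query vector per row
K = rng.normal(size=(4, 8))   # one key vector per row

scores = Q @ K.T              # scores[i, j] == Q[i] . K[j]
assert np.allclose(scores[1, 2], Q[1] @ K[2])
print(scores.shape)           # (3, 4)
```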
After computing \(\mathbf{z}\), we apply an activation function elementwise:
\[ \mathbf{h} = \phi(\mathbf{z}) \]

Why ReLU dominates: its gradient is a constant 1 for positive inputs, which helps prevent vanishing gradients in deep networks.
\[ \operatorname{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^d \exp(z_j)} \]
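A minimal NumPy sketch of these two activations; the max-shift inside softmax is a standard numerical-stability trick, not part of the definition:

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: max(0, z); gradient is 1 for positive inputs."""
    return np.maximum(0.0, z)

def softmax(z):
    """Numerically stable softmax: subtracting max(z) leaves the result unchanged."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])
print(relu(z))           # [2. 1. 0.]
print(softmax(z).sum())  # 1.0
```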
For layer \(i\):
\[ \mathbf{x}^{[i]} = f^{[i]}\left( \mathbf{W}^{[i]} \mathbf{x}^{[i-1]} + \mathbf{b}^{[i]} \right) \]
| Location | Common Choice | Purpose |
|---|---|---|
| Hidden layers | ReLU, GELU | Introduce nonlinearity, sparse activation |
| Binary output | Sigmoid | Probability in [0,1] |
| Multi-class output | Softmax | Probability distribution |
| Regression output | None (linear) | Unbounded real values |
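Putting the pieces together, a sketch of a full forward pass with made-up layer sizes, using ReLU in the hidden layer and softmax at the output as in the table above:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical layer sizes: 4 inputs -> 5 hidden units -> 3 classes.
rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(5, 4)) * 0.1, np.zeros(5)),  # W[1], b[1]
    (rng.normal(size=(3, 5)) * 0.1, np.zeros(3)),  # W[2], b[2]
]

def forward(x, params):
    """x[i] = f[i](W[i] x[i-1] + b[i]): ReLU in hidden layers, softmax at the output."""
    h = x
    for i, (W, b) in enumerate(params):
        z = W @ h + b
        h = softmax(z) if i == len(params) - 1 else relu(z)
    return h

x = rng.normal(size=4)
y_hat = forward(x, params)
print(y_hat, y_hat.sum())  # class probabilities summing to 1
```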
Theorem: A feedforward network with one hidden layer, a suitable nonlinear activation, and sufficiently many neurons can approximate any continuous function on a compact domain to arbitrary precision.
\[ \forall \epsilon > 0, \exists \hat{f}: \sup_{x \in K} |f(x) - \hat{f}(x)| < \epsilon \]
But this doesn’t mean shallow networks are always practical: the required hidden layer can be enormously wide, and deeper networks often represent the same function with far fewer parameters.
\[ \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \ell\big(f_\theta(\mathbf{x}_i), y_i\big) \]
Cross-entropy loss for classification:
\[ \ell_{\text{CE}}(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{k=1}^K y_k \log \hat{y}_k \]
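A sketch of this loss for a single example with hypothetical predictions; averaging over a batch gives the empirical loss \(\mathcal{L}(\theta)\) above:

```python
import numpy as np

def cross_entropy(y_hat, y):
    """-sum_k y_k log y_hat_k for one example (y is a one-hot vector)."""
    return -np.sum(y * np.log(y_hat + 1e-12))  # small constant avoids log(0)

# Hypothetical 3-class example: the gold class is index 1.
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])
print(cross_entropy(y_hat, y))  # -log(0.7) ~ 0.357
```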
| Task | Input \(\mathbf{x}\) | Output \(y\) |
|---|---|---|
| POS tagging | Word + context | POS tag (noun, verb, etc.) |
| Sentiment | Sentence/document | Sentiment class |
| NER | Word + context | Entity type or O |
| Text classification | Document | Topic/category |
\[ \frac{\partial L}{\partial W^{[l]}} = \delta^{[l]} (a^{[l-1]})^T \]
where \(\delta^{[l]} = \frac{\partial L}{\partial z^{[l]}}\) is the error signal at layer \(l\).
The recursive error signal computation:
\[ \delta^{[l]} = \left( W^{[l+1]}\right)^T \delta^{[l+1]} \odot \sigma' (z^{[l]}) \]
Weight update rule:
\[ W^{[l]} \leftarrow W^{[l]} - \eta \frac{\partial L}{\partial W^{[l]}} \]
\[ b^{[l]} \leftarrow b^{[l]} - \eta \frac{\partial L}{\partial b^{[l]}} \]
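A worked sketch of one backprop step for a hypothetical two-layer network, assuming a ReLU hidden layer and a softmax output trained with cross-entropy (which gives the standard shortcut \(\delta^{[2]} = \hat{\mathbf{y}} - \mathbf{y}\)); sizes and data are made up:

```python
import numpy as np

# Hypothetical 2-layer network (4 -> 5 -> 3) trained on one example,
# following the delta recursion and update rule above.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)) * 0.1, np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)) * 0.1, np.zeros(3)
x = rng.normal(size=4)
y = np.array([0.0, 1.0, 0.0])   # one-hot gold label
eta = 0.1

# Forward pass, keeping intermediate values for backprop.
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)                              # ReLU
z2 = W2 @ a1 + b2
y_hat = np.exp(z2 - z2.max()); y_hat /= y_hat.sum()   # softmax

# Backward pass: error signals (deltas) layer by layer.
delta2 = y_hat - y                    # dL/dz2 for softmax + cross-entropy
delta1 = (W2.T @ delta2) * (z1 > 0)   # (W[l+1]^T delta[l+1]) elementwise ReLU'(z1)

# Gradients (outer products with the layer inputs) and gradient-descent updates.
W2 -= eta * np.outer(delta2, a1); b2 -= eta * delta2
W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1
```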

Gradient descent follows the steepest downhill direction; step size η determines how far we move each update.
Common misconception: Backprop = training algorithm
Reality: backpropagation only computes gradients; a separate optimizer (SGD, Adam, etc.) uses those gradients to update the parameters.
Stochastic Gradient Descent:
\[ \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t; \mathcal{B}_t) \]
Adam update equations:
\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta_t) \]
\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta_t))^2 \]
\[ \theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{v_t} + \epsilon} \]

Adam combines momentum’s smoothing with per-parameter learning-rate scaling, which usually gives faster and more robust convergence than plain SGD.
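A sketch of one Adam step in NumPy. Note that the equations above omit the bias-correction terms \(\hat{m}_t = m_t/(1-\beta_1^t)\) and \(\hat{v}_t = v_t/(1-\beta_2^t)\) from the original algorithm; the sketch includes them:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias correction (not shown in the slide equations)
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2*theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 201):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, eta=0.05)
print(theta)  # near 0
```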
Dropout: “Improving neural networks by preventing co-adaptation of feature detectors” (Hinton et al., 2012) randomly deactivates hidden units during training so that feature detectors cannot co-adapt.
The same recursion, applied across many layers, multiplies the error signal by \((W^{[l+1]})^T\) and \(\sigma'(z^{[l]})\) at every step, so it can shrink (vanish) or blow up (explode) exponentially with depth:
\[ \delta^{[l]} = (W^{[l+1]})^T \delta^{[l+1]} \odot \sigma'(z^{[l]}) \]
Solutions: careful weight initialization, residual (skip) connections, and normalization layers.
Xavier/Glorot initialization:
\[ \text{Var}[W_{ij}] = \frac{2}{n_{in} + n_{out}} \]
He initialization (for ReLU):
\[ \text{Var}[W_{ij}] = \frac{2}{n_{in}} \]
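A sketch of both schemes, drawing weights from a normal distribution with the stated variances (layer sizes are hypothetical):

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Glorot/Xavier: Var[W_ij] = 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_init(n_in, n_out, rng):
    """He: Var[W_ij] = 2 / n_in, suited to ReLU layers."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = he_init(512, 512, rng)
print(W.var())  # empirically close to 2/512 ~ 0.0039
```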
\[ \mathbf{h}^{[l+1]} = f(\mathbf{h}^{[l]}) + \mathbf{h}^{[l]} \]
Why it helps: Gradient flows directly through the skip connection:
\[ \frac{\partial \mathbf{h}^{[l+1]}}{\partial \mathbf{h}^{[l]}} = \frac{\partial f}{\partial \mathbf{h}^{[l]}} + \mathbf{I} \]
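A minimal sketch of a residual block, assuming a single ReLU layer as \(f\) and equal input/output widths (required for the addition):

```python
import numpy as np

def residual_block(h, W, b):
    """h[l+1] = f(h[l]) + h[l]: the skip connection adds the input back unchanged,
    so the gradient always has a direct identity path through the block."""
    return np.maximum(0.0, W @ h + b) + h

# Hypothetical width d=8 for both input and output.
rng = np.random.default_rng(0)
d = 8
W, b = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
h = rng.normal(size=d)
print(residual_block(h, W, b).shape)  # (8,)
```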

\[ \text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sigma + \epsilon} + \beta \]

| | BatchNorm | LayerNorm | RMSNorm |
|---|---|---|---|
| Normalizes | Batch dim | Feature dim | Feature dim |
| Train/Test | Different | Same | Same |
| Used in | CNNs | Transformers | LLaMA, Gemma |
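A sketch of the two feature-dimension normalizations for a single vector (BatchNorm, which needs batch statistics, is not shown); \(\gamma\) and \(\beta\) are set to identity values here, and the RMSNorm form is an addition based on its standard definition:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature dimension, then scale and shift."""
    return gamma * (x - x.mean()) / (x.std() + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm drops the mean subtraction and the shift: rescale by the RMS only."""
    rms = np.sqrt(np.mean(x**2) + eps)
    return gamma * x / rms

x = np.array([2.0, -1.0, 0.5, 3.0])
gamma, beta = np.ones(4), np.zeros(4)
print(layer_norm(x, gamma, beta))  # approximately zero mean, unit variance
print(rms_norm(x, gamma))
```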
\[ \mathbf{h}_t = f(\mathbf{W}_{xh} \mathbf{x}_t + \mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{b}_h) \]
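A sketch of this recurrence unrolled over a hypothetical 10-step sequence, with tanh as \(f\) and made-up dimensions:

```python
import numpy as np

# Hypothetical sizes: 4-dimensional inputs, 6-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(6, 4)) * 0.1
W_hh = rng.normal(size=(6, 6)) * 0.1
b_h = np.zeros(6)

xs = rng.normal(size=(10, 4))   # a sequence of 10 input vectors
h = np.zeros(6)                 # initial hidden state

# Each h_t depends on h_{t-1}, so this loop cannot be parallelized over time.
for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
print(h.shape)  # (6,)
```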
Implications: each \(\mathbf{h}_t\) depends on \(\mathbf{h}_{t-1}\), so computation is inherently sequential (no parallelism across time steps), and information from early tokens must survive many state updates to influence later ones.
Autoregressive (left-to-right): Each position only sees the past
\[ P(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} P(x_t | x_1, \ldots, x_{t-1}) \]
Bidirectional: Each position sees the entire sequence
\[ \text{output}_t = \sum_{j=1}^{t} \alpha_{tj} \cdot \mathbf{v}_j \]
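A sketch of causal (masked) attention for one head, with made-up shapes; the \(1/\sqrt{d}\) scaling of the raw dot products is an addition beyond the formula above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical single-head attention over a length-5 sequence with d=8.
rng = np.random.default_rng(0)
n, d = 5, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

scores = Q @ K.T / np.sqrt(d)                       # scaled dot-product similarities
mask = np.triu(np.ones((n, n)), k=1).astype(bool)   # positions j > t
scores[mask] = -np.inf                              # position t may only attend to j <= t
alpha = softmax(scores)                             # attention weights alpha_tj
output = alpha @ V                                  # output_t = sum_j alpha_tj * v_j
print(output.shape)                                 # (5, 8)
```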

Key question: How do we inject position information into parallel architectures?
Add a position-dependent vector to each token embedding:
\[ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) \]
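A sketch that builds the full table of sinusoidal encodings for hypothetical sizes:

```python
import numpy as np

def sinusoidal_positions(n_positions, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(n_positions)[:, None]      # (n_positions, 1)
    two_i = np.arange(0, d, 2)[None, :]        # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_positions(50, 16)
print(pe.shape)  # (50, 16): one d-dimensional vector per position
# Added to the token embeddings: x_pos = embedding[token] + pe[pos]
```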

Key properties: Each position gets a unique encoding; relative positions computable via linear transformation; generalizes to longer sequences.
| Task | Architecture | Output |
|---|---|---|
| Text classification | Feedforward / CNN / Transformer | Class probabilities (softmax) |
| Language modeling | RNN / Transformer | Next token probabilities |
| Sequence labeling | BiLSTM / Transformer | Tag per token |
| Machine translation | Encoder-decoder | Target sequence |
Applications: Sentiment analysis, spam detection, topic classification
\[ P(w_t | w_1, \ldots, w_{t-1}) = \text{softmax}(\mathbf{W} \mathbf{h}_{t-1} + \mathbf{b}) \]
Bengio et al. (2003) neural language model: word embeddings → hidden layer → softmax over vocabulary
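A sketch of that embeddings-to-hidden-to-softmax pipeline for one prediction; the vocabulary size, dimensions, context length, and word ids are entirely made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: vocabulary of 1000 words, 32-dim embeddings,
# a 2-word context, and a 64-dim hidden layer (Bengio et al. 2003 style).
rng = np.random.default_rng(0)
V, d_emb, context, d_h = 1000, 32, 2, 64
E = rng.normal(size=(V, d_emb)) * 0.1                 # embedding matrix
W_h = rng.normal(size=(d_h, context * d_emb)) * 0.1
b_h = np.zeros(d_h)
W_o = rng.normal(size=(V, d_h)) * 0.1
b_o = np.zeros(V)

history = [12, 407]                                   # previous word ids (made up)
x = np.concatenate([E[w] for w in history])           # look up and concatenate embeddings
h = np.tanh(W_h @ x + b_h)                            # hidden layer
p_next = softmax(W_o @ h + b_o)                       # distribution over the vocabulary
print(p_next.shape, p_next.sum())                     # (1000,) 1.0
```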
Textbooks: Jurafsky & Martin, Speech and Language Processing (Ch. 6)
Coming up next: Transformers and attention mechanisms
Resources: