2026-02-19
Throughout this lecture, we’ll follow a concrete scenario: adapting an LLM to serve as a customer support chatbot for an e-commerce company.
State Change: Why Adapt?
Pretrained LLMs are powerful but general-purpose. We need principled methods to specialize them for downstream tasks.
Pretraining objective:
\[ \mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^T \log p_\theta(w_t \mid w_{<t}) \]
Finetuning objective on labeled data \((x_i, y_i)\):
\[ \mathcal{L}_{\text{finetune}} = -\sum_{i=1}^N \log p_{\theta'}(y_i \mid x_i) \]
\[ \theta^* = \arg\min_\theta \mathcal{L}_{\text{task}}(f(x; \theta), y) \]
We often combine mitigations. The strategies are complementary: tuning the learning rate, choosing which layers update, limiting how long you train, and selecting what data you train on.
State Change: Efficient Adaptation
Full finetuning is expensive and risky. Parameter-efficient finetuning (PEFT) methods update only a small fraction of parameters while achieving competitive performance.

Key idea: Task-specific weight changes lie in a low-dimensional subspace.
\[ W = W_0 + \Delta W, \quad \Delta W \approx BA \]
Initialization: \(A\) is initialized with random Gaussian values; \(B\) is initialized to zero. This ensures \(\Delta W = BA = 0\) at the start of training, so the model begins from exactly the pretrained behavior.
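The initialization above can be checked with a toy example. This is a minimal sketch in plain Python, not a library implementation: the helper names (`init_lora`, `lora_forward`), the toy dimensions, and the Gaussian scale of 0.01 are all assumptions made for illustration.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def init_lora(d_out, d_in, r, seed=0):
    """Return (B, A): B is d_out x r zeros, A is r x d_in Gaussian noise."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]  # scale is an assumption
    B = [[0.0] * r for _ in range(d_out)]
    return B, A

def lora_forward(W0, B, A, x):
    """Compute (W0 + B A) x without materializing the d_out x d_in update."""
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W0]
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]    # r values
    BAx = [sum(b * z for b, z in zip(row, Ax)) for row in B]    # d_out values
    return [p + q for p, q in zip(base, BAx)]

d, r = 6, 2
W0 = [[float(i == j) for j in range(d)] for i in range(d)]  # toy "pretrained" weight (identity)
B, A = init_lora(d, d, r)
delta_W = matmul(B, A)            # all zeros because B is zero
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = lora_forward(W0, B, A, x)     # equals W0 x at initialization
```

Because only `B` starts at zero, gradients still flow into `B` through the nonzero `A` on the first step, so training can move away from the pretrained behavior immediately.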
| Method | Parameters per layer | Ratio |
|---|---|---|
| Full finetuning | 16,777,216 | 100% |
| LoRA (r=8) | 65,536 | 0.39% |
| LoRA (r=4) | 32,768 | 0.20% |
QLoRA breakthrough: Dettmers et al. (2023) combined 4-bit quantization with LoRA, enabling finetuning of a 65B-parameter model on a single 48GB GPU — a task that previously required multiple high-end GPUs. This democratized access to large model finetuning.
Setup: A transformer attention layer with \(d = 4096\), rank \(r = 8\)
\[\text{Full FT params per layer} = d^2 = 4096^2 = 16{,}777{,}216\]
\[\text{LoRA params per layer} = 2 \times r \times d = 2 \times 8 \times 4096 = 65{,}536\]
\[\text{Compression ratio} = \frac{65{,}536}{16{,}777{,}216} = 0.39\%\]
Note: Real transformer layers have rectangular weight matrices (e.g., MLP layers are \(d \times 4d\)). The \(d^2\) figure is illustrative for attention projections where \(d_{\text{model}} = d_{\text{head}} \times n_{\text{heads}}\).
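The arithmetic above generalizes to rectangular matrices: a rank-\(r\) update to a \(d_{\text{out}} \times d_{\text{in}}\) weight needs \(r(d_{\text{out}} + d_{\text{in}})\) parameters. A small sketch follows; the function names are hypothetical, and the MLP case simply applies the same formula to a \(4d \times d\) matrix.

```python
def full_params(d_out, d_in):
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in

def lora_params(d_out, d_in, r):
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)

d = 4096
for r in (4, 8):
    n = lora_params(d, d, r)
    ratio = n / full_params(d, d)
    print(f"r={r}: {n:,} params, {ratio:.2%} of full")
# r=4: 32,768 params, 0.20% of full
# r=8: 65,536 params, 0.39% of full

# Rectangular case from the note: an MLP projection of shape 4d x d at r=8.
mlp_lora = lora_params(4 * d, d, 8)
mlp_ratio = mlp_lora / full_params(4 * d, d)
```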
The manifold hypothesis: Natural data lies on low-dimensional manifolds embedded in high-dimensional space. By extension, the task-specific adjustments needed to adapt a pretrained model also occupy a low-dimensional subspace.
| Method | Where it intervenes | Trainable params | Best for |
|---|---|---|---|
| LoRA | Weight matrices (Q, K, V, O, and commonly MLP layers) | 0.1–1% | General default — most tasks |
| Adapters | Bottleneck layers in transformer blocks | 1–5% | Multi-task deployment |
| Prefix Tuning | Learned K/V states prepended per layer | <0.1% | Generation tasks |
| Prompt Tuning | Soft tokens at input embeddings | <0.01% | Scales with model size (Lester et al. 2021: matches finetuning at 10B+ params) |
| QLoRA | 4-bit quantized base + LoRA adapters | 0.1–1% | Consumer-hardware finetuning (Dettmers et al. 2023) |
When the target domain has substantially different vocabulary/distribution (e.g., biomedical, legal, financial), standard SFT may not be enough.
State Change: Post-Training
Post-training transforms base models into capable, aligned assistants. This involves instruction tuning, preference optimization, and — increasingly — RL for reasoning.
SFT objective on instruction-response pairs:
\[ \mathcal{L}_{\mathrm{SFT}} = -\sum_{(x, y) \in \mathcal{D}} \log p_{\theta}(y \mid x) \]
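Since the objective conditions on \(x\), only the response tokens contribute to the loss; the prompt tokens are context. A minimal sketch of that masking is below, assuming per-token log-probabilities are already available from a model. The numbers and the `sft_loss` helper are made up for illustration.

```python
import math

def sft_loss(token_logprobs, is_response):
    """Sum of -log p over response tokens only; prompt positions are skipped."""
    return -sum(lp for lp, keep in zip(token_logprobs, is_response) if keep)

# Hypothetical log-probabilities for a 4-token sequence (prompt + response).
logprobs = [math.log(0.9), math.log(0.5), math.log(0.25), math.log(0.8)]
mask = [False, False, True, True]  # first two tokens belong to the prompt x

loss = sft_loss(logprobs, mask)    # -(log 0.25 + log 0.8) = -log 0.2
```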
The Power (and Limits) of Small Instruction Datasets
Chat templates (<|user|>: ... <|assistant|>: ...) standardize multi-turn interaction
Sourced from the phi-3 model card
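To make the role-marker idea concrete, here is a sketch of a template renderer. The exact marker strings, the `<|end|>` terminator, and the newline placement are illustrative assumptions and not a faithful copy of any model's template; in practice the template shipped with the model's tokenizer should be used (for example via `tokenizer.apply_chat_template` in Hugging Face Transformers).

```python
ALLOWED_ROLES = ("system", "user", "assistant")

def render_chat(messages, add_generation_prompt=True):
    """Flatten a list of {'role', 'content'} dicts into one prompt string."""
    parts = []
    for m in messages:
        if m["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unknown role: {m['role']!r}")
        # Illustrative format: role marker, content, then an end-of-turn marker.
        parts.append(f"<|{m['role']}|>\n{m['content']}<|end|>\n")
    if add_generation_prompt:
        parts.append("<|assistant|>\n")  # cue the model to answer next
    return "".join(parts)

# Hypothetical support-chat history for the e-commerce scenario.
history = [
    {"role": "user", "content": "Where is my order?"},
    {"role": "assistant", "content": "Could you share the order number?"},
    {"role": "user", "content": "It is 12345."},
]
prompt = render_chat(history)
```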
Let’s look at gpt-oss-20b’s prompt template
Step 1: Collect preference pairs \((x, y^+, y^-)\)
Step 2: Train reward model:
\[ \mathbb{P}_\phi(y^+ \succ y^-) = \sigma\big(r_\phi(x, y^+) - r_\phi(x, y^-)\big) \]
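The preference probability above turns into a training loss once the reward model is fit by maximum likelihood, which is the usual choice and is assumed here. The sketch works on scalar reward scores directly; the helper names and example scores are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_prob(r_pos, r_neg):
    """Bradley-Terry probability that the preferred response wins."""
    return sigmoid(r_pos - r_neg)

def reward_model_loss(score_pairs):
    """Mean negative log-likelihood over (r_pos, r_neg) score pairs."""
    nll = [-math.log(preference_prob(rp, rn)) for rp, rn in score_pairs]
    return sum(nll) / len(nll)

# Made-up reward scores: a wider margin for the preferred answer lowers the loss.
narrow = reward_model_loss([(0.2, 0.0), (0.1, 0.0)])
wide = reward_model_loss([(2.0, 0.0), (3.0, 0.0)])
```

Only the difference between the two scores matters, so the reward is identified up to an additive constant per prompt.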
Step 3: Optimize policy via PPO:
\[ \max_\theta \; \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} \big[ r_\phi(x, y) \big] - \beta \, \text{KL}\big[\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\big] \]
Reward Overoptimization (Goodhart’s Law)
Optimizing too aggressively against a proxy reward model degrades true quality. Gao et al. (2023) showed that KL-constrained RL has diminishing and then negative returns — the model learns to exploit reward model blind spots rather than genuinely improve. The KL penalty term in PPO (\(-\beta \, \text{KL}[\pi_\theta \| \pi_{\text{ref}}]\)) mitigates this but doesn’t eliminate it.
DPO directly optimizes preference likelihood:
\[ \mathcal{L}_{\mathrm{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y^+|x)}{\pi_{\text{ref}}(y^+|x)} - \beta \log \frac{\pi_\theta(y^-|x)}{\pi_{\text{ref}}(y^-|x)}\right) \]
Why this works: DPO implicitly defines a reward function \(r(x, y) = \beta \log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}\). The optimum of the KL-regularized RLHF objective can be written in closed form in terms of this reward, so under the Bradley-Terry preference model DPO targets the same optimal policy as RLHF. It fits that policy directly with a classification-style loss on preference pairs, bypassing the explicit reward model and the RL loop.
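The DPO loss for a single preference pair can be computed from four sequence log-probabilities. The sketch below assumes those log-probabilities have already been summed over tokens; the function names and the default \(\beta = 0.1\) are illustrative choices, not values from the lecture.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss for one (x, y+, y-) triple given sequence log-probabilities."""
    implicit_r_pos = beta * (logp_pos - ref_logp_pos)  # implicit reward of y+
    implicit_r_neg = beta * (logp_neg - ref_logp_neg)  # implicit reward of y-
    return -log_sigmoid(implicit_r_pos - implicit_r_neg)

# Policy identical to the reference: zero margin, so the loss is log 2.
at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy that raised y+ and lowered y- relative to the reference: lower loss.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)
```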
What changes in practice?
| | RLHF (PPO) | DPO |
|---|---|---|
| Data needed | Preference pairs + reward model training data | Preference pairs only |
| Infrastructure | 4 models in memory (policy, ref, reward, value) | 2 models (policy, ref) |
| Failure modes | Reward hacking, training instability | Mode collapse if β too low |
| Data freshness | Online (policy-generated) | Offline (fixed dataset) |
| For our chatbot | If you need fine-grained control over helpfulness vs. safety tradeoffs | If you want simpler setup and faster iteration |
RLHF/DPO optimize for human preference (style, helpfulness, safety). A different class of RL post-training optimizes for verifiable correctness (math proofs, code execution, factual accuracy).
GRPO (Group Relative Policy Optimization): Used in DeepSeek-R1. For each prompt, sample a group of responses. Reward = binary (correct/incorrect, verified by execution or ground truth). Policy gradient weighted by advantage relative to group mean. No learned reward model needed.
\[\nabla_\theta J = \mathbb{E}\left[\sum_{i=1}^G \left(\hat{A}_i\right) \nabla_\theta \log \pi_\theta(y_i \mid x)\right]\]
where \(\hat{A}_i = \frac{r_i - \text{mean}(r_{1:G})}{\text{std}(r_{1:G})}\) and \(r_i \in \{0, 1\}\) is verifiable correctness.
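The group-relative advantage can be computed in a few lines. This sketch assumes the population standard deviation and adds a small `eps` to the denominator so that a group with identical rewards (all correct or all incorrect) yields zero advantages instead of a division by zero; both choices are assumptions not stated in the formula above.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled responses."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, two verified correct.
mixed = group_advantages([1, 0, 0, 1])    # roughly [+1, -1, -1, +1]
# Every answer correct: no response stands out, so no learning signal.
uniform = group_advantages([1, 1, 1, 1])  # all zeros
```

A group with uniform rewards contributes nothing to the gradient, which is why prompts that are always solved or never solved carry little training signal under this scheme.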
Other approaches: GRPO is one of several methods in this space. Others include REINFORCE with baseline, expert iteration (Anthony et al., 2017), STaR (Zelikman et al., 2022), and ReST (Gulcehre et al., 2023). The common thread is using verifiable outcomes as reward signal.
Key idea: Instead of making the model bigger (training-time compute), make the model think longer (inference-time compute). RL post-training teaches the model when and how to allocate extra reasoning steps.
State Change: Theoretical Foundations
We’ve seen what post-training does. Now: why does updating a small fraction of parameters on a small dataset produce such large behavioral shifts?
Bayesian framing of transfer learning:
The pretrained model encodes a prior distribution over functions. Post-training is a Bayesian update — conditioning this prior on task-specific evidence to obtain a posterior.
\[ \underbrace{p(\theta \mid \mathcal{D}_{\text{task}})}_{\text{posterior (post-trained model)}} \;\propto\; \underbrace{p(\mathcal{D}_{\text{task}} \mid \theta)}_{\text{likelihood (task loss)}} \;\cdot\; \underbrace{p(\theta)}_{\text{prior (pretrained model)}} \]
MAP estimation makes the Bayesian connection concrete:
Finetuning with an L2 penalty that pulls the weights toward their pretrained values \(\theta_0\) (rather than plain weight decay toward zero) is equivalent to finding the MAP estimate under a Gaussian prior centered at \(\theta_0\):
\[ \theta^* = \arg\max_\theta \underbrace{\log p(\mathcal{D}_{\text{task}} \mid \theta)}_{\text{task log-likelihood}} - \underbrace{\lambda \| \theta - \theta_0 \|^2}_{\text{stay close to pretrained } \theta_0} \]
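To spell out the step from prior to penalty, here is a short derivation assuming an isotropic Gaussian prior centered on the pretrained weights. The variance \(\sigma^2\) is a symbol introduced here for illustration.
\[ p(\theta) = \mathcal{N}(\theta;\, \theta_0,\, \sigma^2 I) \;\Rightarrow\; \log p(\theta) = -\frac{1}{2\sigma^2} \| \theta - \theta_0 \|^2 + \text{const} \]
\[ \arg\max_\theta \big[ \log p(\mathcal{D}_{\text{task}} \mid \theta) + \log p(\theta) \big] = \arg\max_\theta \Big[ \log p(\mathcal{D}_{\text{task}} \mid \theta) - \lambda \| \theta - \theta_0 \|^2 \Big], \quad \lambda = \frac{1}{2\sigma^2} \]
A tighter prior (small \(\sigma^2\)) gives a larger \(\lambda\) and keeps the finetuned weights closer to \(\theta_0\).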
Pretrained LLMs need adaptation for downstream tasks — the pretraining objective doesn’t align with specific applications
Full finetuning updates all parameters but risks catastrophic forgetting; use learning rate scheduling and parameter-efficient methods (LoRA)
LoRA achieves competitive performance by adding low-rank matrices (\(\Delta W \approx BA\)) with <1% of total parameters
Other PEFT methods (adapters, prefix/prompt tuning) offer different trade-offs in parameter count and flexibility
Post-training spans alignment and reasoning: Instruction tuning + RLHF/DPO aligns behavior; outcome-based RL (GRPO) and process reward models improve reasoning capability.
Test-time compute scaling via RL-trained reasoning models represents a new axis of capability improvement beyond scale.
Domain adaptation via continued pretraining specializes models for expert tasks; choose method based on data, compute, and deployment needs
Why it works: Pretraining provides a strong Bayesian prior; post-training is a posterior update. All regularization methods (weight decay, LoRA rank, KL penalty) keep the posterior close to the prior.