- The Problem. Finetuning LLMs produces strange side effects: models hallucinate more, repeat themselves, and training longer sometimes makes them worse. Nobody understood why.
- The Solution. A three-term mathematical framework that tracks how training on one example influences predictions on all other examples. It reveals that models "borrow" phrases across unrelated questions and that pushing down bad answers can accidentally suppress good ones too.
- The Results. ICLR 2025 Outstanding Paper Award. The framework explains hallucination, the "repeater" phenomenon, and why on-policy DPO beats off-policy DPO. It also suggests practical fixes that improve alignment.
Research Overview
Every ML engineer who has finetuned an LLM has encountered the same frustrating pattern: you train the model to give better answers, and it works, but something else breaks. The model starts hallucinating facts it never saw. It repeats the same phrases regardless of the question. And if you train too long, performance actually gets worse.
These aren't bugs in your code. They're fundamental properties of how neural networks learn during finetuning.
This ICLR 2025 Outstanding Paper provides the first unified framework for understanding these dynamics. The authors show that when you train on example A, you're not just changing how the model responds to A. You're changing how it responds to everything else too. And these ripple effects explain most of the strange behaviors we observe.
Finetuning takes a pretrained LLM and further trains it on a smaller, specialized dataset. Common methods include SFT (Supervised Finetuning), where you show the model correct responses, and DPO (Direct Preference Optimization), where you show it pairs of good and bad responses. The goal is to make the model more helpful, harmless, and honest.
The Finetuning Mystery
Before this paper, finetuning was largely a black box. We knew what went in (training data) and what came out (a changed model), but the mechanics in between were opaque.
Consider these puzzling observations:
Hallucination amplification. You finetune a model on factual Q&A pairs. The model gets better at answering those specific questions. But it also starts confidently stating facts that are completely wrong for other questions. Why would training on correct information make the model more wrong?
The repeater phenomenon. After preference tuning, models sometimes start outputting nearly identical responses regardless of the input. Ask about cooking, get the same structure. Ask about coding, same structure. The model learned to prefer certain patterns so strongly that it uses them everywhere.
The DPO cliff. With Direct Preference Optimization, there's a sweet spot. Train too little and the model doesn't learn. Train too long and performance collapses. This isn't overfitting in the traditional sense. Something else is happening.
The authors tackle these mysteries by asking a fundamental question: when we update the model based on one training example, how does that change predictions on every other possible input?
The Three-Term Framework
The paper's core contribution is decomposing the influence of training into three interpretable components:
When you take a gradient step on training example (x, y), the change in the model's prediction on a different input x₀ is approximately:
Δ log π(y | x₀) = -η × A(x₀) × K(x₀, x) × G(x, y)
Each term has a distinct meaning:
A(x₀): The Adjustment Term
This captures the model's current confidence distribution on x₀. If the model is already very confident about certain responses to x₀, this term determines how that confidence gets redistributed.
Think of it as the "shape" of the probability landscape. A peaked distribution (high confidence in one answer) behaves differently than a flat distribution (uncertainty across many answers).
K(x₀, x): The Similarity Kernel
This measures how similar two inputs are in the model's internal representation. High similarity means training on x will strongly affect predictions on x₀. Low similarity means minimal influence.
Think of every training step as a "gradient bomb." The K-term determines the blast radius. If you train the model on a Python coding problem (x), the shockwave travels far, hitting Java and C++ questions (x₀) because the model sees them as "nearby" concepts. The shockwave dissipates before it hits unrelated topics like "French Cooking." Hallucination happens when the model's internal map is wrong—e.g., it thinks "Historical Fiction" is close to "Real History," so the blast radius of a fiction novel accidentally rewrites actual history facts.
The technical term is "empirical Neural Tangent Kernel" (eNTK), but the intuition is simple: questions that activate similar neurons are connected. Change one, and you change both.
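To make the K term concrete, here is a minimal PyTorch sketch: two inputs count as "similar" when the parameter gradients of their outputs point in the same direction, and the inner product of those gradients is the empirical NTK value. The tiny two-layer network and random vectors are stand-ins for illustration only; real eNTK computations run per-token over an LLM's logits.

```python
# Minimal sketch of the eNTK similarity K(x0, x) as a gradient inner product.
# The toy model and random inputs are placeholders, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 8))  # toy "LM"

def param_gradient(x: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the summed outputs w.r.t. all parameters."""
    model.zero_grad()
    model(x).sum().backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

x_train = torch.randn(16)                   # the training question x
x_near = x_train + 0.05 * torch.randn(16)   # a closely related question
x_far = torch.randn(16)                     # an unrelated question

g_train = param_gradient(x_train)
# K(x0, x) ~ gradient inner product: typically much larger for the nearby input,
# which is why training on x moves predictions on x_near but barely touches x_far.
print("K(near, train):", torch.dot(param_gradient(x_near), g_train).item())
print("K(far, train): ", torch.dot(param_gradient(x_far), g_train).item())
```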
G(x, y): The Gradient Signal
This is what the loss function is telling the model to do. For SFT, it pushes up the probability of the correct answer. For DPO, it pushes up preferred responses and pushes down rejected ones. For PPO, it's the reward-weighted policy gradient.
The beauty of this framework is that it works across all finetuning methods. You just swap out the G term for different algorithms.
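Since only G changes between methods, a short sketch makes the swap visible. The single-token logit vector and the stripped-down DPO-style loss below (no reference model, no β) are simplifications for illustration, not the exact objectives from the paper.

```python
# The G term is the gradient of the loss w.r.t. the logits. Swapping the loss
# swaps G; everything else in the decomposition (A and K) stays the same.
import torch
import torch.nn.functional as F

logits = torch.randn(8, requires_grad=True)  # toy vocabulary of 8 tokens
target, rejected = 2, 5

# SFT / cross-entropy: G = softmax(logits) - one_hot(target)
sft_loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target]))
g_sft, = torch.autograd.grad(sft_loss, logits)

# DPO-flavoured signal (single-token caricature): pull the chosen token up
# and the rejected token down. Real DPO also uses a reference model and beta.
logp = F.log_softmax(logits, dim=-1)
dpo_like_loss = -F.logsigmoid(logp[target] - logp[rejected])
g_dpo, = torch.autograd.grad(dpo_like_loss, logits)

print("SFT gradient signal:     ", g_sft)
print("DPO-like gradient signal:", g_dpo)
```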
Why Models Hallucinate
The framework reveals two mechanisms that cause hallucination:
Cross-Question Contamination
When you train the model on "Q: What is photosynthesis? A: Photosynthesis is how plants convert sunlight into energy using chlorophyll," something subtle happens.
The gradient doesn't just increase confidence in this answer for this question. It increases confidence in similar phrases for similar questions. If another question has high K similarity (maybe it's also about biology), the model becomes more likely to mention "chlorophyll" and "sunlight" even when they're irrelevant.
The authors put it directly: models "transfer phrases or facts in the response for question B to answer question A." This isn't a bug. It's how neural networks generalize. But when the generalization goes too far, you get hallucination.
The Repeater Effect
After preference tuning, models sometimes learn that certain response structures are always preferred. Short, confident answers. Bullet points. Certain phrases like "I'd be happy to help."
The K term explains why this spreads everywhere. If the model learns that a certain structure is preferred for some questions, and many questions have non-zero K similarity to those training examples, the preference leaks across the entire input space.
The result: a model that sounds confident and helpful but gives the same type of response regardless of what you ask.
The Squeezing Effect
The paper's most surprising finding explains the DPO cliff: why training longer can make performance worse.
In DPO, you show the model pairs of (preferred response, rejected response) and train it to prefer one over the other. The loss pushes up the preferred response and pushes down the rejected one.
Here's the problem: the rejected response is usually something the model already thought was unlikely. It's in a "valley" of the probability distribution. When you apply a negative gradient to something already in a valley, the probability mass doesn't disappear. It has to go somewhere.
Probability is like water in a waterbed—you can't destroy it, you can only move it. In DPO, you push down on the "rejected" answer.
- On-Policy (Good): You push down on a mountain (a high-probability mistake). The water flows naturally into the valleys (the correct answers).
- Off-Policy (Bad): You push down on a flat spot (a mistake the model wouldn't have made anyway). Since there's no "mountain" to flatten, your push creates pressure that unpredictably bulges up the mattress in random places—often inflating wrong answers or hallucinations.
The "squeezing effect" describes where it goes: primarily to whatever response the model already thought was most likely. This might not be the preferred response. It might be something completely different.
As training continues, this effect compounds. The model becomes increasingly confident in its default responses, even at the expense of the responses you're trying to promote.
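The effect is easy to reproduce with a four-logit toy softmax (illustrative numbers only, not the paper's experiments). One gradient step that pushes down an already-unlikely "rejected" token hands almost all of the displaced probability to the current argmax, and the mid-ranked alternatives actually shrink:

```python
# Toy illustration of the squeezing effect with a 4-token vocabulary.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([4.0, 2.5, 2.0, -2.0])  # index 0: the model's default answer
rejected = 3                               # index 3: an already-unlikely rejected answer

# One gradient-descent step on loss = log pi(rejected). Its gradient w.r.t. the
# logits is one_hot(rejected) - pi, so every other logit rises by eta * pi_j:
# the displaced mass flows toward whatever is already likely.
eta = 2.0
probs = softmax(logits)
grad = -probs
grad[rejected] += 1.0
new_probs = softmax(logits - eta * grad)

print("before:", probs.round(3))      # ~[0.735, 0.164, 0.099, 0.002]
print("after: ", new_probs.round(3))  # ~[0.901, 0.064, 0.034, 0.000]
# The default answer gains probability; the plausible alternatives lose it.
```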
Why On-Policy Methods Work Better
This framework explains a known empirical result: on-policy DPO variants (where you generate fresh rejected samples during training) outperform off-policy DPO (where you use a fixed dataset of rejected samples).
Think of it like studying for an exam:
- Off-policy is like studying with an old answer key from last year. The "wrong answers" you're learning to avoid might be mistakes you'd never make anyway. Pushing down answers you already know are wrong doesn't help much, and it can squeeze probability toward your existing (possibly incorrect) default answers.
- On-policy is like having a teacher grade your current attempts and give fresh feedback. The "wrong answers" are mistakes you're actually making right now. Correcting these redistributes your knowledge more usefully.
With on-policy methods, the rejected samples are responses the model currently thinks are plausible. Pushing them down redistributes probability more evenly. With off-policy methods, you're often pushing down responses the model already considers unlikely, triggering the squeezing effect.
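Reusing the toy softmax from the squeezing sketch above, the contrast is a single function call: push down the model's current top answer (a stand-in for an on-policy rejected sample) versus one it already considered implausible (off-policy). This is a caricature for intuition, not the paper's experiment.

```python
# On-policy vs off-policy push-down in the same 4-token toy model.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def push_down(logits, rejected, eta=2.0):
    """One gradient-descent step on loss = log pi(rejected)."""
    probs = softmax(logits)
    grad = -probs
    grad[rejected] += 1.0
    return softmax(logits - eta * grad)

logits = np.array([4.0, 2.5, 2.0, -2.0])
print("start:     ", softmax(logits).round(3))
# On-policy: the rejected sample is the model's current top choice; its mass
# spreads across the plausible alternatives.
print("on-policy: ", push_down(logits, rejected=0).round(3))
# Off-policy: the rejected sample was already unlikely; the peak absorbs its
# tiny mass plus mass squeezed from the mid-ranked answers.
print("off-policy:", push_down(logits, rejected=3).round(3))
```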
The consequence of this difference shows up clearly in training curves. Off-policy methods hit a performance ceiling and then decline as the squeezing effect compounds. On-policy methods continue improving because each correction actually moves probability where you want it.
Practical Applications
For ML Engineers
1. Monitor confidence distributions during training. Don't just track loss. Track how confident the model is on held-out examples. If confidence on correct answers is dropping, you're likely seeing the squeezing effect.
2. Use smaller learning rates for DPO than SFT. The paper notes this is already common practice, but now there's a theoretical reason: smaller steps reduce the squeezing effect by limiting how much probability mass moves per update.
3. Prefer on-policy methods when possible. Generate fresh rejected samples during training rather than using a fixed dataset. This keeps the rejected samples in regions where pushing them down has beneficial redistribution effects.
4. Watch for cross-contamination. If you're finetuning on domain-specific data, test on adjacent domains. Hallucination often appears in related but distinct areas first.
For Alignment Researchers
The framework provides a lens for understanding why certain alignment techniques work or fail. If a method is accidentally triggering the squeezing effect, it might appear to align the model while actually making it more confident in unaligned responses.
For Production Systems
The "repeater" phenomenon is particularly relevant for deployed systems. A model that passes your test suite but produces repetitive responses in production has likely overfit to structural preferences in the training data.
Implementation Blueprint
This section is for ML engineers and researchers currently managing SFT or DPO pipelines. Based on the paper's findings, you should adjust your training configuration and monitoring to detect the "squeezing effect" before it degrades your model.
Phase 1: Configuration (Before you start)
The most critical decision is choosing your alignment algorithm and hyperparameters. The paper strongly favors on-policy methods to avoid the mathematical pitfalls of fixed rejection datasets.
| Parameter/Choice | Recommendation | Why it matters |
|---|---|---|
| Algorithm | On-Policy DPO (or Online DPO) | Avoids pushing down "impossible" answers, which triggers the squeezing effect. |
| Learning Rate | 0.1x of SFT LR | DPO is unstable. Smaller steps reduce the probability mass displacement per update. |
| Batch Size | Larger is safer | Noisier gradients from small batches can exacerbate the squeezing effect. |
| Refusal Data | Mix across domains | Don't just train on one type of refusal; cross-domain similarity (K-kernel) causes over-generalization. |
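As a hedged starting point, the table above translates into a configuration like the sketch below. The keys mirror common HuggingFace/TRL-style trainer arguments but are illustrative placeholders; map them onto whatever training framework you actually use.

```python
# Illustrative DPO configuration following the table above (placeholder values).
SFT_LEARNING_RATE = 2e-5  # whatever your SFT run actually used

dpo_config = {
    "learning_rate": SFT_LEARNING_RATE * 0.1,  # ~0.1x of the SFT learning rate
    "per_device_train_batch_size": 8,          # prefer larger effective batches...
    "gradient_accumulation_steps": 8,          # ...via accumulation if memory-bound
    "beta": 0.1,                               # DPO temperature on the log-prob margin
    "num_train_epochs": 1,                     # stop early and watch for the DPO cliff
}
```

The refusal-data recommendation is a property of the dataset rather than a trainer argument: mix refusal and rejected examples across domains before training starts.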
Phase 2: Monitoring (During training)
Standard loss curves hide the "squeezing effect." You need to track the model's confidence on a held-out set of correct answers.
The Warning Sign: If your validation loss is flat but the average confidence on correct answers starts dropping, stop training immediately. You have hit the "DPO Cliff."
```python
# Add this hook to your training loop (e.g., as a HuggingFace Trainer callback).
import numpy as np
import torch

def log_metric(name, value):
    # Placeholder: swap in wandb.log or a TensorBoard writer in your pipeline.
    print(f"{name}: {value:.4f}")

def monitor_squeezing_effect(model, eval_dataloader):
    """
    Run this every N steps.
    If 'avg_correct_confidence' drops significantly while the loss is stable,
    the model is undergoing the squeezing effect.
    """
    model.eval()
    correct_confidences = []
    with torch.no_grad():
        for batch in eval_dataloader:
            # We only care about the probability assigned to the CORRECT tokens,
            # not the general perplexity.
            logits = model(batch["input_ids"]).logits  # assumes a HF-style model output
            # For causal LMs the logit at position i predicts token i + 1.
            probs = torch.softmax(logits[:, :-1], dim=-1)
            labels = batch["labels"][:, 1:]
            mask = labels != -100  # skip prompt/padding positions
            # Extract the probability of each target label token.
            target_prob = probs.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
            correct_confidences.append(target_prob[mask].mean().item())
    avg_conf = float(np.mean(correct_confidences))
    # Log to WandB / TensorBoard
    log_metric("avg_correct_confidence", avg_conf)
    return avg_conf
```

Phase 3: Evaluation (Post-training)
After training, you need to check for the "Repeater Phenomenon"—where the model overfits to a specific response structure (like starting every answer with "Sure!").
The Repeater Test: Run the model on 50 diverse, unrelated prompts (coding, cooking, history). Compute the average pairwise cosine similarity of the response embeddings.
```python
# The "Repeater Test": responses to diverse prompts should NOT all look alike.
# `generate_fn` and `get_embeddings` are placeholders for your generation call
# and any sentence-embedding model.
import numpy as np

def detect_repeater_phenomenon(generate_fn, get_embeddings, diverse_prompts):
    """
    If the model answers diverse questions with semantically identical
    structures, the average pairwise similarity will spike.
    """
    responses = [generate_fn(prompt) for prompt in diverse_prompts]
    embeddings = np.asarray(get_embeddings(responses))  # shape: (n_prompts, dim)
    # Cosine similarity = dot product of L2-normalized embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity_matrix = normed @ normed.T
    # Average over off-diagonal pairs only (the diagonal is always 1.0).
    n = len(responses)
    avg_sim = (similarity_matrix.sum() - n) / (n * (n - 1))
    if avg_sim > 0.5:  # threshold depends on the embedding model
        print("WARNING: Model is repeating output structures.")
        print("Recommendation: Reduce DPO epochs or diversify training data.")
    return avg_sim
```

Limitations
Assumes stable NTK. The framework relies on the empirical Neural Tangent Kernel remaining relatively stable during training. This assumption weakens for very long training runs or aggressive learning rates.
Qualitative explanations. While the framework explains why phenomena occur, it doesn't provide precise quantitative predictions for when the squeezing effect will dominate or how much hallucination to expect.
Computational cost. Computing the full K matrix for large datasets is expensive. The practical recommendations work around this, but a full analysis requires significant compute.
Focus on single-step dynamics. The analysis primarily considers how one gradient step affects predictions. Cumulative effects over many steps are discussed but not fully characterized.
Business Implications
ROI & Cost Efficiency
Stop wasting GPU hours. The "DPO Cliff" finding shows that training longer can destroy value: it is not diminishing returns, it is active degradation. By implementing the confidence monitoring suggested in the Blueprint, teams can cut training runs short the moment the "squeezing effect" begins.
- Impact: potentially a 20-30% reduction in compute costs for alignment runs.
- Metric: "Compute-to-Quality Ratio" (stopping when quality degrades).
Competitive Advantage & Risk
The "Hallucination Moat." Most companies simply finetune on Q&A pairs and hope for the best. Understanding "Cross-Question Contamination" allows you to build cleaner datasets that prevent your model from confidently stating wrong facts in production.
- Risk: A chatbot that "hallucinates confidently" is a legal and brand liability.
- Solution: Moving to On-Policy DPO is a quality differentiator. It produces safer, more reliable products than competitors using standard off-policy pipelines.
Strategic Planning
From "More Data" to "Better Dynamics." The prevailing strategy has been "add more data." This paper shows that adding data can hurt if it has high K-similarity to unrelated concepts (causing contamination).
- Strategy: Shift investment from volume labeling to diversity analysis.
- Action: Direct your data teams to map the "semantic coverage" of your dataset rather than just counting rows.
Code: GitHub Repository
Recognition: ICLR 2025 Oral, Outstanding Paper Award
Cite this paper
Yi Ren, Danica J. Sutherland (2025). Learning Dynamics of LLM Finetuning. ICLR 2025.