- Critical Vulnerability: All major dense retrievers (Dragon+, ColBERT, Contriever) drop below 10% accuracy when facing combined biases. They prefer short, repetitive text with query keywords at the beginning, regardless of whether it contains the answer
- Worse Than Nothing: Poisoned documents cause RAG systems to perform worse than providing no context at all (30.8% vs 64.8% accuracy). This is an exploitable attack vector
- The Fix: Don’t rely on dense retrieval alone. Use hybrid approaches (dense + BM25 + re-ranking) and implement adversarial testing before production deployment
Research Overview
Dense retrieval models form the backbone of modern Retrieval-Augmented Generation (RAG) systems, which power everything from enterprise search to AI assistants. These models encode documents as vector embeddings, enabling fast similarity-based retrieval across massive knowledge bases. However, this groundbreaking research from UCLA and LMU Munich reveals that these systems harbor critical vulnerabilities that can undermine their reliability.
RAG (Retrieval-Augmented Generation) is a technique where AI systems like ChatGPT or Claude look up relevant information from a database before answering your question. Instead of relying solely on what the AI learned during training, RAG lets it search through documents, company knowledge bases, or other sources to provide more accurate, up-to-date answers. Think of it like giving the AI access to a search engine before it responds.
Vector embeddings are a way to represent text (words, sentences, or documents) as lists of numbers. For example, the sentence “I love pizza” might become [0.2, -0.5, 0.8, …] with hundreds or thousands of numbers. These numbers capture the meaning of the text, so similar sentences have similar numbers.
Dense retrievers are search systems that convert both your query and all documents into these number lists, then find documents whose numbers are closest to your query’s numbers. This allows for “semantic search,” finding documents that mean similar things even if they don’t share exact words.
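To make this concrete, here is a minimal sketch of dense retrieval using the sentence-transformers library with a small public checkpoint (an illustrative choice, not one of the paper's models; Dragon+, Contriever, and ColBERT are loaded differently but follow the same embed-and-compare pattern).

```python
# Minimal sketch of dense retrieval: embed the query and documents,
# then rank documents by cosine similarity of their embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Where was Marie Curie born?"
documents = [
    "Marie Curie was born in Warsaw in 1867.",
    "The Eiffel Tower is located in Paris, France.",
]

query_vec = model.encode(query)     # one vector for the query
doc_vecs = model.encode(documents)  # one vector per document

# Cosine similarity: closer vectors mean more semantically similar text.
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")
```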
The research team repurposed the Re-DocRED relation extraction dataset to create controlled experiments that isolate and measure specific biases in popular retrievers including Dragon+, Contriever, COCO-DR, and ColBERT. Their findings have significant implications for anyone building or deploying RAG-based AI systems.
Why This Matters for Business
For organizations deploying RAG systems for knowledge management, customer service, or decision support, these findings highlight critical risks:
- Data Poisoning Vulnerability: Malicious actors could craft documents that exploit these biases to surface incorrect information
- Reliability Concerns: RAG systems may consistently fail to retrieve the most relevant information
- Quality Assurance Challenges: Traditional evaluation metrics may not capture these failure modes
The Five Critical Biases
The researchers identified and quantified five distinct biases that affect dense retrieval models:
[Chart: Retriever Bias Analysis by Model. Paired t-test statistics comparing bias impact (higher = stronger bias).]
1. Brevity Bias (Strongest Impact)
Dense retrievers systematically prefer shorter documents over longer ones, even when the longer document contains the answer. This occurs because pooling mechanisms struggle to maintain signal strength when relevant content is diluted by additional text.
When converting a document to a single embedding (list of numbers), the system must somehow combine all the words. Mean pooling averages all word embeddings together. CLS token pooling uses a special summary token. The problem: when you have a 10-word answer buried in a 1000-word document, averaging dilutes that answer’s signal. Think of it like finding one drop of red dye in a swimming pool.
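A toy numerical sketch of this dilution effect, using random vectors as stand-ins for token embeddings rather than any real model's outputs:

```python
# Toy illustration of brevity bias from mean pooling: the same answer token
# contributes less to the document vector as unrelated tokens are added.
import numpy as np

rng = np.random.default_rng(0)
dim = 64

answer_vec = rng.normal(size=dim)    # token that matches the query
filler = rng.normal(size=(999, dim)) # unrelated context tokens
query_vec = answer_vec               # query aligned with the answer token

def mean_pooled_sim(token_vecs):
    doc_vec = token_vecs.mean(axis=0)  # mean pooling over tokens
    return float(doc_vec @ query_vec /
                 (np.linalg.norm(doc_vec) * np.linalg.norm(query_vec)))

short_doc = np.vstack([answer_vec, filler[:9]])  # 10 tokens
long_doc = np.vstack([answer_vec, filler])       # 1000 tokens

print(f"short document similarity: {mean_pooled_sim(short_doc):.3f}")
print(f"long document similarity:  {mean_pooled_sim(long_doc):.3f}")
# Both documents contain the answer token, but the long one scores far lower.
```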
Paired t-test statistic: 9.46 to 20.51 across models
A t-test measures whether a difference between two groups is real or just random chance. Higher numbers mean stronger, more reliable effects. A t-statistic above 2 is typically significant; above 10 is very strong. When you see “t = 20” here, it means the bias is extremely consistent and not a fluke.
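For illustration, a paired t-test computed with SciPy on invented score lists (the paper's numbers come from 250 real query-document pairs per bias):

```python
# Each query is scored against two paired versions of the same document
# (e.g., short vs. long); the score lists below are made-up numbers.
from scipy import stats

scores_short = [0.82, 0.75, 0.91, 0.68, 0.88, 0.79, 0.85, 0.73]
scores_long  = [0.61, 0.58, 0.70, 0.52, 0.66, 0.60, 0.64, 0.55]

t_stat, p_value = stats.ttest_rel(scores_short, scores_long)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A large positive t with a tiny p means the retriever consistently scores
# the short version higher than its paired long version.
```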
“The additional context in longer documents dilutes the importance of the evidence, causing retrievers to favor concise input.”
2. Literal Bias
Models strongly favor exact string matches over semantically equivalent variations. For example, “NYC” vs “New York City” or “US” vs “United States” are not treated as equivalent, despite representing the same entities.
Paired t-test statistic: 13.31 to 17.18 across models
This bias is particularly problematic for:
- Multi-lingual applications
- Documents with varying naming conventions
- Historical documents with different terminology
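One mitigation sketch, assuming a hand-maintained alias table (illustrative entries only): normalize entity mentions to a single canonical surface form before indexing, so "NYC" and "New York City" compete on equal footing.

```python
# Mitigation sketch for literal bias: map known aliases to one canonical
# surface form before documents are embedded and indexed.
import re

ALIASES = {
    "NYC": "New York City",
    "US": "United States",
}

def normalize_entities(text: str) -> str:
    for alias, canonical in ALIASES.items():
        # Whole-word replacement so "US" does not match inside "USB".
        text = re.sub(rf"\b{re.escape(alias)}\b", canonical, text)
    return text

print(normalize_entities("The firm moved its HQ from NYC to the US Midwest."))
# -> "The firm moved its HQ from New York City to the United States Midwest."
```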
3. Position Bias
Information appearing early in documents receives disproportionate attention compared to content appearing later, regardless of relevance. This bias emerges during contrastive pre-training and worsens through fine-tuning.
This is how most embedding models learn. The system is shown pairs of text that should be similar (like a question and its answer) and pairs that should be different. It learns to make similar things have similar numbers and different things have different numbers. The problem: most training data has key information at the beginning, so the model learns to pay more attention there.
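For readers who want the mechanics, here is a minimal sketch of an InfoNCE-style contrastive step with in-batch negatives; the embeddings are random placeholders, not the paper's training setup.

```python
# Minimal sketch of the contrastive objective behind most dense retrievers:
# pull each query toward its positive document and away from the other
# documents in the batch.
import torch
import torch.nn.functional as F

batch_size, dim, temperature = 32, 768, 0.05

# In real training these come from a query encoder and a document encoder.
query_embs = F.normalize(torch.randn(batch_size, dim), dim=-1)
doc_embs = F.normalize(torch.randn(batch_size, dim), dim=-1)

# Similarity of every query to every document in the batch; the matching
# ("positive") document for query i sits at index i on the diagonal.
logits = (query_embs @ doc_embs.T) / temperature
labels = torch.arange(batch_size)

loss = F.cross_entropy(logits, labels)
print(loss.item())
# If positives in the training data tend to state the key fact in their
# opening sentence, the encoders learn to over-weight early tokens.
```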
[Chart: Position Bias (Evidence Location Impact). Paired t-test statistics as evidence moves from position 1 to 10 (negative = worse).]
The chart above shows how retrieval scores decline as the evidence sentence moves from position 1 to position 10 within a document. All models exhibit severe degradation, with some showing t-statistics below -20 for late-positioned evidence.
4. Repetition Bias
Documents with repeated mentions of query entities receive higher scores, even when the repetition adds no informational value. This creates a vulnerability where verbose, repetitive documents can outrank concise, informative ones.
[Heatmap: Repetition vs Brevity, Retrieval Scores. Contriever MSMARCO scores; lighter = higher (more preferred).]
The heatmap reveals a complex interplay between document length and entity repetition. While longer documents are penalized (brevity bias), more entity repetitions increase scores. The optimal strategy for gaming these retrievers would be short documents with repeated entity mentions.
5. Answer Importance (Weakest Signal)
Perhaps most concerning: the actual presence of the answer has relatively weak influence on retrieval scores compared to the superficial biases above. Models are not effectively learning to identify documents that contain query answers.
Paired t-test statistic: -5.92 to 12.69 across models
The unsupervised Contriever model actually shows negative values, indicating it sometimes prefers documents without answers over those containing them.
Experimental Results
Combined Bias Catastrophe
When the researchers combined multiple biases into a single adversarial test, the results were catastrophic. They created “foil documents” that exploit biases but lack the answer, comparing retrieval scores against documents containing the actual evidence.
A foil document is a decoy. It’s crafted to trigger all the retriever’s biases (short length, repeated keywords, query terms at the beginning) but doesn’t actually contain the answer. It’s like a multiple-choice test where option B looks correct because it uses fancy vocabulary, but option C is actually right.
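A sketch of how such a foil could be assembled; the helper, entities, and template are invented for illustration and are not the paper's exact construction procedure.

```python
# Assemble a bias-exploiting foil from a query's entity and phrasing.
def build_foil(entity: str, query_phrase: str, repetitions: int = 3) -> str:
    # Lead with the query entity (position and literal bias), repeat it
    # (repetition bias), and keep the whole document short (brevity bias).
    return " ".join(f"{entity} {query_phrase}." for _ in range(repetitions))

evidence_doc = (
    "Founded in the nineteenth century, Acme Corp expanded steadily before "
    "finally settling its headquarters in Springfield."
)
foil_doc = build_foil("Acme Corp", "is a company known for its headquarters")

print(foil_doc)
# A biased retriever can score this short, repetitive, answer-free foil above
# the evidence document, which actually names the headquarters location.
```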
[Chart: Model Accuracy, Foil vs Evidence Documents. Percentage of cases where the model prefers the evidence-containing document over the biased foil.]
Key Finding: All models perform below 10% accuracy when facing combined biases. Even advanced models like ColBERT (7.6%) and ReasonIR-8B (8.0%) fail dramatically.
Model Performance Summary
| Model | Accuracy | t-Statistic | p-value |
|---|---|---|---|
| ReasonIR-8B | 8.0% | -36.92 | < 0.01 |
| ColBERT (v2) | 7.6% | -20.96 | < 0.01 |
| COCO-DR Base | 2.4% | -32.92 | < 0.01 |
| Dragon+ | 1.2% | -40.94 | < 0.01 |
| Dragon RoBERTa | 0.8% | -36.53 | < 0.01 |
| Contriever MSMARCO | 0.8% | -42.25 | < 0.01 |
| RetroMAE | 0.4% | -41.49 | < 0.01 |
| Contriever | 0.4% | -34.58 | < 0.01 |
The highly negative t-statistics indicate consistent, strong preference for biased documents over evidence-containing ones.
Impact on RAG Systems
The downstream effects on Retrieval-Augmented Generation are severe. When retrievers feed biased or poisoned documents to LLMs, the entire system’s reliability degrades.
[Chart: RAG Performance Impact. LLM accuracy when retrievers provide different document types.]
Critical Findings
- Evidence Documents: When RAG systems receive correct evidence documents, GPT-4o achieves 93.6% accuracy
- No Document Baseline: Without any retrieved context, accuracy drops to 64.8%
- Foil Documents: When retrievers prefer biased documents lacking answers, accuracy falls to 62.8%
- Poisoned Documents: Most concerning—poisoned documents cause accuracy to drop to 30.8%, worse than providing no context at all
This demonstrates that retriever biases don’t just reduce performance—they can actively harm RAG systems by injecting misleading information.
The Poisoning Attack Vector
Data poisoning is when someone intentionally adds bad data to trick an AI system. In this context, attackers craft documents designed to exploit the biases above. The AI retrieves these “poisoned” documents instead of correct ones, causing wrong answers. This is a real security threat for any public-facing RAG system.
The researchers constructed poisoned documents by:
- Creating a foil document exploiting multiple biases
- Adding a sentence with a plausible but incorrect answer
- Testing whether retrievers prefer this poisoned document
Result: Retrievers preferred poisoned documents over evidence-containing ones in 100% of test cases.
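Building on the foil sketch above, a minimal illustration of the poisoning recipe: append a plausible but incorrect answer sentence to a bias-exploiting foil. The entities and the wrong answer ("Shelbyville") are invented.

```python
# Turn a foil into a poisoned document by appending an incorrect answer.
foil = "Acme Corp is a company known for its headquarters. " * 3

def poison(foil_doc: str, entity: str, wrong_answer: str) -> str:
    # The incorrect claim rides on a document the retriever already prefers.
    return f"{foil_doc}{entity} has its headquarters in {wrong_answer}."

print(poison(foil, "Acme Corp", "Shelbyville"))
```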
Implications for AI Development
For RAG System Builders
BM25 is a traditional keyword-matching algorithm used in search engines since the 1990s. Unlike dense retrievers, it simply counts how often query words appear in documents (with clever adjustments for document length and word rarity). Despite its simplicity, BM25 often outperforms neural approaches for exact keyword matching and doesn’t suffer from the same biases.
- Diversify Retrieval: Don’t rely solely on dense retrievers; consider hybrid approaches combining dense, sparse (BM25), and re-ranking methods
Re-ranking is a second pass over search results. First, a fast method (like embeddings) retrieves the top 100 candidates. Then, a slower but more accurate model (often a cross-encoder or LLM) re-scores just those 100 to pick the best 10. This two-stage approach combines speed with accuracy.
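A sketch of what the hybrid portion could look like, assuming the rank-bm25 package for sparse scores and reciprocal rank fusion (RRF) to merge the rankings; the dense scores are hard-coded placeholders (they would come from an embedding model), and the cross-encoder re-ranking pass over the fused top-k is omitted.

```python
# Hybrid retrieval sketch: BM25 (sparse) + dense scores merged with RRF.
from rank_bm25 import BM25Okapi

documents = [
    "Acme Corp is a company known for its headquarters. "
    "Acme Corp is a company known for its headquarters.",  # foil-like
    "Founded long ago, Acme Corp finally settled its headquarters "
    "in Springfield.",                                      # evidence
]
query = "Where is the headquarters of Acme Corp?"

bm25 = BM25Okapi([doc.lower().split() for doc in documents])
sparse_scores = bm25.get_scores(query.lower().split())
dense_scores = [0.91, 0.74]  # placeholder: the foil wins the dense ranking

def rrf(rankings, k=60):
    # Each ranking contributes 1 / (k + rank) for every document it contains.
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

dense_rank = sorted(range(len(documents)), key=lambda i: -dense_scores[i])
sparse_rank = sorted(range(len(documents)), key=lambda i: -sparse_scores[i])
print(rrf([dense_rank, sparse_rank]))  # fused document order (by index)
```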
- Implement Robustness Testing: Evaluate retrievers against adversarial examples, not just standard benchmarks like BEIR
Adversarial testing means deliberately trying to break your AI system before attackers do. Instead of only testing with normal inputs, you create worst-case scenarios: documents designed to trick the system, edge cases, and malicious inputs. If your RAG system only passes “happy path” tests, it’s vulnerable in production.
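A minimal sketch of such a test. The `score(query, doc)` function is a hypothetical wrapper around whatever retriever is in production, and the example texts mirror the foil construction described earlier.

```python
# Adversarial retrieval test: the evidence document should outrank the foil.
def score(query: str, doc: str) -> float:
    raise NotImplementedError("wrap your retriever's query-document scoring here")

def test_evidence_beats_foil():
    query = "Where is the headquarters of Acme Corp?"
    evidence = ("Founded long ago, Acme Corp finally settled its headquarters "
                "in Springfield.")
    foil = "Acme Corp headquarters. " * 4  # short, repetitive, answer-free

    # The retriever should prefer the document that actually answers the query.
    assert score(query, evidence) > score(query, foil)
```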
- Consider Document Preprocessing: Normalize entity representations, balance document lengths, and structure documents with important information distributed throughout
- Add Verification Layers: Implement secondary checks to verify retrieved documents actually contain relevant answers (a minimal sketch follows this list)
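A minimal sketch of such a check, using a crude keyword-and-content heuristic as a stand-in; an NLI model or an LLM judge would be a stronger verifier, but the two-stage pattern is the same.

```python
# Verification layer sketch: require the query entities to appear alongside
# enough non-entity content that short, keyword-stuffed foils are rejected.
def looks_answer_bearing(query_entities: list[str], doc: str) -> bool:
    doc_lower = doc.lower()
    entities_present = all(e.lower() in doc_lower for e in query_entities)
    non_entity_tokens = {
        tok for tok in doc_lower.split()
        if all(tok not in e.lower() for e in query_entities)
    }
    return entities_present and len(non_entity_tokens) >= 5

print(looks_answer_bearing(
    ["Acme Corp"],
    "Founded long ago, Acme Corp finally settled its headquarters in Springfield.",
))  # True
print(looks_answer_bearing(["Acme Corp"], "Acme Corp headquarters. " * 4))  # False
```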
For AI Researchers
- Training Data Awareness: MS MARCO fine-tuning may inadvertently amplify position biases present in the dataset
MS MARCO (Microsoft Machine Reading Comprehension) is a massive dataset of real user questions from Bing search. It’s the most popular dataset for training retrieval models. However, because it comes from real search behavior, it contains biases. People tend to put important information at the beginning of web pages, and short answers tend to be favored.
- Architecture Innovation: Current pooling mechanisms (CLS, mean) are fundamental sources of brevity and position bias
- Evaluation Methodology: Standard retrieval benchmarks miss these critical failure modes; new adversarial benchmarks are needed
For Enterprise AI Adoption
- Security Considerations: These biases create attack surfaces for corpus poisoning and adversarial manipulation
- Quality Assurance: Production RAG systems need monitoring for retrieval quality degradation
- Vendor Evaluation: When selecting retrieval solutions, evaluate robustness to these documented biases
Business Implications
This paper has critical ramifications for organizations deploying RAG and search systems.
For RAG Product Teams
Reliability Risk Assessment: If your RAG system uses dense retrieval as its sole retrieval method, you have a documented vulnerability. Under 10% accuracy on adversarial cases means your system is exploitable.
Defense in Depth: Implement hybrid retrieval (dense + sparse + re-ranking) before attacks happen, not after. The cost of prevention is lower than the cost of incident response.
Monitoring Requirements: Standard accuracy metrics won’t detect these failures. Build adversarial test suites that specifically probe for brevity, position, and literal biases.
For Enterprise AI Teams
Vendor Evaluation: When evaluating RAG vendors, ask about their retrieval architecture. Single-method dense retrieval is a red flag. Request adversarial testing results.
Security Posture: These biases create attack surfaces. Malicious actors can craft “poisoned” documents that outrank legitimate content. For high-stakes applications (legal, medical, financial), this is a material security concern.
Quality Assurance: Establish retrieval quality gates that include adversarial scenarios. Standard benchmarks like BEIR won’t reveal these vulnerabilities.
For AI Practitioners Building Systems
Hybrid Architecture as Default: Don’t treat hybrid retrieval as a fallback or optimization. Treat it as a baseline requirement for production systems.
Document Preprocessing: Normalize entity representations, structure documents to distribute key information throughout (not just at the beginning), and consider document length standardization.
Verification Layers: Add secondary checks that verify retrieved documents actually contain answers to the query. Don’t trust retrieval scores alone.
For Model Providers and Researchers
Training Data Bias: MS MARCO fine-tuning amplifies position biases present in the dataset. Future training approaches should address this.
Pooling Architecture: CLS and mean pooling are fundamental sources of these biases. Research into alternative aggregation mechanisms is needed.
Evaluation Methodology: Release adversarial benchmarks alongside standard ones. ColDeR (this paper’s dataset) is available on HuggingFace.
For Executives and Decision Makers
Strategic Risk: RAG is increasingly central to enterprise AI strategy. Documented, exploitable vulnerabilities create business risk. Ensure technical teams have budgets to implement defense-in-depth retrieval architectures.
Competitive Consideration: Companies that invest in robust retrieval will have more reliable AI products. This is a potential differentiator in the market.
Methodology Notes
The researchers used the Re-DocRED dataset, which provides:
- Document-level relation extraction annotations
- Evidence sentences containing head and tail entities
- Multiple relation types with query templates
For each bias type, they constructed 250 query-document pairs and used paired t-tests to measure statistical significance. This controlled experimental design allows direct comparison of bias impacts across models.
Strengths
- Rigorous statistical methodology with p-values and confidence intervals
- Diverse model evaluation including both CLS and average pooling approaches
- End-to-end RAG impact assessment with GPT-4o
Limitations
- Relies on Re-DocRED annotations, which may contain imperfections
- RAG evaluation uses GPT-4o for both generation and assessment
- Focused on English-language retrieval
Conclusion
This research represents a crucial contribution to understanding dense retrieval vulnerabilities. The finding that all tested models perform below 10% accuracy when facing combined biases should prompt serious reconsideration of how we build and deploy RAG systems.
For practitioners, the immediate takeaway is clear: don’t trust dense retrievers as the sole component of production RAG systems. Implement hybrid retrieval, add verification layers, and monitor for adversarial exploitation.
For researchers, this work opens important directions in bias-robust retrieval model development and adversarial evaluation methodology.
Cite this paper
Mohsen Fayyaz, Ali Modarressi, Hinrich Schütze, Nanyun Peng (2025). Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence. ACL 2025.
Available at: https://aclanthology.org/2025.acl-long.447.pdf