Research Overview
As embedding-based retrieval becomes the backbone of RAG systems and semantic search, this Google DeepMind research delivers a sobering message: vector embeddings have fundamental mathematical limits that no amount of training data or model scaling can overcome.
Vector embeddings convert text into lists of numbers (like [0.2, -0.5, 0.8, ...]). Each number represents some aspect of the text's meaning. When two texts have similar meanings, their number lists are mathematically "close" to each other. This lets computers find similar content by comparing these numbers instead of matching exact words.
The research team proves theoretically that the number of unique document combinations an embedding model can return is bounded by its dimension. They then validate this with the LIMIT benchmark, where models like GritLM and E5-Mistral—despite 4096 dimensions—achieve only 8-19% recall while simple lexical matching (BM25) reaches 85-94%.
BM25 is a classic keyword-matching algorithm from the 1990s. It scores documents based on how often query words appear, with adjustments for document length and word rarity. While it can't understand synonyms or meaning like embeddings can, it's often surprisingly effective. As this research shows, it doesn't suffer from the same dimensional limits.
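For readers who want the mechanics, here is a minimal, self-contained sketch of the standard Okapi BM25 scoring function (the function and variable names are illustrative, k1 and b use common default values, and this is not the exact implementation evaluated in the paper):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25: rewards term frequency, discounts long documents,
    and up-weights rare terms via inverse document frequency (IDF)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # how many docs contain the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rarity weight (Lucene-style, never negative)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

# Tiny example over a tokenized toy corpus.
corpus = [["flat", "tire", "repair"], ["change", "a", "punctured", "wheel"], ["bicycle", "chain", "care"]]
print(bm25_score(["flat", "tire"], corpus[0], corpus))
```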
Why This Matters
Traditional search matches exact keywords. Semantic search understands meaning. If you search for "how to fix a flat tire," semantic search also returns results about "changing a punctured wheel" because it understands they mean the same thing. Most modern AI search uses vector embeddings to achieve this.
For anyone building RAG systems, semantic search, or recommendation engines:
- Scaling embeddings won't solve everything - There are hard mathematical limits
- Simple baselines may outperform - BM25 beat all embedding models on LIMIT
- Architecture choices matter - Multi-vector and cross-encoders bypass these limits
- Evaluation benchmarks may be misleading - Standard datasets don't stress-test these constraints
Theoretical Foundation
The authors connect embedding retrieval to established mathematical concepts from learning theory and communication complexity. The key insight: retrieval is fundamentally a low-rank matrix factorization problem, and embedding dimension directly constrains how many distinct document combinations can be precisely ranked.
Imagine a spreadsheet where rows are queries and columns are documents, with each cell showing a relevance score. To use embeddings, you need to express this entire spreadsheet using just a few numbers per row and column. The "rank" is basically how complex the patterns in the spreadsheet can be. Low rank means limited expressiveness. This is why embedding dimension matters: it directly limits how many unique patterns you can capture.
This isn't a semantic understanding problem—it's purely combinatorial geometry.
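A few lines of numpy make this concrete (an illustrative toy, not an experiment from the paper): every score a d-dimensional embedding model can produce is an entry of a matrix with rank at most d.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary relevance matrix: rows = queries, columns = documents.
A = rng.integers(0, 2, size=(8, 12))

# With d-dimensional embeddings, every retrievable score is an entry of Q @ D.T,
# where Q is (num_queries x d) and D is (num_docs x d).
d = 3
Q = rng.normal(size=(8, d))
D = rng.normal(size=(12, d))
S = Q @ D.T

# The score matrix has rank at most d no matter how the embeddings were trained;
# reproducing a higher-rank relevance pattern exactly is impossible.
print(np.linalg.matrix_rank(S))  # at most 3
print(np.linalg.matrix_rank(A))  # typically larger than 3 for a random 8x12 binary matrix
```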
The Core Theorem
For any retrieval task with a binary relevance matrix (query × document), the minimum embedding dimension required equals the sign-rank of that matrix.
Binary relevance matrix: A simple yes/no grid. For each query-document pair, either the document is relevant (1) or not (0). No "somewhat relevant" scores.
Sign-rank: The smallest rank of any real matrix whose entry signs match the relevance matrix (coded as ±1). Equivalently, it is the minimum embedding dimension needed to correctly separate relevant from non-relevant documents for every possible query. Some simple-looking matrices require enormous dimensions to represent this way.
In simpler terms:
"The number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding."
This isn't a limitation of current models—it's a fundamental constraint on what single-vector representations can express.
Mathematical Formulation
The researchers define two key concepts:
- Row-wise Order-Preserving Rank: The minimum dimension needed to preserve the relative ordering of document scores for all queries
- Row-wise Thresholdable Rank: The minimum dimension needed to separate relevant from non-relevant documents via a threshold
They prove these are equivalent for binary relevance, connecting to the well-studied sign-rank problem where even simple matrices can require exponentially high dimensions.
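One way to write the shape of these two definitions, following the informal descriptions above (the notation is mine and a sketch only; the paper's exact formulation may differ in details):

```latex
% Relevance matrix A in {0,1}^(m x n); query embeddings U in R^(m x d);
% document embeddings V in R^(n x d); score matrix S = U V^T.

\mathrm{rank}_{\text{threshold}}(A) = \min\bigl\{ d :
  \exists\, U, V, \tau \ \text{s.t.}\ (UV^{\top})_{ij} > \tau_i \iff A_{ij} = 1 \bigr\}

\mathrm{rank}_{\text{order}}(A) = \min\bigl\{ d :
  \exists\, U, V \ \text{s.t.}\ A_{ij} > A_{ik} \Rightarrow (UV^{\top})_{ij} > (UV^{\top})_{ik} \bigr\}
```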
Practical Bounds
The dimension-to-capacity relationship follows a cubic polynomial:
Critical-n = −10.53 + 4.03d + 0.052d² + 0.0037d³
Where d is the embedding dimension. This yields approximate limits:
| Dimension | Max Documents (Top-2) |
|---|---|
| 512 | ~500,000 |
| 768 | ~1,700,000 |
| 1024 | ~4,000,000 |
| 4096 | ~250,000,000 |
Beyond these thresholds, a single-vector system can no longer represent every top-2 combination of documents, so some multi-faceted queries are necessarily mis-ranked.
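The table above follows directly from this fitted curve; a few lines of Python reproduce it (using the rounded coefficients quoted above, so the outputs are approximate):

```python
def critical_n(d: float) -> float:
    """Fitted curve from the paper's free-embedding experiments: an extrapolated
    estimate of the largest corpus size whose top-2 combinations remain
    representable at embedding dimension d (not an exact closed-form law)."""
    return -10.53 + 4.03 * d + 0.052 * d**2 + 0.0037 * d**3

for d in (512, 768, 1024, 4096):
    print(f"d={d:5d}  critical-n ≈ {critical_n(d):,.0f}")
```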
Figure: Embedding dimension vs document capacity. Maximum unique document combinations for top-2 retrieval (cubic polynomial relationship).
The LIMIT Benchmark
To empirically validate these theoretical limits, the authors created LIMIT (Limitations of Instruction-following Models in Information-Retrieval Tasks).
Dataset Design
- 50,000 documents: Person profiles with 50 attributes each
- 1,000 queries: "Who likes [attribute]?" format
- 2 relevant docs per query, drawn from all pairs of 46 core documents: Designed to stress-test combination limits
- 1,850 unique attributes: Generated via LLM to ensure natural language diversity
The key insight: by controlling the query-relevance density (how many documents are relevant per query), they can directly test theoretical capacity limits.
Why Standard Benchmarks Miss This
Standard retrieval benchmarks have very low qrel density:
Qrel (query relevance) refers to how many documents are relevant per query. Density is the fraction of all documents that are relevant. Most benchmarks have very sparse relevance, maybe 1-5 documents out of millions are relevant per query. This makes the retrieval task "easy" because models just need to find a needle in a haystack, not distinguish between many similar options.
| Dataset | Graph Density | Avg Query Strength |
|---|---|---|
| BEIR | 0.000-0.025 | 0.00-0.59 |
| HotpotQA | ~0.001 | ~0.10 |
| LIMIT | 0.085 | 28.47 |
LIMIT's density is 3-85x higher than standard benchmarks, exposing limitations that typical evaluations never reveal.
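As a rough illustration, one simple way to compute qrel density from a qrels mapping is shown below (identifiers are illustrative; the paper's "graph density" statistic may be defined somewhat differently, e.g. restricted to documents that have at least one qrel):

```python
def qrel_density(qrels: dict[str, set[str]], num_docs: int) -> float:
    """Fraction of all (query, document) pairs marked relevant.
    qrels maps each query id to the set of its relevant document ids."""
    num_relevant_pairs = sum(len(docs) for docs in qrels.values())
    return num_relevant_pairs / (len(qrels) * num_docs)

# Toy example: 3 queries over a 10-document corpus -> 6 relevant pairs / 30 pairs = 0.2.
qrels = {"q1": {"d1", "d2"}, "q2": {"d2"}, "q3": {"d5", "d7", "d9"}}
print(qrel_density(qrels, num_docs=10))
```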
Experimental Results
Model Performance on LIMIT
Figure: Model performance on the LIMIT benchmark. BM25 achieves 85-94% recall@100 while 4096-dimensional embedding models reach only 8-19%.
Recall measures what percentage of correct documents your search actually finds. Recall@100 means: "Of all the correct answers, how many appear in the top 100 results?" If recall is 85%, you're finding 85% of correct documents. If it's 8%, you're missing 92% of the right answers.
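For reference, a minimal sketch of the metric (identifiers are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 100) -> float:
    """Share of the relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Two relevant documents, only one of them retrieved -> recall@100 = 0.5.
print(recall_at_k(["d7", "d2", "d9"], relevant={"d2", "d4"}, k=100))
```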
The results are striking. Despite task simplicity (matching attributes), embedding models collapse:
| Model | Dimensions | Recall@100 |
|---|---|---|
| BM25 | High (sparse) | 85.7-93.6% |
| GTE-ModernColBERT | Multi-vector | 54.8% |
| Promptriever | 4096 | 18.9% |
| GritLM | 4096 | 12.9% |
| E5-Mistral | 4096 | 8.3% |
| Qwen3 Embed | 4096 | 4.8% |
Dimension Impact
Figure: Recall@100 on the LIMIT benchmark by embedding dimension. Single-vector embeddings collapse when qrel density is high.
Larger embeddings significantly outperform smaller ones, but even 4096-dimensional models hit hard limits when qrel patterns are dense.
Ruling Out Domain Shift
A critical question: Are models failing because LIMIT is "out of distribution"?
When AI models are trained on one type of data (e.g., Wikipedia) but tested on different data (e.g., medical records), they often perform poorly. This is called "domain shift" or being "out of distribution." The skeptic's objection here: maybe LIMIT is just too different from what embedding models were trained on?
The researchers tested this directly:
| Condition | Recall@10 Improvement |
|---|---|
| Train on LIMIT (in-domain) | 0-2.8% |
| Overfit to test set | Near-perfect |
Conclusion: Models can learn the task when directly optimized for it, but generalizable embeddings cannot represent the required combinations. This is a capacity problem, not a distribution problem.
Qrel Pattern Effects
The density of relevant documents per query dramatically affects performance:
- Sparse qrels (few relevant docs): Models perform reasonably
- Dense qrels (many relevant docs): Performance collapses
GritLM showed a 50 absolute point drop in recall@100 when moving from sparse to dense patterns. E5-Mistral experienced 10x performance reduction.
Real-World Pain Points
While LIMIT uses synthetic data, these limitations manifest in practical applications every day:
RAG Systems
RAG (Retrieval-Augmented Generation) gives AI chatbots access to external knowledge. Before answering your question, the AI searches a database for relevant documents, then uses that information in its response. This lets AI provide accurate, up-to-date information instead of relying only on its training data.
Consider a query like "Compare the fiscal policies of FDR and Reagan". A single-vector embedding must compress this multi-faceted intent into one point in space. The result? The embedding averages the intent rather than retrieving distinct relevant documents for both presidents simultaneously.
This explains why RAG systems often return topically related but not precisely relevant documents—the embedding literally cannot represent "I need documents about X AND documents about Y" in a single vector.
E-commerce Search
A query like "Blue trail-running shoes, size 10, under $100" reduces to one point in embedding space. The retrieval system produces results matching some criteria rather than all, because the single vector cannot encode the conjunctive logic required.
Multi-hop Reasoning
Questions requiring evidence from multiple sources face compounded failures. Each "hop" requires retrieving specific document combinations, but the embedding bottleneck prevents the system from maintaining the necessary distinctions.
The Averaging Problem
At its core, single-vector embeddings perform a lossy compression. When a query has multiple distinct information needs, the embedding becomes an average of those needs—and averaging rarely produces optimal results for any individual need.
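A deliberately stylized numpy example of the averaging problem, using a made-up two-axis "embedding space" with one axis per sub-intent:

```python
import numpy as np

doc_fdr     = np.array([1.0, 0.0])   # strongly about FDR's fiscal policy
doc_reagan  = np.array([0.0, 1.0])   # strongly about Reagan's fiscal policy
doc_generic = np.array([0.6, 0.6])   # vaguely about both, precise about neither

# A single query vector for "compare FDR and Reagan" lands near the average of the two intents.
query = (doc_fdr + doc_reagan) / 2

for name, doc in [("FDR doc", doc_fdr), ("Reagan doc", doc_reagan), ("generic doc", doc_generic)]:
    print(name, float(query @ doc))

# The vague document scores highest (0.6 vs 0.5), even though the ideal answer
# is the pair {FDR doc, Reagan doc} -- the query vector cannot ask for both at once.
```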
Alternative Architectures
Not all retrieval architectures suffer these limitations equally:
Single-vector: One list of numbers per document, one per query. Fast but limited expressiveness. Like summarizing a book in one sentence.
Multi-vector: Multiple lists of numbers per document (often one per token/word). More expressive but requires more storage and computation. Like keeping several key sentences from a book.
Cross-encoder: Feeds both query AND document into the model together, getting a direct relevance score. Most accurate but slowest because it must run the model for every query-document pair. Like having a human read both the question and each document.
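To make the contrast concrete, here is a minimal numpy sketch of how the first two architectures score a query-document pair (the function names are mine; MaxSim is the ColBERT-style scoring rule). Cross-encoders have no precomputed document vectors at all, so they appear only as a comment:

```python
import numpy as np

def score_single_vector(q: np.ndarray, d: np.ndarray) -> float:
    """Single-vector: one dot product between a query vector and a document vector."""
    return float(q @ d)

def score_multi_vector_maxsim(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Multi-vector (ColBERT-style MaxSim): each query token vector takes its best
    match among the document token vectors; the per-token maxima are summed."""
    sim = q_tokens @ d_tokens.T            # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Cross-encoder: score = model(query_text, document_text) -- one full forward
# pass per query-document pair, with nothing precomputable per document.
```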
Cross-Encoders
Gemini-2.5-Pro (as a cross-encoder) achieved 100% accuracy on LIMIT-small. Cross-encoders score each query-document pair independently, avoiding the representation bottleneck.
Trade-off: Computational cost scales linearly with corpus size, making them impractical for large-scale first-stage retrieval.
Multi-Vector Models
ColBERT-style models that maintain multiple vectors per document significantly outperform single-vector approaches:
- GTE-ModernColBERT: 54.8% recall@100 (vs. 8-19% for single-vector)
Why it works: Multiple vectors can capture different aspects of a document, effectively increasing representational capacity.
Sparse Retrieval
BM25's strong performance (85-94%) suggests that high-dimensional sparse representations sidestep the embedding dimension bottleneck entirely.
Implication: Hybrid retrieval (sparse + dense) may be essential for robust systems.
Implications for RAG Systems
The L1/L2 Cache Mental Model
Think of retrieval architectures like CPU caches:
- Single-vector embeddings = L1 cache: Fast semantic filtering that handles simple similarity well but has strict capacity limits
- Multi-vector/sparse methods = L2 cache: Higher capacity for precise combinatorial logic, worth the extra latency
The key insight: stop blindly scaling embedding dimensions. Hybrid search isn't a fallback—it's the correct architecture for compositional queries.
1. Don't Trust Embeddings Blindly
For instruction-following queries that require specific document combinations, embedding retrieval may systematically fail. Consider:
- Adding BM25 or other sparse methods as primary retrieval (not just fallback)
- Using multi-stage retrieval with re-ranking
- Monitoring retrieval quality on representative queries
2. Benchmark Selection Matters
Standard benchmarks with sparse qrels may give false confidence. When evaluating retrieval systems:
- Test on queries requiring multiple relevant documents
- Measure performance across varying qrel densities
- Consider creating domain-specific stress tests
3. Architecture Decisions Have Fundamental Impact
The choice between single-vector, multi-vector, and cross-encoder isn't just about accuracy-speed tradeoffs—it determines what's theoretically possible.
| Architecture | Capacity | Speed | Use Case |
|---|---|---|---|
| Single-vector | Limited | Fast | Simple similarity, initial filtering |
| Multi-vector | Higher | Medium | Nuanced re-ranking, complex queries |
| Sparse (SPLADE, BM25) | High | Fast | Lexical-semantic signals, conjunctive queries |
| Cross-encoder | Unlimited | Slow | Final re-ranking, small corpora |
| Hybrid (Recommended) | Combined | Variable | Production RAG |
Hybrid retrieval combines multiple methods. Typically, fast embedding search handles initial filtering, followed by BM25 keyword matching and/or re-ranking with more expensive models. The idea: use each method where it excels. Embeddings catch semantic similarity; BM25 catches exact matches; re-rankers do final precision scoring. Most production RAG systems should use some form of hybrid approach.
The recommended approach: composable architectures that use single-vector embeddings for initial semantic filtering, then layer multi-vector or sparse methods for precise combinatorial retrieval.
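One lightweight way to combine the stages is reciprocal rank fusion (RRF), which merges ranked lists without requiring their scores to be comparable; a minimal sketch (the function name and document ids are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists (e.g. dense retrieval + BM25) by summing
    1 / (k + rank) for each document across the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["d3", "d8", "d1", "d5"]   # hypothetical embedding-search ranking
bm25_hits  = ["d8", "d2", "d3", "d9"]   # hypothetical keyword ranking
print(reciprocal_rank_fusion([dense_hits, bm25_hits])[:3])  # fused top-3, fed to a re-ranker
```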
4. Scaling Laws Don't Apply Here
In AI, "scaling laws" refer to the observation that bigger models with more training data generally perform better. This has driven the race to build larger language models. However, this research shows that embedding retrieval doesn't follow the same pattern. You can't just add more dimensions to fix the fundamental limits.
Unlike language modeling where more parameters generally help, embedding retrieval hits hard dimensional limits. A 4096-dim model isn't "better" than 768-dim in a way that scales indefinitely. It just delays hitting the wall.
Business Implications
This paper has significant ramifications for organizations building search and retrieval systems.
For RAG Product Teams
Architecture Investment Priority: Stop optimizing embedding dimensions and start investing in hybrid retrieval. The math proves that single-vector embeddings have hard limits. Multi-vector, sparse, and re-ranking layers are not optional.
Expectation Setting: If your product requires complex compositional queries ("blue trail-running shoes, size 10, under $100"), single-vector retrieval will systematically fail. Set realistic expectations with stakeholders or redesign the architecture.
Benchmark Skepticism: Standard benchmarks hide these limitations because they have sparse relevance patterns. Build internal adversarial test suites that stress-test combinatorial retrieval.
For E-commerce and Marketplace Companies
Multi-Attribute Search Failure: The "averaging problem" explains why semantic search for multi-attribute queries is frustrating. Invest in faceted search, attribute extraction, and structured filtering as complements to semantic retrieval.
Customer Experience Impact: Users searching with specific criteria deserve accurate results. Architectures that fall back to "close enough" semantic matches degrade trust.
Recommendation System Design: Similar limitations apply to recommendation. Single embedding vectors struggle to capture "users who want X AND Y AND NOT Z." Hybrid approaches with explicit attribute handling perform better.
For Enterprise Search Teams
Document Retrieval Strategy: For complex enterprise queries spanning multiple topics, embedding-only search will miss relevant documents. Implement keyword search (BM25) as a primary method, not a fallback.
Scale Considerations: The theoretical limits get worse at scale. More documents in the corpus means more potential combinations to distinguish. What works for 100K documents may fail for 10M.
Cost-Benefit Analysis: Investing in larger embedding models (4096 dimensions vs 768) provides diminishing returns. Those resources are better spent on retrieval architecture diversity.
For AI Infrastructure Teams
Vector Database Limitations: Vector databases are powerful tools, but they inherit embedding limitations. Don't treat vector search as a complete retrieval solution.
Latency vs Accuracy Trade-offs: Multi-vector and re-ranking add latency. Budget for this in system design. The alternative (inaccurate single-vector results) is often worse for user experience.
Monitoring and Alerting: Build observability into retrieval pipelines that detects when embedding-based retrieval is failing on specific query patterns.
For AI Model Vendors
Product Positioning: Be honest about single-vector embedding limitations in documentation. Users deploying these models for complex compositional queries will encounter failures and attribute them to model quality.
Hybrid Tooling: Offer integrated hybrid retrieval solutions, not just embedding endpoints. The market needs end-to-end retrieval systems, not just embedding models.
For Executives and Decision Makers
Technical Debt Recognition: If your AI search is built entirely on single-vector embeddings, you have unaddressed technical debt. The math proves it will fail on certain query classes.
Investment Thesis: Hybrid retrieval isn't a luxury upgrade. It's necessary infrastructure. Budget accordingly.
Methodology Notes
Free Embedding Experiments
To establish theoretical upper bounds, researchers directly optimized query and document vectors:
- Optimizer: Adam with InfoNCE loss
- Test: All possible top-2 combinations
- Stopping criterion: Optimization failure indicates capacity limit
This "free embedding" scenario represents the best possible performance for a given dimension—real models perform worse due to generalization requirements.
LIMIT Dataset Construction
- Generated 1,850 attributes via LLM prompting
- Created person profiles with 50 attributes each
- Formulated queries as "Who likes [attribute]?"
- Formed queries from pairwise combinations of 46 core documents, so each query has exactly 2 relevant documents (C(46, 2) = 1,035 possible pairs)
This construction directly connects to the theoretical framework, making results interpretable through the lens of dimensional capacity.
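For intuition only, a schematic of this construction (the attribute names, profile text, and pairing below are placeholders consistent with the description above, not the released dataset):

```python
import itertools

# Placeholder attribute pool standing in for the LLM-generated attributes.
attributes = [f"attribute_{i}" for i in range(1850)]

# 46 "core" people; every pair of them shares one dedicated query attribute,
# giving C(46, 2) = 1,035 candidate queries with exactly 2 relevant documents each.
core_ids = list(range(46))
pair_attr = {pair: attributes[i] for i, pair in enumerate(itertools.combinations(core_ids, 2))}

def profile_text(person_id: int) -> str:
    liked = [attr for pair, attr in pair_attr.items() if person_id in pair]
    # Real profiles also mix in filler attributes and distractor documents; omitted for brevity.
    return f"Person {person_id} likes " + ", ".join(liked) + "."

queries = [(f"Who likes {attr}?", set(pair)) for pair, attr in pair_attr.items()]
print(queries[0])  # ("Who likes attribute_0?", {0, 1})
```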
Conclusion
This research fundamentally changes how we should think about embedding-based retrieval. The limits aren't about model quality or training data. They're mathematical constraints on what finite-dimensional vectors can represent.
For practitioners, the message is clear:
- Embedding retrieval has hard limits that scaling won't solve
- Standard benchmarks hide these limits due to sparse qrels
- Architecture choices matter more than model size for complex queries
- Hybrid approaches combining sparse, dense, and re-ranking are likely necessary for robust RAG systems
The LIMIT benchmark and accompanying code provide tools for the community to evaluate and understand these constraints in their own systems.
Additional Reading: The Vector Bottleneck: Limitations of Embedding-Based Retrieval by Shaped.ai
Original paper: arXiv ・ PDF ・ HTML
Cite this paper
Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee (2025). On the Theoretical Limitations of Embedding-Based Retrieval. arXiv 2025.