Research Overview
As embedding-based retrieval becomes the backbone of RAG systems and semantic search, this Google DeepMind research delivers a sobering message: vector embeddings have fundamental mathematical limits that no amount of training data or model scaling can overcome.
Vector embeddings convert text into lists of numbers (like [0.2, -0.5, 0.8, ...]). Each number represents some aspect of the text's meaning. When two texts have similar meanings, their number lists are mathematically "close" to each other. This lets computers find similar content by comparing these numbers instead of matching exact words.
The research team proves theoretically that the number of unique document combinations an embedding model can return is bounded by its dimension. They then validate this with the LIMIT benchmark, where models like GritLM and E5-Mistral—despite 4096 dimensions—achieve only 8-19% recall while simple lexical matching (BM25) reaches 85-94%.
BM25 is a classic keyword-matching algorithm from the 1990s. It scores documents based on how often query words appear, with adjustments for document length and word rarity. While it can't understand synonyms or meaning like embeddings can, it's often surprisingly effective. As this research shows, it doesn't suffer from the same dimensional limits.
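For readers who want the mechanics, here is a minimal, self-contained sketch of the standard Okapi BM25 scoring function (the function and variable names are illustrative, k1 and b use common default values, and this is not the exact implementation evaluated in the paper):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25: rewards term frequency, discounts long documents,
    and up-weights rare terms via inverse document frequency (IDF)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # how many docs contain the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rarity weight (Lucene-style, never negative)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

# Tiny example over a tokenized toy corpus.
corpus = [["flat", "tire", "repair"], ["change", "a", "punctured", "wheel"], ["bicycle", "chain", "care"]]
print(bm25_score(["flat", "tire"], corpus[0], corpus))
```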
Why This Matters
Traditional search matches exact keywords. Semantic search understands meaning. If you search for "how to fix a flat tire," semantic search also returns results about "changing a punctured wheel" because it understands they mean the same thing. Most modern AI search uses vector embeddings to achieve this.
For anyone building RAG systems, semantic search, or recommendation engines:
- Scaling embeddings won't solve everything - There are hard mathematical limits
- Simple baselines may outperform - BM25 beat all embedding models on LIMIT
- Architecture choices matter - Multi-vector and cross-encoders bypass these limits
- Evaluation benchmarks may be misleading - Standard datasets don't stress-test these constraints
Theoretical Foundation
The authors connect embedding retrieval to established mathematical concepts from learning theory and communication complexity. The key insight: retrieval is fundamentally a low-rank matrix factorization problem, and embedding dimension directly constrains how many distinct document combinations can be precisely ranked.
Imagine a spreadsheet where rows are queries and columns are documents, with each cell showing a relevance score. To use embeddings, you need to express this entire spreadsheet using just a few numbers per row and column. The "rank" is basically how complex the patterns in the spreadsheet can be. Low rank means limited expressiveness. This is why embedding dimension matters: it directly limits how many unique patterns you can capture.
This isn't a semantic understanding problem—it's purely combinatorial geometry.
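A few lines of numpy make this concrete (an illustrative toy, not an experiment from the paper): every score a d-dimensional embedding model can produce is an entry of a matrix with rank at most d.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary relevance matrix: rows = queries, columns = documents.
A = rng.integers(0, 2, size=(8, 12))

# With d-dimensional embeddings, every retrievable score is an entry of Q @ D.T,
# where Q is (num_queries x d) and D is (num_docs x d).
d = 3
Q = rng.normal(size=(8, d))
D = rng.normal(size=(12, d))
S = Q @ D.T

# The score matrix has rank at most d no matter how the embeddings were trained;
# reproducing a higher-rank relevance pattern exactly is impossible.
print(np.linalg.matrix_rank(S))  # at most 3
print(np.linalg.matrix_rank(A))  # typically larger than 3 for a random 8x12 binary matrix
```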
The Core Theorem
For any retrieval task with a binary relevance matrix (query × document), the minimum embedding dimension required equals the sign-rank of that matrix.
Binary relevance matrix: A simple yes/no grid. For each query-document pair, either the document is relevant (1) or not (0). No "somewhat relevant" scores.
Sign-rank: The smallest rank of any real matrix whose entry signs match the relevance matrix (coded as ±1). Equivalently, it is the minimum embedding dimension needed to correctly separate relevant from non-relevant documents for every possible query. Some simple-looking matrices require enormous dimensions to represent this way.
In simpler terms:
"The number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding."
This isn't a limitation of current models—it's a fundamental constraint on what single-vector representations can express.
Mathematical Formulation
The researchers define two key concepts:
- Row-wise Order-Preserving Rank: The minimum dimension needed to preserve the relative ordering of document scores for all queries
- Row-wise Thresholdable Rank: The minimum dimension needed to separate relevant from non-relevant documents via a threshold
They prove these are equivalent for binary relevance, connecting to the well-studied sign-rank problem where even simple matrices can require exponentially high dimensions.
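One way to write the shape of these two definitions, following the informal descriptions above (the notation is mine and a sketch only; the paper's exact formulation may differ in details):

```latex
% Relevance matrix A in {0,1}^(m x n); query embeddings U in R^(m x d);
% document embeddings V in R^(n x d); score matrix S = U V^T.

\mathrm{rank}_{\text{threshold}}(A) = \min\bigl\{ d :
  \exists\, U, V, \tau \ \text{s.t.}\ (UV^{\top})_{ij} > \tau_i \iff A_{ij} = 1 \bigr\}

\mathrm{rank}_{\text{order}}(A) = \min\bigl\{ d :
  \exists\, U, V \ \text{s.t.}\ A_{ij} > A_{ik} \Rightarrow (UV^{\top})_{ij} > (UV^{\top})_{ik} \bigr\}
```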
Practical Bounds
The dimension-to-capacity relationship follows a cubic polynomial:
Critical-n = −10.53 + 4.03d + 0.052d² + 0.0037d³
Where d is the embedding dimension. This yields approximate limits:
| Dimension | Max Documents (Top-2) |
|---|---|
| 512 | ~500,000 |
| 768 | ~1,700,000 |
| 1024 | ~4,000,000 |
| 4096 | ~250,000,000 |
Beyond these thresholds, a single-vector system can no longer represent every top-2 combination of documents, so some multi-faceted queries are necessarily mis-ranked.
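The table above follows directly from this fitted curve; a few lines of Python reproduce it (using the rounded coefficients quoted above, so the outputs are approximate):

```python
def critical_n(d: float) -> float:
    """Fitted curve from the paper's free-embedding experiments: an extrapolated
    estimate of the largest corpus size whose top-2 combinations remain
    representable at embedding dimension d (not an exact closed-form law)."""
    return -10.53 + 4.03 * d + 0.052 * d**2 + 0.0037 * d**3

for d in (512, 768, 1024, 4096):
    print(f"d={d:5d}  critical-n ≈ {critical_n(d):,.0f}")
```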
Figure: Embedding dimension vs document capacity. Maximum unique document combinations for top-2 retrieval (cubic polynomial relationship).
The LIMIT Benchmark
To empirically validate these theoretical limits, the authors created LIMIT (Limitations of Instruction-following Models in Information-Retrieval Tasks).
Dataset Design
- 50,000 documents: Person profiles with 50 attributes each
- 1,000 queries: "Who likes [attribute]?" format
- 2 relevant docs per query, drawn from all pairs of 46 core documents: Designed to stress-test combination limits
- 1,850 unique attributes: Generated via LLM to ensure natural language diversity
The key insight: by controlling the query-relevance density (how many documents are relevant per query), they can directly test theoretical capacity limits.
Why Standard Benchmarks Miss This
Standard retrieval benchmarks have very low qrel density:
Qrel (query relevance) refers to how many documents are relevant per query. Density is the fraction of all documents that are relevant. Most benchmarks have very sparse relevance, maybe 1-5 documents out of millions are relevant per query. This makes the retrieval task "easy" because models just need to find a needle in a haystack, not distinguish between many similar options.
| Dataset | Graph Density | Avg Query Strength |
|---|---|---|
| BEIR | 0.000-0.025 | 0.00-0.59 |
| HotpotQA | ~0.001 | ~0.10 |
| LIMIT | 0.085 | 28.47 |
LIMIT's density is 3-85x higher than standard benchmarks, exposing limitations that typical evaluations never reveal.
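As a rough illustration, one simple way to compute qrel density from a qrels mapping is shown below (identifiers are illustrative; the paper's "graph density" statistic may be defined somewhat differently, e.g. restricted to documents that have at least one qrel):

```python
def qrel_density(qrels: dict[str, set[str]], num_docs: int) -> float:
    """Fraction of all (query, document) pairs marked relevant.
    qrels maps each query id to the set of its relevant document ids."""
    num_relevant_pairs = sum(len(docs) for docs in qrels.values())
    return num_relevant_pairs / (len(qrels) * num_docs)

# Toy example: 3 queries over a 10-document corpus -> 6 relevant pairs / 30 pairs = 0.2.
qrels = {"q1": {"d1", "d2"}, "q2": {"d2"}, "q3": {"d5", "d7", "d9"}}
print(qrel_density(qrels, num_docs=10))
```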
Experimental Results
Model Performance on LIMIT
Figure: Model performance on the LIMIT benchmark. BM25 achieves 85-94% recall@100 while 4096-dimensional embedding models reach only 8-19%.
Recall measures what percentage of correct documents your search actually finds. Recall@100 means: "Of all the correct answers, how many appear in the top 100 results?" If recall is 85%, you're finding 85% of correct documents. If it's 8%, you're missing 92% of the right answers.
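For reference, a minimal sketch of the metric (identifiers are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 100) -> float:
    """Share of the relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Two relevant documents, only one of them retrieved -> recall@100 = 0.5.
print(recall_at_k(["d7", "d2", "d9"], relevant={"d2", "d4"}, k=100))
```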
The results are striking. Despite task simplicity (matching attributes), embedding models collapse:
| Model | Dimensions | Recall@100 |
|---|---|---|
| BM25 | High (sparse) | 85.7-93.6% |
| GTE-ModernColBERT | Multi-vector | 54.8% |
| Promptriever | 4096 | 18.9% |
| GritLM | 4096 | 12.9% |
| E5-Mistral | 4096 | 8.3% |
| Qwen3 Embed | 4096 | 4.8% |
Dimension Impact
Figure: Recall@100 on the LIMIT benchmark by embedding dimension. Single-vector embeddings collapse when qrel density is high.
Larger embeddings significantly outperform smaller ones, but even 4096-dimensional models hit hard limits when qrel patterns are dense.
Ruling Out Domain Shift
A critical question: Are models failing because LIMIT is "out of distribution"?
When AI models are trained on one type of data (e.g., Wikipedia) but tested on different data (e.g., medical records), they often perform poorly. This is called "domain shift" or being "out of distribution." The skeptic's objection here: maybe LIMIT is just too different from what embedding models were trained on?
The researchers tested this directly:
| Condition | Recall@10 Improvement |
|---|---|
| Train on LIMIT (in-domain) | 0-2.8% |
| Overfit to test set | Near-perfect |
Conclusion: Models can learn the task when directly optimized for it, but generalizable embeddings cannot represent the required combinations. This is a capacity problem, not a distribution problem.
Qrel Pattern Effects
The density of relevant documents per query dramatically affects performance:
- Sparse qrels (few relevant docs): Models perform reasonably
- Dense qrels (many relevant docs): Performance collapses
GritLM showed a 50 absolute point drop in recall@100 when moving from sparse to dense patterns. E5-Mistral experienced 10x performance reduction.
Real-World Pain Points
While LIMIT uses synthetic data, these limitations manifest in practical applications every day:
RAG Systems
RAG (Retrieval-Augmented Generation) gives AI chatbots access to external knowledge. Before answering your question, the AI searches a database for relevant documents, then uses that information in its response. This lets AI provide accurate, up-to-date information instead of relying only on its training data.
Consider a query like "Compare the fiscal policies of FDR and Reagan". A single-vector embedding must compress this multi-faceted intent into one point in space. The result? The embedding averages the intent rather than retrieving distinct relevant documents for both presidents simultaneously.
This explains why RAG systems often return topically related but not precisely relevant documents—the embedding literally cannot represent "I need documents about X AND documents about Y" in a single vector.
E-commerce Search
A query like "Blue trail-running shoes, size 10, under $100" reduces to one point in embedding space. The retrieval system produces results matching some criteria rather than all, because the single vector cannot encode the conjunctive logic required.
Multi-hop Reasoning
Questions requiring evidence from multiple sources face compounded failures. Each "hop" requires retrieving specific document combinations, but the embedding bottleneck prevents the system from maintaining the necessary distinctions.
The Averaging Problem
At its core, single-vector embeddings perform a lossy compression. When a query has multiple distinct information needs, the embedding becomes an average of those needs—and averaging rarely produces optimal results for any individual need.
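A deliberately stylized numpy example of the averaging problem, using a made-up two-axis "embedding space" with one axis per sub-intent:

```python
import numpy as np

doc_fdr     = np.array([1.0, 0.0])   # strongly about FDR's fiscal policy
doc_reagan  = np.array([0.0, 1.0])   # strongly about Reagan's fiscal policy
doc_generic = np.array([0.6, 0.6])   # vaguely about both, precise about neither

# A single query vector for "compare FDR and Reagan" lands near the average of the two intents.
query = (doc_fdr + doc_reagan) / 2

for name, doc in [("FDR doc", doc_fdr), ("Reagan doc", doc_reagan), ("generic doc", doc_generic)]:
    print(name, float(query @ doc))

# The vague document scores highest (0.6 vs 0.5), even though the ideal answer
# is the pair {FDR doc, Reagan doc} -- the query vector cannot ask for both at once.
```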
Alternative Architectures
Not all retrieval architectures suffer these limitations equally:
Single-vector: One list of numbers per document, one per query. Fast but limited expressiveness. Like summarizing a book in one sentence.
Multi-vector: Multiple lists of numbers per document (often one per token/word). More expressive but requires more storage and computation. Like keeping several key sentences from a book.
Cross-encoder: Feeds both query AND document into the model together, getting a direct relevance score. Most accurate but slowest because it must run the model for every query-document pair. Like having a human read both the question and each document.
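To make the contrast concrete, here is a minimal numpy sketch of how the first two architectures score a query-document pair (the function names are mine; MaxSim is the ColBERT-style scoring rule). Cross-encoders have no precomputed document vectors at all, so they appear only as a comment:

```python
import numpy as np

def score_single_vector(q: np.ndarray, d: np.ndarray) -> float:
    """Single-vector: one dot product between a query vector and a document vector."""
    return float(q @ d)

def score_multi_vector_maxsim(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Multi-vector (ColBERT-style MaxSim): each query token vector takes its best
    match among the document token vectors; the per-token maxima are summed."""
    sim = q_tokens @ d_tokens.T            # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Cross-encoder: score = model(query_text, document_text) -- one full forward
# pass per query-document pair, with nothing precomputable per document.
```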
Cross-Encoders
Gemini-2.5-Pro (as a cross-encoder) achieved 100% accuracy on LIMIT-small. Cross-encoders score each query-document pair independently, avoiding the representation bottleneck.
Trade-off: Computational cost scales linearly with corpus size, making them impractical for large-scale first-stage retrieval.
Multi-Vector Models
ColBERT-style models that maintain multiple vectors per document significantly outperform single-vector approaches:
- GTE-ModernColBERT: 54.8% recall@100 (vs. 8-19% for single-vector)
Why it works: Multiple vectors can capture different aspects of a document, effectively increasing representational capacity.
Sparse Retrieval
BM25's strong performance (85-94%) suggests that high-dimensional sparse representations sidestep the embedding dimension bottleneck entirely.
Implication: Hybrid retrieval (sparse + dense) may be essential for robust systems.
Implications for RAG Systems
The L1/L2 Cache Mental Model
Think of retrieval architectures like CPU caches:
- Single-vector embeddings = L1 cache: Fast semantic filtering that handles simple similarity well but has strict capacity limits
- Multi-vector/sparse methods = L2 cache: Higher capacity for precise combinatorial logic, worth the extra latency
The key insight: stop blindly scaling embedding dimensions. Hybrid search isn't a fallback—it's the correct architecture for compositional queries.
1. Don't Trust Embeddings Blindly
For instruction-following queries that require specific document combinations, embedding retrieval may systematically fail. Consider:
- Adding BM25 or other sparse methods as primary retrieval (not just fallback)
- Using multi-stage retrieval with re-ranking
- Monitoring retrieval quality on representative queries
2. Benchmark Selection Matters
Standard benchmarks with sparse qrels may give false confidence. When evaluating retrieval systems:
- Test on queries requiring multiple relevant documents
- Measure performance across varying qrel densities
- Consider creating domain-specific stress tests
3. Architecture Decisions Have Fundamental Impact
The choice between single-vector, multi-vector, and cross-encoder isn't just about accuracy-speed tradeoffs—it determines what's theoretically possible.
| Architecture | Capacity | Speed | Use Case |
|---|---|---|---|
| Single-vector | Limited | Fast | Simple similarity, initial filtering |
| Multi-vector | Higher | Medium | Nuanced re-ranking, complex queries |
| Sparse (SPLADE, BM25) | High | Fast | Lexical-semantic signals, conjunctive queries |
| Cross-encoder | Unlimited | Slow | Final re-ranking, small corpora |
| Hybrid (Recommended) | Combined | Variable | Production RAG |
Hybrid retrieval combines multiple methods. Typically, fast embedding search handles initial filtering, followed by BM25 keyword matching and/or re-ranking with more expensive models. The idea: use each method where it excels. Embeddings catch semantic similarity; BM25 catches exact matches; re-rankers do final precision scoring. Most production RAG systems should use some form of hybrid approach.
The recommended approach: composable architectures that use single-vector embeddings for initial semantic filtering, then layer multi-vector or sparse methods for precise combinatorial retrieval.
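One lightweight way to combine the stages is reciprocal rank fusion (RRF), which merges ranked lists without requiring their scores to be comparable; a minimal sketch (the function name and document ids are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists (e.g. dense retrieval + BM25) by summing
    1 / (k + rank) for each document across the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["d3", "d8", "d1", "d5"]   # hypothetical embedding-search ranking
bm25_hits  = ["d8", "d2", "d3", "d9"]   # hypothetical keyword ranking
print(reciprocal_rank_fusion([dense_hits, bm25_hits])[:3])  # fused top-3, fed to a re-ranker
```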
4. Scaling Laws Don't Apply Here
In AI, "scaling laws" refer to the observation that bigger models with more training data generally perform better. This has driven the race to build larger language models. However, this research shows that embedding retrieval doesn't follow the same pattern. You can't just add more dimensions to fix the fundamental limits.
Unlike language modeling where more parameters generally help, embedding retrieval hits hard dimensional limits. A 4096-dim model isn't "better" than 768-dim in a way that scales indefinitely. It just delays hitting the wall.
Business Implications
This paper has significant ramifications for organizations building search and retrieval systems.
For RAG Product Teams
Architecture Investment Priority: Stop optimizing embedding dimensions and start investing in hybrid retrieval. The math proves that single-vector embeddings have hard limits. Multi-vector, sparse, and re-ranking layers are not optional.
Expectation Setting: If your product requires complex compositional queries ("blue trail-running shoes, size 10, under $100"), single-vector retrieval will systematically fail. Set realistic expectations with stakeholders or redesign the architecture.
Benchmark Skepticism: Standard benchmarks hide these limitations because they have sparse relevance patterns. Build internal adversarial test suites that stress-test combinatorial retrieval.
For E-commerce and Marketplace Companies
Multi-Attribute Search Failure: The "averaging problem" explains why semantic search for multi-attribute queries is frustrating. Invest in faceted search, attribute extraction, and structured filtering as complements to semantic retrieval.
Customer Experience Impact: Users searching with specific criteria deserve accurate results. Architectures that fall back to "close enough" semantic matches degrade trust.
Recommendation System Design: Similar limitations apply to recommendation. Single embedding vectors struggle to capture "users who want X AND Y AND NOT Z." Hybrid approaches with explicit attribute handling perform better.
For Enterprise Search Teams
Document Retrieval Strategy: For complex enterprise queries spanning multiple topics, embedding-only search will miss relevant documents. Implement keyword search (BM25) as a primary method, not a fallback.
Scale Considerations: The theoretical limits get worse at scale. More documents in the corpus means more potential combinations to distinguish. What works for 100K documents may fail for 10M.
Cost-Benefit Analysis: Investing in larger embedding models (4096 dimensions vs 768) provides diminishing returns. Those resources are better spent on retrieval architecture diversity.
For AI Infrastructure Teams
Vector Database Limitations: Vector databases are powerful tools, but they inherit embedding limitations. Don't treat vector search as a complete retrieval solution.
Latency vs Accuracy Trade-offs: Multi-vector and re-ranking add latency. Budget for this in system design. The alternative (inaccurate single-vector results) is often worse for user experience.
Monitoring and Alerting: Build observability into retrieval pipelines that detects when embedding-based retrieval is failing on specific query patterns.
For AI Model Vendors
Product Positioning: Be honest about single-vector embedding limitations in documentation. Users deploying these models for complex compositional queries will encounter failures and attribute them to model quality.
Hybrid Tooling: Offer integrated hybrid retrieval solutions, not just embedding endpoints. The market needs end-to-end retrieval systems, not just embedding models.
For Executives and Decision Makers
Technical Debt Recognition: If your AI search is built entirely on single-vector embeddings, you have unaddressed technical debt. The math proves it will fail on certain query classes.
Investment Thesis: Hybrid retrieval isn't a luxury upgrade. It's necessary infrastructure. Budget accordingly.
Methodology Notes
Free Embedding Experiments
To establish theoretical upper bounds, researchers directly optimized query and document vectors:
- Optimizer: Adam with InfoNCE loss
- Test: All possible top-2 combinations
- Stopping criterion: Optimization failure indicates capacity limit
This "free embedding" scenario represents the best possible performance for a given dimension—real models perform worse due to generalization requirements.
LIMIT Dataset Construction
- Generated 1,850 attributes via LLM prompting
- Created person profiles with 50 attributes each
- Formulated queries as "Who likes [attribute]?"
- Formed queries from pairwise combinations of 46 core documents, so each query has exactly 2 relevant documents (C(46, 2) = 1,035 possible pairs)
This construction directly connects to the theoretical framework, making results interpretable through the lens of dimensional capacity.
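For intuition only, a schematic of this construction (the attribute names, profile text, and pairing below are placeholders consistent with the description above, not the released dataset):

```python
import itertools

# Placeholder attribute pool standing in for the LLM-generated attributes.
attributes = [f"attribute_{i}" for i in range(1850)]

# 46 "core" people; every pair of them shares one dedicated query attribute,
# giving C(46, 2) = 1,035 candidate queries with exactly 2 relevant documents each.
core_ids = list(range(46))
pair_attr = {pair: attributes[i] for i, pair in enumerate(itertools.combinations(core_ids, 2))}

def profile_text(person_id: int) -> str:
    liked = [attr for pair, attr in pair_attr.items() if person_id in pair]
    # Real profiles also mix in filler attributes and distractor documents; omitted for brevity.
    return f"Person {person_id} likes " + ", ".join(liked) + "."

queries = [(f"Who likes {attr}?", set(pair)) for pair, attr in pair_attr.items()]
print(queries[0])  # ("Who likes attribute_0?", {0, 1})
```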
Conclusion
This research fundamentally changes how we should think about embedding-based retrieval. The limits aren't about model quality or training data. They're mathematical constraints on what finite-dimensional vectors can represent.
For practitioners, the message is clear:
- Embedding retrieval has hard limits that scaling won't solve
- Standard benchmarks hide these limits due to sparse qrels
- Architecture choices matter more than model size for complex queries
- Hybrid approaches combining sparse, dense, and re-ranking are likely necessary for robust RAG systems
The LIMIT benchmark and accompanying code provide tools for the community to evaluate and understand these constraints in their own systems.
Additional Reading: The Vector Bottleneck: Limitations of Embedding-Based Retrieval by Shaped.ai
Original paper: arXiv ・ PDF ・ HTML
Cite this paper
Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee (2025). On the Theoretical Limitations of Embedding-Based Retrieval. arXiv 2025.