- The Problem. RAG benchmarks focus on text, but production documents contain charts, tables, and diagrams. Existing evaluations miss the multimodal reality of enterprise documents.
- The Benchmark. ViDoRe V3 covers 26K document pages across 10 professional domains with 3K human-verified queries in 6 languages. Every query includes retrieval relevance judgments, bounding box localization, and verified reference answers.
- The Findings. Visual retrievers beat text-only retrieval by more than 8 points. Textual rerankers deliver 66x more improvement than visual ones. And current VLMs fail at visual grounding, with an 85% gap to human performance.
Research Overview
If you have built a RAG system for real documents, you know the problem: your benchmark numbers look great, but users complain the system cannot find information in charts and tables. The disconnect is not your implementation. It is the benchmark.
Most RAG evaluations test text retrieval on text-heavy documents. They measure whether the system can find a paragraph that answers a question. But professional documents are not just paragraphs. Financial reports have revenue charts. Technical manuals have architecture diagrams. Research papers have results tables. When your benchmark ignores these, you are optimizing for the wrong thing.
A RAG benchmark is a standardized test for retrieval-augmented generation systems. It provides a set of documents, queries, and correct answers. You run your system on the queries, measure how often it retrieves the right documents and generates correct answers, then compare against other systems. Good benchmarks reflect real-world usage patterns.
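This evaluation loop is easy to sketch. A toy recall@k computation over hypothetical document IDs (all data here is illustrative, not from the benchmark):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents found in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

# Toy benchmark: two queries with known relevant pages
queries = {
    "q1": {"retrieved": ["p3", "p7", "p1"], "relevant": {"p3", "p9"}},
    "q2": {"retrieved": ["p2", "p5", "p8"], "relevant": {"p5"}},
}
scores = [recall_at_k(q["retrieved"], q["relevant"]) for q in queries.values()]
mean_recall = sum(scores) / len(scores)  # (0.5 + 1.0) / 2 = 0.75
```

Real benchmarks layer graded relevance and ranking-aware metrics on top of this skeleton, but the compare-systems-on-fixed-data loop is the same.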
ViDoRe V3 addresses this gap. It is a comprehensive multimodal RAG benchmark that tests retrieval, answer generation, and visual grounding on documents that actually look like what enterprises deal with: PDFs with charts, tables, images, and complex layouts.
The benchmark comes from a collaboration between Illuin Technology, NVIDIA, and CentraleSupelec. They invested 12,000 hours of human annotation to create something that existing benchmarks lack: ground truth for multimodal document understanding.
ViDoRe V3 Benchmark at a Glance
Human-verified multimodal RAG evaluation at scale
What makes it different
| Feature | Typical Benchmarks | ViDoRe V3 |
|---|---|---|
| Content | Text-only | Text, tables, charts, images |
| Annotation | Automated or crowd | Expert with 12K hours |
| Grounding | None | Bounding boxes per answer |
| Languages | English | 6 languages |
| Domains | General | 10 professional domains |
| Scale | Varies | 26K pages, 3K queries |
The benchmark evaluates three capabilities that matter for production RAG: Can the system find the right pages? Can it generate correct answers? Can it point to where in the document the answer came from?
The Benchmark
ViDoRe V3 covers 10 document corpora spanning professional domains such as Finance, Computer Science, Energy, Pharmaceuticals, Human Resources, Industrial Maintenance, Telecom, and Physics. Seven corpora are in English, three in French.
These are the domains where multimodal documents are common and RAG systems are deployed. A financial analyst needs to query earnings reports with charts. A maintenance technician needs to find procedures with diagrams. The benchmark tests what practitioners actually need.
Query taxonomy
The benchmark uses a dual-axis classification for queries:
Query types (what the user needs):
- Extractive: Find a specific fact
- Numerical: Find a number or calculation
- Boolean: Yes/no questions
- Open-ended: Synthesis or explanation
- Multi-hop: Requires combining information
- Compare-contrast: Analyze differences
- Enumerative: List multiple items
Query formats (how the user asks):
- Question: Natural language question
- Keyword: Search-style terms
- Instruction: Task-oriented request
This taxonomy enables fine-grained analysis. You can see that your system handles extractive questions well but struggles with multi-hop reasoning.
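That per-type breakdown is a simple group-by over tagged results. A sketch with toy scores (the tags and numbers are illustrative):

```python
from collections import defaultdict
from statistics import mean

# Per-query results tagged with the dual-axis taxonomy (toy data)
results = [
    {"type": "extractive", "format": "question", "ndcg": 0.70},
    {"type": "extractive", "format": "keyword",  "ndcg": 0.62},
    {"type": "multi-hop",  "format": "question", "ndcg": 0.41},
    {"type": "multi-hop",  "format": "keyword",  "ndcg": 0.35},
]

by_type = defaultdict(list)
for r in results:
    by_type[r["type"]].append(r["ndcg"])

# Average score per query type exposes where the system struggles
avg_by_type = {t: mean(scores) for t, scores in by_type.items()}
```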
Annotation methodology
The 12,000 hours of annotation produced three layers of ground truth:
- Retrieval relevance. For each query, which pages contain relevant information? Annotators rated pages as Not Relevant, Partially Relevant, or Fully Relevant.
- Bounding boxes. On relevant pages, exactly where is the supporting evidence? Annotators drew boxes around the specific text, table cells, or chart regions.
- Reference answers. What is the correct answer, grounded in the document? Multiple annotators wrote answers, which were then merged into consensus responses.
This multi-layer annotation is what sets ViDoRe V3 apart. Most benchmarks stop at retrieval relevance. The bounding boxes enable evaluation of visual grounding, a capability that existing benchmarks cannot measure.
Visual vs Textual Retrieval
The first major finding: visual retrievers consistently outperform text-only approaches.
Visual Retrievers Outperform Textual Ones
NDCG@10 retrieval performance across 10 datasets
The best visual retriever (ColEmbed-3B-v2) achieves 59.8 NDCG@10 compared to 51.0 for the best textual retriever (Qwen3-8B). That is an 8.8 point gap.
NDCG@10 stands for Normalized Discounted Cumulative Gain at rank 10. It measures how well the top 10 retrieved results match the ground truth, with correct results at higher ranks counting more. A score of 100 means a perfect ranking; scores above 50 are generally considered good for document retrieval tasks.
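The metric itself is a few lines. A minimal sketch, using graded relevance levels (mapping Fully/Partially/Not Relevant to 2/1/0 is an illustrative convention, not necessarily the paper's exact setup):

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k over graded relevances listed in retrieved order, in [0, 1]."""
    def dcg(rels: list[float]) -> float:
        # Each result's gain is discounted by its rank (log2 of position + 1)
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# 2 = Fully Relevant, 1 = Partially Relevant, 0 = Not Relevant
perfect = ndcg_at_k([2, 2, 1, 0, 0])   # already in ideal order -> 1.0
swapped = ndcg_at_k([0, 2, 2, 1, 0])   # relevant results pushed down -> < 1.0
```

Note the convention of reporting NDCG on a 0-100 scale, as the paper does, is just this value multiplied by 100.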
Why visual retrievers win
Visual retrievers process page images directly. They see the layout, the charts, the table structures. Textual retrievers work from extracted text, which loses this structural information.
Think of a city planner looking at a satellite map versus a tourist reading a list of street names. The planner can instantly spot where the river bends, where the park lies, and how neighborhoods connect. The tourist has addresses but no sense of geography. Visual retrievers are the planner: they see the whole page and understand spatial relationships. Textual retrievers are the tourist: they have the words but have lost the layout.
Consider a query about "Q3 revenue trends." The textual retriever might find a paragraph mentioning revenue. The visual retriever sees the actual chart showing the trend line. It matches the query to visual evidence that the text extraction missed entirely.
This advantage holds across model sizes. A 3B visual retriever beats an 8B textual retriever. The modality matters more than the parameter count.
Late interaction beats dense embeddings
Within visual retrievers, late interaction models (ColEmbed, ColQwen) outperform single-vector dense embeddings (Nomic).
| Model Type | Best Score |
|---|---|
| Late interaction visual | 59.8 |
| Dense visual | 49.0 |
| Late interaction textual | 51.0 |
| Dense textual | 44.9 |
| BM25 (keyword) | 20.3 |
BM25 is a classic keyword-based ranking algorithm that scores documents by term frequency and inverse document frequency. It has been the standard baseline for search engines for decades. The low score here (20.3) shows why neural retrievers have become essential for complex document queries.
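The Okapi BM25 formula is short enough to sketch in full (toy corpus; `k1` and `b` are the customary defaults):

```python
import math
from collections import Counter

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Okapi BM25: rewards term frequency, penalizes common terms and long docs."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        # Saturating term-frequency component with length normalization
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

corpus = [["revenue", "grew", "in", "q3"],
          ["the", "office", "moved", "in", "q3"],
          ["maintenance", "schedule", "for", "pumps"]]
query = ["revenue", "q3"]
scores = [bm25_score(query, d, corpus) for d in corpus]
best = scores.index(max(scores))  # doc 0 mentions both query terms
```

The limitation is visible in the sketch: BM25 only sees exact token matches, so a chart showing a revenue trend without the word "revenue" in extractable text scores zero.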
Late interaction models keep separate embeddings for each token or image patch, then compute fine-grained similarity at query time. Dense embeddings compress everything into a single vector, losing detail. Late interaction is like keeping every book on individual shelves versus writing a one-page summary of the entire library.
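The difference is easy to see numerically. A sketch comparing ColBERT-style MaxSim to mean-pooled single-vector scoring on random unit embeddings (toy dimensions; real models use learned token and patch embeddings):

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction (MaxSim): each query token takes its best-matching
    document token; the per-token maxima are summed."""
    sim = query_tokens @ doc_tokens.T        # (n_query, n_doc) cosine similarities
    return float(sim.max(axis=1).sum())

def dense_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Single-vector baseline: mean-pool all tokens into one vector each."""
    return float(query_tokens.mean(axis=0) @ doc_tokens.mean(axis=0))

rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(4, 8)))                         # 4 query tokens
d_match = np.vstack([q, normalize(rng.normal(size=(16, 8)))])  # doc contains the query tokens
d_other = normalize(rng.normal(size=(20, 8)))                  # unrelated doc

late_match = maxsim_score(q, d_match)  # every query token finds an exact match -> 4.0
late_other = maxsim_score(q, d_other)  # strictly lower
```

MaxSim preserves per-token evidence, so a document that contains the exact query tokens scores maximally even when the rest of the page is noise; mean-pooling dilutes that signal into one vector.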
The Reranker Surprise
The second major finding caught the researchers off guard: textual rerankers massively outperform visual ones.
Textual Rerankers Deliver Massive Gains
Adding a reranker to retrieval pipelines (NDCG@10)
Adding a textual reranker (zerank-2) to a textual retrieval pipeline improves NDCG@10 by +13.2 points. Adding a visual reranker (jina-reranker-m0) to a visual pipeline improves by only +0.2 points.
That is a 66x difference in reranker impact.
What this means for pipeline design
The optimal pipeline is not what you might expect:
| Pipeline | Final Score |
|---|---|
| Visual retriever + textual reranker | 63.6 |
| Visual retriever + visual reranker | 57.8 |
| Textual retriever + textual reranker | 63.6 |
| Textual retriever only | 50.4 |
The best results come from combining visual retrieval with textual reranking. The visual retriever finds candidates that text extraction would miss. The textual reranker, working from extracted text, provides superior ranking.
The paper does not fully explain this gap, but likely factors include: textual rerankers have more training data, text provides clearer semantic signals for ranking, and visual rerankers may struggle with the diverse document layouts in the benchmark. This is an area for future research.
The practical implication
If you are building a RAG pipeline for multimodal documents:
- Use a visual retriever for the initial candidate selection
- Add a textual reranker for the final ranking
- Do not assume visual rerankers will help just because you have visual content
The +13.2 point gain from textual reranking is one of the largest improvements available in the RAG pipeline. It costs only reranker inference on your top-k candidates.
Query Complexity Matters
Not all queries are created equal. ViDoRe V3 reveals a 30-point performance gap between simple and complex query types.
Simple Queries Are Easy, Complex Ones Are Hard
ColEmbed-3B-v2 NDCG@10 by query type
The complexity gradient
| Query Type | NDCG@10 | Complexity |
|---|---|---|
| Boolean | 72 | Simple |
| Numerical | 68 | Simple |
| Extractive | 62 | Medium |
| Enumerative | 58 | Medium |
| Compare-Contrast | 52 | Complex |
| Open-ended | 48 | Complex |
| Multi-hop | 42 | Complex |
Simple queries ask for a specific fact that exists in one location. Complex queries require synthesis across multiple pieces of information or reasoning about relationships.
Why multi-hop is hardest
Multi-hop queries require finding information in one place, then using it to find more information elsewhere. "What was the revenue growth in the region that had the highest market share?" requires:
- Finding the market share data
- Identifying the highest region
- Finding revenue growth for that specific region
Current retrievers struggle because they match queries to individual pages. They do not naturally chain retrievals based on intermediate findings.
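A toy sketch of what chained retrieval would look like, with a trivial word-overlap retriever and a hand-written query rewrite standing in for an LLM extraction step (everything here is illustrative):

```python
def retrieve(query: str, corpus: dict[str, str]) -> str:
    """Toy retriever: return the page sharing the most words with the query."""
    q_words = set(query.lower().split())
    return max(corpus, key=lambda pid: len(q_words & set(corpus[pid].lower().split())))

# Toy corpus for the two-hop revenue question
corpus = {
    "p1": "market share by region EMEA holds the highest market share",
    "p2": "EMEA revenue growth was 12 percent this quarter",
    "p3": "APAC revenue growth was 9 percent this quarter",
}

# Hop 1: answer the sub-question about market share
hop1_page = retrieve("which region has the highest market share", corpus)

# Hop 2: a real system would extract "EMEA" from hop 1 with an LLM;
# here the rewritten query is hand-written for illustration
hop2_page = retrieve("EMEA revenue growth", corpus)
```

A single-shot retriever issuing the full query once has no mechanism for this intermediate rewrite, which is why multi-hop sits at the bottom of the table.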
Query format effects
Question-format queries consistently outperform instruction and keyword formats:
| Format | Average NDCG@10 |
|---|---|
| Question | 58 |
| Instruction | 52 |
| Keyword | 48 |
This suggests retrieval models are better trained on question-answering data than on keyword search or task instructions. If your users tend toward keyword searches, expect lower performance than benchmarks suggest.
Visual Grounding Gap
The most striking finding: current VLMs cannot reliably point to where answers come from.
Visual Grounding: Models Lag Far Behind Humans
F1 score for bounding box localization
Human annotators agree with each other at F1 = 0.602 on bounding box localization. The best VLMs achieve F1 = 0.089 (Qwen3-VL-30B) and F1 = 0.065 (Gemini 3 Pro).
That is an 85% gap between human and model performance.
Visual grounding means pointing to the specific location in a document that supports an answer. Instead of just saying "revenue grew 15%," a grounded system would highlight the exact cell in the table or region of the chart that shows this. This is critical for user trust and verification.
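Grounding is typically scored by matching predicted boxes to annotated boxes via intersection-over-union. A minimal sketch (the 0.5 IoU threshold and greedy one-to-one matching are assumptions for illustration, not necessarily the paper's exact protocol):

```python
def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounding_f1(pred: list, gold: list, threshold: float = 0.5) -> float:
    """F1 over boxes: a prediction is a true positive if it overlaps an
    unmatched gold box with IoU >= threshold."""
    unmatched, tp = list(gold), 0
    for p in pred:
        match = next((g for g in unmatched if iou(p, g) >= threshold), None)
        if match is not None:
            unmatched.remove(match)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [(10, 10, 50, 50), (60, 60, 100, 100)]
pred = [(12, 12, 52, 52)]       # hits the first gold box, misses the second
f1 = grounding_f1(pred, gold)   # precision 1.0, recall 0.5 -> F1 ~ 0.667
```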
Where models fail
The page-level analysis reveals the bottleneck:
| Outcome | Qwen3-VL | Gemini 3 Pro |
|---|---|---|
| Both annotated page | 17% | 16% |
| Neither annotated | 46% | 49% |
| Model only | 10% | 7% |
| Human only | 26% | 27% |
In 26-27% of cases, humans annotated a page that the model missed entirely. The problem is recall, not precision: models are not so much drawing wrong boxes as failing to draw boxes at all.
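The recall-versus-precision claim can be checked directly from the Qwen3-VL column above, treating "both annotated" as true positives, "model only" as false positives, and "human only" as false negatives:

```python
# Page-level outcomes for Qwen3-VL, as percentages from the table above
both, model_only, human_only = 17, 10, 26

# Of the pages the model flags, how many did humans also flag?
precision = both / (both + model_only)   # ~0.63
# Of the pages humans flag, how many does the model find?
recall = both / (both + human_only)      # ~0.40
```

Precision near 0.63 against recall near 0.40 supports the reading that models under-annotate rather than over-annotate.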
Why this matters
Visual grounding is not a nice-to-have. For enterprise RAG systems, users need to verify answers against source documents. If the system cannot point to where it found information, users must manually search the document to confirm accuracy.
The 85% gap means current VLMs cannot reliably provide this capability. Teams deploying RAG systems should not promise visual grounding until model capabilities improve significantly.
End-to-End Generation
The benchmark also evaluates final answer quality, not just retrieval. This reveals how retrieval choices affect downstream generation.
Context modality matters
With the same generator (Gemini 3 Pro), different context types yield different accuracy:
| Context Type | Hard Query Accuracy |
|---|---|
| Image (oracle) | 64.7% |
| Text (oracle) | 62.3% |
| Hybrid (oracle) | 63.4% |
| Image (retrieved) | 54.5% |
| Hybrid (retrieved) | 54.7% |
| Text (retrieved) | 52.1% |
Image context outperforms text context by 2-3 points on hard queries. The visual information that text extraction loses actually matters for answer generation.
The benchmark classifies queries as "easy" if any of six LLMs can answer them correctly without document context (from parametric knowledge). "Hard" queries require the document. About 51% of queries are hard. Performance on hard queries better reflects true RAG capability.
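The easy/hard split reduces to a simple rule over closed-book results. A sketch with hypothetical per-model outcomes:

```python
def is_hard(closed_book_correct: list[bool]) -> bool:
    """Hard if no closed-book LLM answered correctly without the document."""
    return not any(closed_book_correct)

# Hypothetical closed-book results: one bool per LLM, six LLMs per query
closed_book = {
    "q1": [True, False, False, False, False, False],   # one model knew it -> easy
    "q2": [False, False, False, False, False, False],  # no model knew it -> hard
}
hard_queries = [q for q, results in closed_book.items() if is_hard(results)]
```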
The oracle gap
An oracle is a hypothetical perfect retriever that always returns exactly the right documents. It represents the upper bound of what any retrieval system could achieve. Comparing real systems to the oracle shows how much room remains for improvement.
The best non-oracle pipeline achieves 54.7% on hard queries. The image oracle achieves 64.7%. That is a 10-point gap representing the ceiling for retrieval improvements.
Even with perfect retrieval (oracle), Gemini 3 Pro only reaches 64.7%. The remaining 35-point gap comes from generation limitations: the model has all the right context but still produces wrong answers.
Model comparison
| Generator | Hard Query Accuracy |
|---|---|
| GPT-5.2 | 54.1% |
| Gemini 3 Pro | 54.5% |
| Gemini 3 Flash | 50.3% |
| Qwen3-VL-235B | 51.0% |
The differences between top models are smaller than the retrieval-to-oracle gap. Improving retrieval matters more than switching generators.
Practical Takeaways
Based on the benchmark findings, here are actionable recommendations for RAG pipeline builders:
1. Use visual retrievers
Even if you have good OCR and text extraction, visual retrievers capture layout and structural information that text misses. The 8.8-point advantage is too large to ignore.
Recommended models: ColEmbed-3B-v2, Jina-v4 (visual mode), ColNomic
2. Add textual reranking
The +13.2 point improvement from textual reranking is the single largest gain available. Add it to any pipeline, visual or textual.
Recommended rerankers: zerank-2, BGE-reranker-v2
3. Provide multimodal context to generators
For hard queries, hybrid context (text + images) outperforms text-only. Pass both extracted text and page images to your generator.
4. Set appropriate expectations for complex queries
Your system will handle Boolean and Numerical queries well. Multi-hop and Open-ended queries will struggle. Design your user experience around these limitations.
5. Do not promise visual grounding
Current VLMs cannot reliably point to source locations. Until the 85% gap closes, visual grounding should be treated as experimental, not production-ready.
Implementation Notes
Integrating with MTEB
ViDoRe V3 is integrated into the MTEB (Massive Text Embedding Benchmark) ecosystem. You can evaluate your retrieval models directly:
```python
# Evaluate a retrieval model on ViDoRe V3 through MTEB
from mteb import MTEB

benchmark = MTEB(tasks=["ViDoRe-v3"])
results = benchmark.run(model)
```
Accessing the benchmark
The public datasets (8 of 10) are available on HuggingFace:
```python
from datasets import load_dataset

dataset = load_dataset("vidore/vidore-v3-finance-en")
```
Two datasets are held out as a private test set to prevent overfitting.
Retrieval pipeline example
A minimal visual retrieval pipeline:
```python
from colpali_engine import ColEmbed

# Load the visual retriever
retriever = ColEmbed.from_pretrained("vidore/colembed-3b-v2")

# Index documents: embed each page image
embeddings = retriever.encode_images(page_images)

# Embed the query and score every page
# (simplified: real late-interaction scoring uses MaxSim over token embeddings)
query_emb = retriever.encode_queries([query])
scores = (query_emb @ embeddings.T).squeeze(0)

# Indices of the 10 highest-scoring pages, best first
top_k = scores.argsort()[-10:][::-1]
```
Reranking stage
Add textual reranking to top-k candidates:
```python
from rerankers import Reranker

reranker = Reranker("zerank-2")

# Extract text from the top-k pages
candidates = [extract_text(pages[i]) for i in top_k]

# Rerank candidates against the query, then reorder the page indices
reranked = reranker.rank(query, candidates)
final_order = [top_k[i] for i in reranked.indices]
```
Limitations
Domain coverage
The benchmark covers 10 professional domains, but enterprise documents vary widely. Performance on your specific domain may differ from benchmark averages.
Query distribution
Extractive queries dominate due to annotation practicality. Multi-hop queries, which are often most important for users, are underrepresented.
Language scope
While 6 languages are supported, the primary documents are English and French. Performance on other languages relies on query translation, which may introduce artifacts.
Private test set
Two datasets are held private to prevent overfitting. This is good for benchmark integrity but limits full analysis of model behavior across all domains.
Visual grounding subjectivity
Bounding box annotation has inherent subjectivity (human F1 = 0.602, not 1.0). The "ground truth" is approximate, which complicates model evaluation.
Original paper: arXiv ・ PDF ・ HTML
Benchmark: HuggingFace ・ MTEB Leaderboard
Authors: Antonio Loison, Quentin Mace, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Celine Hudelot, Gautier Viaud (Illuin Technology, NVIDIA, CentraleSupelec)
Cite this paper
Antonio Loison, Quentin Mace, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Celine Hudelot, Gautier Viaud (2026). ViDoRe V3: The Benchmark That Exposes What Your RAG Pipeline Cannot See. arXiv 2026.