arXiv 2026 · January 13, 2026

ViDoRe V3: The Benchmark That Exposes What Your RAG Pipeline Cannot See

Antonio Loison et al.

ViDoRe V3 is a comprehensive multimodal RAG benchmark that evaluates retrieval, generation, and visual grounding on visually rich documents. Testing state-of-the-art systems reveals that visual retrievers outperform textual ones, late-interaction models and textual reranking substantially improve performance, and current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding.

Categories: Information Retrieval, Multimodal AI, Benchmarks

Key Findings

  1. Visual retrievers beat text-only: ColEmbed-3B-v2 scores 59.8 NDCG@10 versus 51.0 for the best textual retriever, an 8.8-point advantage.

  2. Textual rerankers deliver massive gains: adding a text reranker improves retrieval by +13.2 points, while visual rerankers add only +0.2.

  3. Query complexity matters: simple queries (Boolean, Numerical) score about 30 points higher than complex ones (Multi-hop, Open-ended).

  4. Visual grounding is broken: human annotators achieve 0.602 F1 for bounding box localization, but the best VLMs score only 0.089.

  5. Hybrid context helps generation: combining text and image retrieval yields the best answer accuracy on hard queries (54.7% vs 52.1% text-only).

  6. Human-verified at scale: 12,000 hours of expert annotation across 26K pages, 3K queries, 10 datasets, and 6 languages.

TL;DR
  1. The Problem. RAG benchmarks focus on text, but production documents contain charts, tables, and diagrams. Existing evaluations miss the multimodal reality of enterprise documents.

  2. The Benchmark. ViDoRe V3 covers 26K document pages across 10 professional domains with 3K human-verified queries in 6 languages. Every query includes retrieval relevance, bounding box localization, and verified reference answers.

  3. The Findings. Visual retrievers beat text-only by 8+ points. Textual rerankers deliver 66x more improvement than visual ones. And current VLMs fail at visual grounding with an 85% gap to human performance.

Research Overview

If you have built a RAG system for real documents, you know the problem: your benchmark numbers look great, but users complain the system cannot find information in charts and tables. The disconnect is not your implementation. It is the benchmark.

Most RAG evaluations test text retrieval on text-heavy documents. They measure whether the system can find a paragraph that answers a question. But professional documents are not just paragraphs. Financial reports have revenue charts. Technical manuals have architecture diagrams. Research papers have results tables. When your benchmark ignores these, you are optimizing for the wrong thing.

What is a RAG benchmark?

A RAG benchmark is a standardized test for retrieval-augmented generation systems. It provides a set of documents, queries, and correct answers. You run your system on the queries, measure how often it retrieves the right documents and generates correct answers, then compare against other systems. Good benchmarks reflect real-world usage patterns.

ViDoRe V3 addresses this gap. It is a comprehensive multimodal RAG benchmark that tests retrieval, answer generation, and visual grounding on documents that actually look like what enterprises deal with: PDFs with charts, tables, images, and complex layouts.

The benchmark comes from a collaboration between Illuin Technology, NVIDIA, and CentraleSupelec. They invested 12,000 hours of human annotation to create something that existing benchmarks lack: ground truth for multimodal document understanding.

ViDoRe V3 Benchmark at a Glance

Human-verified multimodal RAG evaluation at scale

What makes it different

Feature    | Typical Benchmarks | ViDoRe V3
Content    | Text-only          | Text, tables, charts, images
Annotation | Automated or crowd | Expert, 12K hours
Grounding  | None               | Bounding boxes per answer
Languages  | English            | 6 languages
Domains    | General            | 10 professional domains
Scale      | Varies             | 26K pages, 3K queries

The benchmark evaluates three capabilities that matter for production RAG: Can the system find the right pages? Can it generate correct answers? Can it point to where in the document the answer came from?

The Benchmark

ViDoRe V3 covers 10 document corpora spanning Finance, Computer Science, Energy, Pharmaceuticals, Human Resources, Industrial Maintenance, Telecom, and Physics. Seven corpora are English, three are French.

Why these domains?

These are the domains where multimodal documents are common and RAG systems are deployed. A financial analyst needs to query earnings reports with charts. A maintenance technician needs to find procedures with diagrams. The benchmark tests what practitioners actually need.

Query taxonomy

The benchmark uses a dual-axis classification for queries:

Query types (what the user needs):

  • Extractive: Find a specific fact
  • Numerical: Find a number or calculation
  • Boolean: Yes/no questions
  • Open-ended: Synthesis or explanation
  • Multi-hop: Requires combining information
  • Compare-contrast: Analyze differences
  • Enumerative: List multiple items

Query formats (how the user asks):

  • Question: Natural language question
  • Keyword: Search-style terms
  • Instruction: Task-oriented request

This taxonomy enables fine-grained analysis. You can see that your system handles extractive questions well but struggles with multi-hop reasoning.

Annotation methodology

The 12,000 hours of annotation produced three layers of ground truth:

  1. Retrieval relevance. For each query, which pages contain relevant information? Annotators rated pages as Not Relevant, Partially Relevant, or Fully Relevant.

  2. Bounding boxes. On relevant pages, exactly where is the supporting evidence? Annotators drew boxes around the specific text, table cells, or chart regions.

  3. Reference answers. What is the correct answer, grounded in the document? Multiple annotators wrote answers, which were then merged into consensus responses.

This multi-layer annotation is what sets ViDoRe V3 apart. Most benchmarks stop at retrieval relevance. The bounding boxes enable evaluation of visual grounding, a capability that existing benchmarks cannot measure.
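The three layers can be pictured as one record per query. The field names below are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class BoundingBox:
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

@dataclass
class AnnotatedQuery:
    query: str
    page_relevance: dict[int, str]  # layer 1: "not" / "partial" / "full" per page
    evidence_boxes: list[BoundingBox] = field(default_factory=list)  # layer 2
    reference_answer: str = ""  # layer 3: consensus answer

record = AnnotatedQuery(
    query="What was Q3 revenue growth?",
    page_relevance={12: "full", 13: "partial"},
    evidence_boxes=[BoundingBox(page=12, x0=0.1, y0=0.4, x1=0.6, y1=0.55)],
    reference_answer="Revenue grew 15% in Q3.",
)
```

Each layer supports a different metric: page relevance feeds NDCG@10, the boxes feed grounding F1, and the reference answer feeds generation accuracy.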

Visual vs Textual Retrieval

The first major finding: visual retrievers consistently outperform text-only approaches.

Visual Retrievers Outperform Textual Ones

NDCG@10 retrieval performance across 10 datasets

The best visual retriever (ColEmbed-3B-v2) achieves 59.8 NDCG@10 compared to 51.0 for the best textual retriever (Qwen3-8B). That is an 8.8 point gap.

What is NDCG@10?

Normalized Discounted Cumulative Gain at rank 10. It measures how well the top 10 retrieved results match the ground truth, with higher-ranked correct results counting more. A score of 100 means perfect ranking. Scores above 50 are generally considered good for document retrieval tasks.

Why visual retrievers win

Visual retrievers process page images directly. They see the layout, the charts, the table structures. Textual retrievers work from extracted text, which loses this structural information.

Think of a city planner looking at a satellite map versus a tourist reading a list of street names. The planner can instantly spot where the river bends, where the park lies, and how neighborhoods connect. The tourist has addresses but no sense of geography. Visual retrievers are the planner: they see the whole page and understand spatial relationships. Textual retrievers are the tourist: they have the words but have lost the layout.

Consider a query about "Q3 revenue trends." The textual retriever might find a paragraph mentioning revenue. The visual retriever sees the actual chart showing the trend line. It matches the query to visual evidence that the text extraction missed entirely.

This advantage holds across model sizes. A 3B visual retriever beats an 8B textual retriever. The modality matters more than the parameter count.

Late interaction beats dense embeddings

Within visual retrievers, late interaction models (ColEmbed, ColQwen) outperform single-vector dense embeddings (Nomic).

Model Type               | Best Score
Late interaction visual  | 59.8
Dense visual             | 49.0
Late interaction textual | 51.0
Dense textual            | 44.9
BM25 (keyword)           | 20.3

What is BM25?

BM25 is a classic keyword-based ranking algorithm that scores documents by term frequency and inverse document frequency. It has been the standard baseline for search engines for decades. The low score here (20.3) shows why neural retrievers have become essential for complex document queries.

What is late interaction?

Late interaction models keep separate embeddings for each token or image patch, then compute fine-grained similarity at query time. Dense embeddings compress everything into a single vector, losing detail. Late interaction is like keeping every book on individual shelves versus writing a one-page summary of the entire library.

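The late-interaction similarity (MaxSim, popularized by ColBERT-style models) can be sketched in pure Python: for each query vector, take the best-matching document vector, then sum those maxima.

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score between a multi-vector query
    and a multi-vector document: for each query token/patch vector,
    take its best dot product against any document vector, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query token vectors against three document patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(maxsim(query, doc))  # approximately 1.7 (best matches: 0.9 + 0.8)
```

Real implementations vectorize this with batched matrix operations, but the scoring logic is the same.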

The Reranker Surprise

The second major finding caught the researchers off guard: textual rerankers massively outperform visual ones.

Textual Rerankers Deliver Massive Gains

Adding a reranker to retrieval pipelines (NDCG@10)

Adding a textual reranker (zerank-2) to a textual retrieval pipeline improves NDCG@10 by +13.2 points. Adding a visual reranker (jina-reranker-m0) to a visual pipeline improves by only +0.2 points.

That is a 66x difference in reranker impact.

What this means for pipeline design

The optimal pipeline is not what you might expect:

Pipeline                             | Final Score
Visual retriever + textual reranker  | 63.6
Visual retriever + visual reranker   | 57.8
Textual retriever + textual reranker | 63.6
Textual retriever only               | 50.4

The best results come from combining visual retrieval with textual reranking. The visual retriever finds candidates that text extraction would miss. The textual reranker, working from extracted text, provides superior ranking.

Why are textual rerankers so much better?

The paper does not fully explain this gap, but likely factors include: textual rerankers have more training data, text provides clearer semantic signals for ranking, and visual rerankers may struggle with the diverse document layouts in the benchmark. This is an area for future research.

The practical implication

If you are building a RAG pipeline for multimodal documents:

  1. Use a visual retriever for the initial candidate selection
  2. Add a textual reranker for the final ranking
  3. Do not assume visual rerankers will help just because you have visual content

The +13.2 point gain from textual reranking is one of the largest improvements available in the RAG pipeline. It costs only reranker inference on your top-k candidates.
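The retrieve-then-rerank pattern itself is simple to express. Here is a generic two-stage sketch with placeholder scoring functions standing in for a real retriever and reranker:

```python
def two_stage_search(query, pages, retrieve_score, rerank_score, k=10):
    """Stage 1: cheap scoring over all pages; stage 2: expensive
    reranking over only the top-k candidates."""
    # Stage 1: score every page, keep the k best candidates.
    candidates = sorted(pages, key=lambda p: retrieve_score(query, p),
                        reverse=True)[:k]
    # Stage 2: reorder just those candidates with the stronger model.
    return sorted(candidates, key=lambda p: rerank_score(query, p),
                  reverse=True)

# Toy scorers (placeholders for real models): retrieval counts shared
# words, reranking rewards an exact phrase match.
pages = ["q3 revenue chart", "employee handbook", "revenue grew 15% in q3"]
r1 = lambda q, p: len(set(q.split()) & set(p.split()))
r2 = lambda q, p: float(q in p)
print(two_stage_search("q3 revenue", pages, r1, r2, k=2))
```

The economics follow from this structure: stage 1 runs over the whole corpus and must be fast, while stage 2 only touches k candidates, so a heavier cross-encoder reranker is affordable.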

Query Complexity Matters

Not all queries are created equal. ViDoRe V3 reveals a 30-point performance gap between simple and complex query types.

Simple Queries Are Easy, Complex Ones Are Hard

ColEmbed-3B-v2 NDCG@10 by query type

The complexity gradient

Query Type       | NDCG@10 | Complexity
Boolean          | 72      | Simple
Numerical        | 68      | Simple
Extractive       | 62      | Medium
Enumerative      | 58      | Medium
Compare-Contrast | 52      | Complex
Open-ended       | 48      | Complex
Multi-hop        | 42      | Complex

Simple queries ask for a specific fact that exists in one location. Complex queries require synthesis across multiple pieces of information or reasoning about relationships.

Why multi-hop is hardest

Multi-hop queries require finding information in one place, then using it to find more information elsewhere. "What was the revenue growth in the region that had the highest market share?" requires:

  1. Finding the market share data
  2. Identifying the highest region
  3. Finding revenue growth for that specific region

Current retrievers struggle because they match queries to individual pages. They do not naturally chain retrievals based on intermediate findings.
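A system that did chain retrievals might look like the following sketch, where each hop's extracted fact is substituted into the next sub-question. The decomposition, retrieval, and extraction functions are hypothetical placeholders:

```python
def multi_hop_retrieve(sub_questions, retrieve, extract_fact):
    """Hypothetical chained retrieval: answer sub-questions in order,
    substituting each intermediate fact into the next query."""
    fact = None
    for sub_q in sub_questions:
        query = sub_q.format(prev=fact) if fact else sub_q
        fact = extract_fact(query, retrieve(query))
    return fact

# Toy corpus for the example query above.
corpus = {
    "highest market share region": "EMEA had the highest market share",
    "revenue growth in EMEA": "EMEA revenue grew 12%",
}
retrieve = lambda q: corpus.get(q, "")
# Placeholder extractor: pull the region name on hop 1, pass through on hop 2.
extract_fact = lambda q, page: page.split()[0] if "share" in q else page

answer = multi_hop_retrieve(
    ["highest market share region", "revenue growth in {prev}"],
    retrieve, extract_fact,
)
print(answer)  # "EMEA revenue grew 12%"
```

Single-shot retrievers perform only the first hop; the benchmark's multi-hop scores quantify what that omission costs.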

Query format effects

Question-format queries consistently outperform instruction and keyword formats:

Format      | Average NDCG@10
Question    | 58
Instruction | 52
Keyword     | 48

This suggests retrieval models are better trained on question-answering data than on keyword search or task instructions. If your users tend toward keyword searches, expect lower performance than benchmarks suggest.

Visual Grounding Gap

The most striking finding: current VLMs cannot reliably point to where answers come from.

Visual Grounding: Models Lag Far Behind Humans

F1 score for bounding box localization

Human annotators agree with each other at F1 = 0.602 on bounding box localization. The best VLMs achieve F1 = 0.089 (Qwen3-VL-30B) and F1 = 0.065 (Gemini 3 Pro).

That is an 85% gap between human and model performance.

What is visual grounding?

Visual grounding means pointing to the specific location in a document that supports an answer. Instead of just saying "revenue grew 15%," a grounded system would highlight the exact cell in the table or region of the chart that shows this. This is critical for user trust and verification.
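Grounding F1 is typically computed by matching predicted boxes to ground-truth boxes above an intersection-over-union (IoU) threshold. This is a minimal sketch; the benchmark's exact matching protocol may differ:

```python
def iou(a, b):
    # Boxes as (x0, y0, x1, y1); intersection area over union area.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_f1(pred, gold, thresh=0.5):
    """F1 over greedy box matching: each gold box can be claimed by
    at most one prediction with IoU above the threshold."""
    if not pred or not gold:
        return 0.0
    matched, used = 0, set()
    for p in pred:
        for i, g in enumerate(gold):
            if i not in used and iou(p, g) >= thresh:
                matched += 1
                used.add(i)
                break
    if matched == 0:
        return 0.0
    precision, recall = matched / len(pred), matched / len(gold)
    return 2 * precision * recall / (precision + recall)

pred = [(0.1, 0.1, 0.5, 0.5)]
gold = [(0.1, 0.1, 0.5, 0.5), (0.6, 0.6, 0.9, 0.9)]
print(grounding_f1(pred, gold))  # finds one of two gold boxes
```

Note that missing boxes hurt recall directly, which is exactly the failure mode the page-level analysis below attributes to current VLMs.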

Where models fail

The page-level analysis reveals the bottleneck:

Outcome                        | Qwen3-VL | Gemini 3 Pro
Both model and human annotated | 17%      | 16%
Neither annotated              | 46%      | 49%
Model only                     | 10%      | 7%
Human only                     | 26%      | 27%

Models miss 26-27% of pages that humans annotate. The problem is recall, not precision. Models are not drawing wrong boxes; they are failing to draw boxes at all.

Why this matters

Visual grounding is not a nice-to-have. For enterprise RAG systems, users need to verify answers against source documents. If the system cannot point to where it found information, users must manually search the document to confirm accuracy.

The 85% gap means current VLMs cannot reliably provide this capability. Teams deploying RAG systems should not promise visual grounding until model capabilities improve significantly.

End-to-End Generation

The benchmark also evaluates final answer quality, not just retrieval. This reveals how retrieval choices affect downstream generation.

Context modality matters

With the same generator (Gemini 3 Pro), different context types yield different accuracy:

Context Type       | Hard Query Accuracy
Image (oracle)     | 64.7%
Text (oracle)      | 62.3%
Hybrid (oracle)    | 63.4%
Image (retrieved)  | 54.5%
Hybrid (retrieved) | 54.7%
Text (retrieved)   | 52.1%

Image context outperforms text context by 2-3 points on hard queries. The visual information that text extraction loses actually matters for answer generation.

What are "hard" queries?

The benchmark classifies queries as "easy" if any of six LLMs can answer them correctly without document context (from parametric knowledge). "Hard" queries require the document. About 51% of queries are hard. Performance on hard queries better reflects true RAG capability.
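The classification rule can be sketched with placeholder closed-book answerers. The exact-match check and the toy models are illustrative; the paper uses six LLMs and its own correctness judging:

```python
def is_hard(query: str, reference: str, answer_fns) -> bool:
    """A query is 'easy' if any model answers it correctly without
    document context (parametric knowledge alone); otherwise 'hard'.
    Placeholder correctness check: exact match against the reference."""
    return not any(fn(query) == reference for fn in answer_fns)

# Toy closed-book "models": one knows a famous fact, none knows
# the document-specific one.
models = [lambda q: "Paris" if "capital of France" in q else "unknown"]
print(is_hard("What is the capital of France?", "Paris", models))  # False
print(is_hard("What was Q3 revenue growth?", "15%", models))       # True
```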

The oracle gap

What is an oracle in retrieval?

An oracle is a hypothetical perfect retriever that always returns exactly the right documents. It represents the upper bound of what any retrieval system could achieve. Comparing real systems to the oracle shows how much room remains for improvement.

The best non-oracle pipeline achieves 54.7% on hard queries. The image oracle achieves 64.7%. That is a 10-point gap representing the ceiling for retrieval improvements.

Even with perfect retrieval (oracle), Gemini 3 Pro reaches only 64.7%. The remaining 35-point gap comes from generation limitations: the model has all the right context but still produces wrong answers.

Model comparison

Generator      | Hard Query Accuracy
GPT-5.2        | 54.1%
Gemini 3 Pro   | 54.5%
Gemini 3 Flash | 50.3%
Qwen3-VL-235B  | 51.0%

The differences between top models are smaller than the retrieval-to-oracle gap. Improving retrieval matters more than switching generators.

Practical Takeaways

Based on the benchmark findings, here are actionable recommendations for RAG pipeline builders:

1. Use visual retrievers

Even if you have good OCR and text extraction, visual retrievers capture layout and structural information that text misses. The 8.8-point advantage is too large to ignore.

Recommended models: ColEmbed-3B-v2, Jina-v4 (visual mode), ColNomic

2. Add textual reranking

The +13.2 point improvement from textual reranking is the single largest gain available. Add it to any pipeline, visual or textual.

Recommended rerankers: zerank-2, BGE-reranker-v2

3. Provide multimodal context to generators

For hard queries, hybrid context (text + images) outperforms text-only. Pass both extracted text and page images to your generator.

4. Set appropriate expectations for complex queries

Your system will handle Boolean and Numerical queries well. Multi-hop and Open-ended queries will struggle. Design your user experience around these limitations.

5. Do not promise visual grounding

Current VLMs cannot reliably point to source locations. Until the 85% gap closes, visual grounding should be treated as experimental, not production-ready.

Implementation Notes

Integrating with MTEB

ViDoRe V3 is integrated into the MTEB (Massive Text Embedding Benchmark) ecosystem. You can evaluate your retrieval models directly:

# Using MTEB; `model` is any MTEB-compatible retrieval model,
# and the task name is as registered in the MTEB ecosystem
from mteb import MTEB

benchmark = MTEB(tasks=["ViDoRe-v3"])
results = benchmark.run(model)

Accessing the benchmark

The public datasets (8 of 10) are available on HuggingFace:

from datasets import load_dataset

# One of the public corpora; see the vidore organization on
# HuggingFace for the exact dataset names
dataset = load_dataset("vidore/vidore-v3-finance-en")

Two datasets are held out as a private test set to prevent overfitting.

Retrieval pipeline example

A minimal visual retrieval pipeline:

from colpali_engine import ColEmbed  # class name follows the model's release

# Load visual retriever
retriever = ColEmbed.from_pretrained(
    "vidore/colembed-3b-v2"
)

# Index documents (one embedding per page image)
embeddings = retriever.encode_images(page_images)

# Query. NOTE: a plain dot product is a single-vector simplification;
# late-interaction models score with MaxSim over token/patch vectors
# (see the library's scoring utilities for the exact call)
query_emb = retriever.encode_queries([query])
scores = (query_emb @ embeddings.T).squeeze()
top_k = scores.argsort()[-10:][::-1]

Reranking stage

Add textual reranking to top-k candidates:

from rerankers import Reranker

reranker = Reranker("zerank-2")

# Extract text from the top-k candidate pages
candidates = [extract_text(pages[i]) for i in top_k]

# Rerank; results come back best-first, with each candidate's
# original position available as doc_id
reranked = reranker.rank(query=query, docs=candidates)
final_order = [top_k[r.doc_id] for r in reranked.results]

Limitations

Domain coverage

The benchmark covers 10 professional domains, but enterprise documents vary widely. Performance on your specific domain may differ from benchmark averages.

Query distribution

Extractive queries dominate due to annotation practicality. Multi-hop queries, which are often most important for users, are underrepresented.

Language scope

While 6 languages are supported, the primary documents are English and French. Performance on other languages relies on query translation, which may introduce artifacts.

Private test set

Two datasets are held private to prevent overfitting. This is good for benchmark integrity but limits full analysis of model behavior across all domains.

Visual grounding subjectivity

Bounding box annotation has inherent subjectivity (human F1 = 0.602, not 1.0). The "ground truth" is approximate, which complicates model evaluation.


Original paper: arXiv · PDF · HTML

Benchmark: HuggingFace · MTEB Leaderboard

Authors: Antonio Loison, Quentin Mace, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Celine Hudelot, Gautier Viaud (Illuin Technology, NVIDIA, CentraleSupelec)

Authors

Antonio Loison (Illuin Technology), Quentin Mace (Illuin Technology, CentraleSupelec), Antoine Edy (Illuin Technology), Victor Xing (Illuin Technology), Tom Balough (NVIDIA), Gabriel Moreira (NVIDIA), Bo Liu (NVIDIA), Manuel Faysse (CentraleSupelec, Paris-Saclay), Celine Hudelot (CentraleSupelec, Paris-Saclay), Gautier Viaud (Illuin Technology)

Cite this paper

Antonio Loison, Quentin Mace, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Celine Hudelot, Gautier Viaud (2026). ViDoRe V3: The Benchmark That Exposes What Your RAG Pipeline Cannot See. arXiv 2026.
