- The Problem. Traditional RAG only sees text. Charts, tables, diagrams, and equations (40% of enterprise documents) are invisible. When you ask about "Figure 3," RAG retrieves the caption but never looks at the actual figure.
- The Solution. Dual-graph architecture: one graph connects text to images to tables (cross-modal), another preserves document hierarchy and entity relationships (textual semantics). Hybrid retrieval combines vector similarity with graph traversal.
- The Results. 13+ percentage point accuracy gain on 100+ page documents. The advantage grows with document complexity because graph structure preserves relationships that chunking destroys. Open-source with 11k+ GitHub stars.
Research Overview
Most RAG systems treat documents as if they were plain text files. They chunk paragraphs, embed them as vectors, and retrieve based on semantic similarity. This works reasonably well for text-heavy content. But real documents are different.
A financial report contains charts showing revenue trends. A research paper includes diagrams explaining system architecture. A legal contract has tables with payment schedules. When you ask a question that requires understanding these visual elements, text-only RAG fails silently. It retrieves text that mentions the chart without understanding what the chart actually shows.
Retrieval-Augmented Generation (RAG) gives AI systems access to external knowledge. Before answering a question, the system searches a document collection for relevant passages and includes them in the prompt. This allows AI to provide accurate, up-to-date answers based on specific documents rather than relying solely on training data.
RAG-Anything addresses this gap with a unified framework that treats text, images, tables, and equations as first-class knowledge entities. The key innovation is a dual-graph architecture that captures relationships both within and across modalities, combined with a hybrid retrieval system that navigates this structure.
Key results
| Benchmark | Document Length | RAG-Anything | Best Baseline | Improvement |
|---|---|---|---|---|
| DocBench | 101-200 pages | 68.2% | 54.6% | +13.6 pp |
| DocBench | 200+ pages | 68.8% | 55.0% | +13.8 pp |
| DocBench | Overall | 63.4% | 60.0% (chunk-only) | +3.4 pp |
pp = percentage points (absolute difference, not relative)
The performance gap widens as documents get longer. Traditional chunking loses track of relationships between elements spread across many pages. The graph structure preserves these connections.
The Multimodal Gap
Current RAG frameworks face a fundamental mismatch between their capabilities and real-world document structure.
Most professional documents combine multiple content types. A product specification has images showing physical dimensions. A medical record includes diagnostic scans. A technical manual contains flowcharts and schematics. Each modality carries information that text alone cannot capture.
Consider what happens when you ask: "What is the company's growth trajectory?"
A text-only system retrieves a paragraph that says "See Chart 3 for quarterly performance details." It found text that mentions the chart, but it cannot actually look at the chart. RAG-Anything processes the line graph itself, sees the trend pointing upward at a steep angle, and answers: "Revenue grew 47% quarter-over-quarter, with acceleration in Q3." The difference is between finding a reference to information versus understanding the information itself.
RAG-Anything solves this by:
- Parsing documents holistically. Images, tables, and equations are extracted alongside text, each processed by specialized analyzers.
- Building knowledge graphs. Entities from all modalities become nodes. A chart becomes connected to the text that references it, the data points it contains, and the conclusions drawn from it.
- Retrieving across modalities. Queries can match text, images, or relationships between them. The system returns coherent chunks that include all relevant content types.
Why existing approaches fall short
| Approach | Limitation |
|---|---|
| Text-only RAG | Ignores visual content entirely |
| Image captioning | Loses structured data from tables and charts |
| Separate modality pipelines | Misses cross-modal relationships |
| Simple multimodal embeddings | Cannot capture complex document structure |
The problem compounds with document length. A 200-page technical manual might reference the same diagram dozens of times across different sections. Traditional chunking treats each reference independently, losing the unified understanding that a human reader builds.
Architecture
RAG-Anything operates through four stages: document parsing, content understanding, knowledge graph construction, and modality-aware retrieval.
[Figure: RAG-Anything pipeline architecture. Four stages transform multimodal documents into queryable knowledge.]
Documents enter the system and get decomposed into constituent elements. Each element (text block, image, table, equation) is processed by a specialized analyzer. The results feed into a knowledge graph that captures both the content and its relationships. Queries traverse this graph to retrieve coherent, multimodal responses.
Stage 1: Document parsing
The framework uses MinerU, an open-source document extraction tool, for structure parsing. This handles PDFs, Office documents, and images with high fidelity. The parser identifies:
- Text blocks with their hierarchical position (section, subsection, paragraph)
- Images with bounding boxes and associated captions
- Tables with row/column structure preserved
- Mathematical equations in LaTeX format (the standard markup for mathematical notation)
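To make the parser output concrete, here is a minimal sketch of the kind of structured elements Stage 1 produces. The class and field names (`bbox`, `section_path`, and so on) are illustrative assumptions, not MinerU's actual schema.

```python
# Illustrative element types for parsed output; field names are assumptions,
# not MinerU's actual schema.
from dataclasses import dataclass, field

@dataclass
class ParsedElement:
    kind: str                                 # "text" | "image" | "table" | "equation"
    page: int                                 # 1-based page number
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1) on the page

@dataclass
class TextBlock(ParsedElement):
    content: str = ""
    section_path: list[str] = field(default_factory=list)   # e.g. ["3 Results", "3.1 DocBench"]

@dataclass
class ImageBlock(ParsedElement):
    file_path: str = ""
    caption: str = ""

@dataclass
class TableBlock(ParsedElement):
    rows: list[list[str]] = field(default_factory=list)     # row/column structure preserved

@dataclass
class EquationBlock(ParsedElement):
    latex: str = ""                                          # equation kept in LaTeX form
```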
Stage 2: Content understanding
Each content type routes to a specialized pipeline:
Text processing: Standard NLP extraction for entities, relationships, and semantic content.
Image analysis: Vision models generate descriptions, identify objects, and extract any embedded text or data.
Table interpretation: Statistical algorithms identify patterns, trends, and structural relationships between cells.
Equation parsing: Mathematical expressions map to conceptual meanings and connect to related textual explanations.
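A hedged sketch of that routing step: one analyzer stub per content type, dispatched on the element's kind. The stubs are placeholders, not RAG-Anything's actual pipelines.

```python
# Dispatch each parsed element to a modality-specific analyzer (placeholder stubs).
def analyze_text(el):     return {"entities": [], "relations": []}        # NLP extraction
def analyze_image(el):    return {"description": "vision-model caption"}  # VLM description
def analyze_table(el):    return {"summary": "trends across cells"}       # table patterns
def analyze_equation(el): return {"meaning": "concept linked to text"}    # LaTeX -> meaning

ANALYZERS = {
    "text": analyze_text,
    "image": analyze_image,
    "table": analyze_table,
    "equation": analyze_equation,
}

def analyze(element):
    # element.kind matches the parsed-element sketch from Stage 1
    return ANALYZERS[element.kind](element)
```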
Stage 3: Knowledge graph construction
This is where RAG-Anything diverges from standard approaches. Instead of isolated chunks, the system builds a graph where:
- Nodes represent entities from any modality (concepts, figures, table cells, equations)
- Edges capture relationships (references, contains, explains, derives-from)
- Weights indicate relationship strength based on semantic proximity
The dual-graph structure maintains two complementary views:
- Cross-modal graph: Connects entities across different content types
- Textual semantics graph: Preserves fine-grained relationships within text
Stage 4: Modality-aware retrieval
Queries activate both retrieval mechanisms:
- Vector similarity: Finds semantically related content across the corpus
- Graph traversal: Follows relationship edges to discover connected context
The results merge based on content-type relevance, ensuring that retrieved chunks maintain relational coherence.
Dual-Graph Construction
The knowledge graph is the core innovation. RAG-Anything builds two complementary structures that capture different aspects of document semantics.
[Figure: Dual-graph architecture. Two complementary graphs for cross-modal and textual relationships.]
Think of a document like a city. Traditional RAG is like having a list of addresses. You can find individual buildings, but you don't know how to get between them. RAG-Anything is like a GPS map that knows the roads, bridges, and transit lines connecting residential houses (text paragraphs) to commercial buildings (charts and tables). When you ask for directions, the GPS doesn't just give you one address. It shows you the route through connected streets.
A single graph struggles to balance granularity. Fine-grained textual relationships (word-level semantics) operate at a different scale than cross-modal relationships (figure-to-section connections). The dual structure lets each graph optimize for its specific purpose.
Cross-modal relationship graph
This graph connects entities across content types. When a paragraph references "Figure 3," an edge links the text node to the figure node. The edge carries metadata:
- Relationship type (references, explains, contradicts, supports)
- Confidence score from the extraction model
- Context window (surrounding text that establishes the relationship)
These edges enable queries like "What does Figure 3 show?" to retrieve both the figure and the explanatory text that accompanies it.
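As a sketch, a cross-modal edge can be represented as a small record carrying exactly that metadata; the field names here are assumptions, not the framework's schema.

```python
# Sketch of the metadata carried by one cross-modal edge (field names are assumptions).
from dataclasses import dataclass

@dataclass
class CrossModalEdge:
    source_id: str      # e.g. the text chunk that says "see Figure 3"
    target_id: str      # e.g. the Figure 3 image node
    relation: str       # "references" | "explains" | "contradicts" | "supports"
    confidence: float   # extraction-model confidence in [0, 1]
    context: str        # surrounding text that establishes the relationship

edge = CrossModalEdge(
    source_id="chunk_042",
    target_id="figure_3",
    relation="references",
    confidence=0.92,
    context="As shown in Figure 3, revenue accelerated sharply in Q3.",
)
```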
Textual semantics graph
Within text, the system maintains traditional knowledge graph structures:
- Entity nodes (people, organizations, concepts, quantities)
- Relationship edges (employs, located-in, causes, correlates-with)
- Hierarchical edges (section contains subsection contains paragraph)
This preserves the fine-grained reasoning capabilities of text-focused systems while adding multimodal awareness.
Graph construction process
- Entity extraction: Each content analyzer identifies entities in its modality
- Relationship detection: Cross-references are parsed from text; spatial proximity identifies visual relationships
- Edge weighting: Semantic similarity and contextual relevance determine edge strengths
- Graph merging: Modal-specific subgraphs combine into unified structures
The construction runs incrementally as documents are ingested, allowing the system to scale to large corpora.
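A minimal sketch of the four steps using networkx, with two separate graph objects for the dual structure. Node and edge attributes are illustrative assumptions, not RAG-Anything's actual schema.

```python
# Incremental dual-graph construction sketch (attribute names are assumptions).
import networkx as nx

cross_modal = nx.MultiDiGraph()   # entities across content types
textual = nx.MultiDiGraph()       # fine-grained relationships within text

# 1. Entity extraction: analyzed elements become nodes
cross_modal.add_node("chunk_042", kind="text", page=12,
                     content="As shown in Figure 3, revenue accelerated in Q3.")
cross_modal.add_node("figure_3", kind="image", page=12,
                     description="Line chart of quarterly revenue; steepest growth in Q3.")

# 2. Relationship detection + 3. edge weighting: an explicit reference becomes a weighted edge
cross_modal.add_edge("chunk_042", "figure_3", relation="references", weight=0.92)

# Textual semantics graph: entities and relations extracted from the same text block
textual.add_edge("revenue", "Q3 2024", relation="accelerated-in")

# 4. Graph merging: the two views share node identifiers, so retrieval can hop between them
cross_modal.add_edge("chunk_042", "revenue", relation="mentions", weight=0.80)
```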
Hybrid Retrieval
Retrieval combines vector similarity search with graph traversal. Neither approach alone captures the full picture.
AI systems turn text into lists of numbers called "embeddings." Each number represents some aspect of the meaning. Similar meanings produce similar numbers. When you search, the system compares your query's numbers to every document's numbers and returns the closest matches. It's like converting words into GPS coordinates, then finding locations near your search point.
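A toy numpy example of that idea: embeddings are just vectors, and "nearby" means high cosine similarity.

```python
# Toy example: rank documents by cosine similarity of their embeddings to the query.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.9, 0.1, 0.3])   # embedding of the query (3-d for illustration)
doc_vecs = {
    "revenue paragraph": np.array([0.8, 0.2, 0.4]),
    "legal boilerplate": np.array([0.1, 0.9, 0.2]),
}

ranked = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
print(ranked)   # ['revenue paragraph', 'legal boilerplate']
```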
Vector search finds content with similar meaning to the query. Graph traversal finds content connected to relevant nodes through relationships. A query about "Q3 revenue trends" might vector-match a paragraph mentioning revenue, then graph-traverse to find the associated chart that visualizes the actual numbers.
Vector-graph fusion
The retrieval algorithm:
- Encode the query into the same embedding space as document content
- Vector search identifies top-k semantically similar nodes
- Graph expansion follows edges from matched nodes to discover related content
- Modality scoring adjusts rankings based on content-type relevance to the query
- Coherence filtering ensures returned chunks maintain relational integrity
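A hedged sketch of how the fused score might be computed after graph expansion; the 70/30 weighting is an assumption, not the paper's tuned values.

```python
# Fuse vector-similarity scores with graph-expansion scores (weights are assumptions).
def fuse(vector_hits, graph_hits, alpha=0.7):
    """Both arguments map node_id -> score in [0, 1]; returns combined scores."""
    nodes = set(vector_hits) | set(graph_hits)
    return {
        n: alpha * vector_hits.get(n, 0.0) + (1 - alpha) * graph_hits.get(n, 0.0)
        for n in nodes
    }

scores = fuse({"chunk_042": 0.91}, {"chunk_042": 0.60, "figure_3": 0.80})
print(sorted(scores, key=scores.get, reverse=True))   # ['chunk_042', 'figure_3']
```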
Modality-aware ranking
Not all content types are equally relevant for every query. A question asking "describe the methodology" likely wants text. A question asking "show me the results" might prioritize figures and tables.
The ranking function learns these preferences from training data, adjusting scores based on:
- Query phrasing (action verbs like "show" or "visualize" boost visual content)
- Domain patterns (financial queries often need table data)
- Relationship signals (content connected to highly-scored nodes gets boosted)
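The paper describes a learned ranker; as a rule-based stand-in, here is a hedged sketch of the query-phrasing signal (keyword lists and boost values are assumptions).

```python
# Rule-based stand-in for modality-aware ranking (the paper learns this from data).
VISUAL_CUES = {"show", "plot", "chart", "visualize", "figure", "table"}

def modality_boost(query: str, node_kind: str) -> float:
    wants_visual = any(word in VISUAL_CUES for word in query.lower().split())
    if wants_visual and node_kind in {"image", "table"}:
        return 1.2   # up-weight visual content for visual-intent queries
    return 1.0

print(modality_boost("show me the results", "image"))       # 1.2
print(modality_boost("describe the methodology", "image"))  # 1.0
```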
Handling long documents
For documents exceeding 100 pages, the graph structure provides significant advantages:
| Challenge | Chunking approach | Graph approach |
|---|---|---|
| Distant references | Lost across chunks | Preserved as edges |
| Repeated concepts | Duplicated embeddings | Single node, multiple edges |
| Cross-section relationships | Missed entirely | Explicit connections |
| Context windows | Fixed size limits | Variable based on relevance |
The 13+ point improvement on long documents stems directly from these structural advantages.
Benchmark Results
RAG-Anything was evaluated on DocBench and MMLongBench, two benchmarks designed to test multimodal document understanding.
DocBench performance
DocBench covers five domains: academia, finance, government, legal, and news. Documents mix text and images, and the benchmark includes unanswerable queries (testing the system's ability to recognize when information is missing).
| Configuration | Accuracy |
|---|---|
| RAG-Anything (full) | 63.4% |
| Chunk-only variant | 60.0% |
| GPT-4o-mini baseline | ~55% |
| LightRAG | ~54% |
The chunk-only ablation confirms that graph structure is essential. Removing the knowledge graph and using traditional chunking drops accuracy by 3.4 points.
Performance by document length
The gap between RAG-Anything and baselines grows with document length:
[Figure: DocBench performance by document length. The accuracy gap grows as documents get longer.]
The chart reveals a clear trend: while both systems start close together on short documents, traditional RAG "falls off a cliff" once documents exceed 100 pages. RAG-Anything maintains its accuracy because the graph structure preserves connections across hundreds of pages. Short documents have less need for long-range relationship tracking, which explains why the gap starts small.
Component testing (ablation studies)
To understand which parts of the system matter most, the authors removed components one at a time and measured the impact:
| Variant | Change | Accuracy impact |
|---|---|---|
| No cross-modal graph | Remove cross-modality edges | -2.1% |
| No textual graph | Remove fine-grained text relationships | -1.8% |
| Vector-only retrieval | Disable graph traversal | -2.5% |
| Graph-only retrieval | Disable vector search | -1.9% |
Both graphs contribute, and both retrieval mechanisms are necessary. The hybrid approach outperforms either component in isolation.
Practical Applications
RAG-Anything addresses real pain points in document-heavy workflows.
Enterprise search
Corporate knowledge bases contain mixed content: slide decks with charts, reports with tables, specifications with diagrams. Traditional search finds documents but cannot answer questions that require understanding visual content.
With RAG-Anything:
- "What were our top 3 markets last quarter?" retrieves the relevant chart and extracts the data
- "Compare the architecture in proposal A vs proposal B" finds and analyzes diagrams from both documents
- "Summarize the budget breakdown" locates tables and synthesizes the numbers
Academic research
Research papers are inherently multimodal. Methods sections reference figures. Results depend on tables. Equations define the mathematical framework.
Researchers can query:
- "How does this paper's approach differ from [citation]?" with system understanding of architectural diagrams
- "What datasets were used and what were the results?" pulling both methodology text and results tables
- "Explain the loss function" connecting equations to their textual explanations
Financial analysis
Financial documents combine narrative with data-heavy exhibits. Analyst reports include charts. SEC filings have extensive tables.
The system handles:
- "What's driving the margin compression mentioned on page 47?" following references to supporting exhibits
- "Compare revenue growth across segments" finding and interpreting the relevant breakdowns
- "Summarize risk factors related to supply chain" aggregating text and any supporting visualizations
Legal document review
Contracts and legal filings often contain schedules, exhibits, and referenced documents. Understanding requires following these cross-references.
Applications include:
- "What are the payment terms?" finding the relevant schedule and interpreting the table structure
- "Summarize all indemnification clauses" aggregating provisions across document sections
- "What exhibits are referenced in Section 4.2?" following explicit cross-references
Implementation blueprint
Research papers describe what works. They rarely explain how to build it. Here is the practical stack and workflow for implementing RAG-Anything in production.
The tech stack
| Layer | Options | Why |
|---|---|---|
| Document Parser | MinerU, Unstructured.io | Must return bounding boxes for images, not just extracted text. Position data is critical for cross-modal linking. |
| Graph Database | Neo4j, Amazon Neptune | Need a database that handles structured nodes and edges and can integrate with vector indices. Neo4j has native vector search as of 2024. |
| Vector Database | Milvus, Qdrant, Pinecone | For the dense retrieval component. Can be separate or integrated with the graph DB. |
| Vision Model | GPT-4o, Gemini 1.5 Pro, Claude | Converts images to text descriptions. Critical for making visual content searchable. |
| Orchestration | LlamaIndex, LangChain | LlamaIndex has mature PropertyGraph and GraphRAG implementations. Handles the retrieval fusion logic. |
Most document parsers strip images and return plain text. You need one that preserves spatial coordinates. When the parser tells you "Figure 1 is at coordinates (x: 450, y: 1200)" and "the text 'see Figure 1' is at (x: 100, y: 1180)", you can infer they're related. Without coordinates, you're guessing.
The build workflow
Step 1: Document ingestion
```
PDF → Parser → Structured Output
  ├── Text chunks (with coordinates)
  ├── Images (extracted as files)
  ├── Tables (as structured data)
  └── Equations (as LaTeX)
```
Run each document through MinerU or Unstructured. The output should include:
- Text blocks with page number and bounding box
- Images saved as separate files with their coordinates
- Tables preserved as structured data, not flattened text
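For reference, a hedged sketch of this step using the unstructured library; parameter names and metadata fields differ between versions, so treat this as a starting point rather than the exact incantation (MinerU works similarly but with its own API).

```python
# Hedged sketch: layout-aware PDF parsing with unstructured (API details vary by version).
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",            # layout-aware parsing so elements carry coordinates
    infer_table_structure=True,   # keep tables structured instead of flattened text
)

for el in elements:
    coords = getattr(el.metadata, "coordinates", None)   # bounding-box info when available
    print(el.category, el.metadata.page_number, bool(coords), el.text[:60])
```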
Step 2: Visual content processing
Pass every extracted image to a vision model:
```
Image → Vision Model → Text Description

"A line chart showing quarterly revenue from Q1 2023
 to Q4 2024. Revenue increases from $2.3M to $4.1M
 with steepest growth in Q3 2024."
```
This description becomes searchable. When someone asks about "revenue growth," the vector search can match this text even though the original was an image.
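A hedged sketch of this step with the OpenAI Python SDK; any vision-capable model works, and the model name and prompt wording are assumptions.

```python
# Convert an extracted image into a searchable text description (model/prompt are assumptions).
import base64
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

with open("figure_3.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this figure for search: chart type, axes, units, and the key trend."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
description = response.choices[0].message.content   # stored on the Image node, embedded for search
```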
Step 3: Graph construction (the hard part)
Create nodes for each content type:
```
(:TextChunk {id, content, embedding, page, coordinates})
(:Image {id, description, embedding, file_path, page, coordinates})
(:Table {id, structured_data, summary, embedding, page})
(:Document {id, title, metadata})
```
The critical step is creating relationship edges. Use two signals:
- Explicit references. Parse text for patterns like "Figure 1", "Table 3", "see chart below". Create edges: `(:TextChunk)-[:REFERENCES]->(:Image)`
- Spatial proximity. If a text chunk and image are on the same page within 200 pixels, they're likely related. Create edges: `(:TextChunk)-[:NEAR]->(:Image)`
```cypher
// Cypher example for Neo4j
MATCH (t:TextChunk), (i:Image)
WHERE t.page = i.page
  AND abs(t.y_coord - i.y_coord) < 200
CREATE (t)-[:NEAR]->(i)
```
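And a hedged sketch of the explicit-reference signal: regex-match figure and table mentions in each chunk, then create REFERENCES edges with the neo4j Python driver. The `ref_number` property and connection details are assumptions about the schema sketched above.

```python
# Create REFERENCES edges from "Figure N" / "Table N" mentions (schema details are assumptions).
import re
from neo4j import GraphDatabase

REF_PATTERN = re.compile(r"\b(Figure|Table)\s+(\d+)\b", re.IGNORECASE)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_explicit_references(chunk_id: str, chunk_text: str) -> None:
    with driver.session() as session:
        for kind, number in REF_PATTERN.findall(chunk_text):
            label = "Image" if kind.lower() == "figure" else "Table"
            session.run(
                f"""
                MATCH (t:TextChunk {{id: $chunk_id}})
                MATCH (x:{label} {{ref_number: $number}})
                MERGE (t)-[:REFERENCES]->(x)
                """,
                chunk_id=chunk_id,
                number=int(number),
            )

link_explicit_references("chunk_042", "Revenue trends are shown in Figure 3 and Table 2.")
```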
Step 4: Hybrid retrieval
When a query arrives:
- Vector search. Find top-k text chunks by embedding similarity
- Graph expansion. For each matched chunk, traverse edges to find connected images, tables, equations
- Relevance scoring. Rank the expanded set by combined vector + graph signal
- Context assembly. Build the prompt with text chunks AND their connected visual descriptions
```python
# Pseudocode
matches = vector_search(query, top_k=5)
expanded = []
for chunk in matches:
    connected = graph.traverse(chunk, edge_types=['REFERENCES', 'NEAR'])
    expanded.extend(connected)
context = deduplicate_and_rank(matches + expanded)
response = llm.generate(query, context)
```

Where teams get stuck
Problem 1: Parser quality. Most parsers mangle tables and miss image coordinates. Budget time for parser evaluation.
Problem 2: Edge creation heuristics. The "200 pixels" threshold is illustrative. The actual value depends on your parser's output resolution (DPI) and document type. Legal documents have different layouts than research papers. Expect to tune this.
Problem 3: Graph query performance. Naive traversal gets slow. Index your edges. Limit traversal depth. Cache frequent patterns.
Problem 4: Context window limits. Pulling in every connected image description bloats the context. Implement relevance filtering on the expanded set.
Don't build the full dual-graph architecture on day one. Start with explicit reference parsing ("Figure 1", "Table 2"). Add spatial proximity once that works. Add the textual semantics graph last. Each layer adds value independently.
Business Implications
For RAG product teams
Architecture investment priority. If your documents contain any non-text content, single-modality RAG will systematically fail on queries that require understanding that content. RAG-Anything's approach (or similar multimodal frameworks) becomes necessary, not optional.
Quality ceiling recognition. Text-only systems hit a hard ceiling on document understanding. No amount of prompt engineering or model upgrades fixes the fundamental issue of missing visual information.
Evaluation methodology. Standard RAG benchmarks often use text-heavy datasets. Test on actual customer documents that include charts, tables, and images to understand real-world performance.
For enterprise search teams
Content audit. Survey your document corpus. What percentage contains meaningful non-text content? That percentage represents queries where current text-only systems will fail.
User expectation management. Users expect AI systems to "see" what they see in documents. When the system cannot answer questions about visual content, trust erodes.
Incremental adoption path. RAG-Anything is open source and can run alongside existing systems. Pilot on document types with high multimodal content before full rollout.
For document management vendors
Feature differentiation. Multimodal RAG moves from nice-to-have to table stakes as users encounter the limitations of text-only systems.
Integration complexity. The framework requires vision models, knowledge graph infrastructure, and hybrid retrieval. This is significantly more complex than text-only RAG pipelines.
Compute requirements. Processing images and building knowledge graphs adds computational overhead. Pricing models may need adjustment.
For AI infrastructure teams
Pipeline orchestration. RAG-Anything coordinates multiple specialized models (text, vision, graph). Infrastructure must support this heterogeneous workload.
Storage considerations. Knowledge graphs require graph database infrastructure. Vector stores alone are insufficient.
Latency budgets. Graph traversal adds retrieval time compared to pure vector search. Benchmark actual performance against user expectations.
Limitations
Computational overhead
Processing visual content and building knowledge graphs requires more compute than text-only approaches. For high-volume applications, this overhead may be significant.
The framework requires:
- Vision models for image analysis
- Graph database for relationship storage
- Multiple embedding models for different content types
Domain-specific tuning
The content analyzers work best when tuned for specific document types. A financial analyst system needs different table interpretation than a medical records system.
Out-of-the-box performance may require customization for specialized domains.
Parsing quality dependency
The system relies on its document parser for structure extraction. Poorly formatted PDFs, scanned documents with OCR errors, or unusual layouts can degrade downstream performance.
Garbage in, garbage out still applies. Document quality matters.
Graph construction latency
Building the knowledge graph takes time during document ingestion. For real-time applications where documents arrive and must be immediately queryable, this latency may be problematic.
The authors don't provide detailed latency benchmarks, which would be valuable for production planning.
Complex query handling
While the system handles straightforward multimodal queries well, highly complex reasoning chains (multi-hop questions requiring synthesis across many document sections) remain challenging.
The graph structure helps but doesn't fully solve multi-step reasoning.
Original paper: arXiv ・ PDF ・ HTML
Code: GitHub
Authors: Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang (Data Intelligence Lab, The University of Hong Kong)
Cite this paper
Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang (2025). RAG-Anything: Unified Multimodal Retrieval for Real-World Documents. arXiv 2025.