- The Problem. Traditional RAG only sees text. Charts, tables, diagrams, and equations (40% of enterprise documents) are invisible. When you ask about "Figure 3," RAG retrieves the caption but never looks at the actual figure.
- The Solution. Dual-graph architecture: one graph connects text to images to tables (cross-modal), another preserves document hierarchy and entity relationships (textual semantics). Hybrid retrieval combines vector similarity with graph traversal.
- The Results. 13+ percentage point accuracy gain on 100+ page documents. The advantage grows with document complexity because graph structure preserves relationships that chunking destroys. Open-source with 11k+ GitHub stars.
Research Overview
Most RAG systems treat documents as if they were plain text files. They chunk paragraphs, embed them as vectors, and retrieve based on semantic similarity. This works reasonably well for text-heavy content. But real documents are different.
A financial report contains charts showing revenue trends. A research paper includes diagrams explaining system architecture. A legal contract has tables with payment schedules. When you ask a question that requires understanding these visual elements, text-only RAG fails silently. It retrieves text that mentions the chart without understanding what the chart actually shows.
Retrieval-Augmented Generation (RAG) gives AI systems access to external knowledge. Before answering a question, the system searches a document collection for relevant passages and includes them in the prompt. This allows AI to provide accurate, up-to-date answers based on specific documents rather than relying solely on training data.
RAG-Anything addresses this gap with a unified framework that treats text, images, tables, and equations as first-class knowledge entities. The key innovation is a dual-graph architecture that captures relationships both within and across modalities, combined with a hybrid retrieval system that navigates this structure.
Key results
| Benchmark | Document Length | RAG-Anything | Best Baseline | Improvement |
|---|---|---|---|---|
| DocBench | 101-200 pages | 68.2% | 54.6% | +13.6 pp |
| DocBench | 200+ pages | 68.8% | 55.0% | +13.8 pp |
| DocBench | Overall | 63.4% | 60.0% (chunk-only) | +3.4 pp |
pp = percentage points (absolute difference, not relative)
The performance gap widens as documents get longer. Traditional chunking loses track of relationships between elements spread across many pages. The graph structure preserves these connections.
The Multimodal Gap
Current RAG frameworks face a fundamental mismatch between their capabilities and real-world document structure.
Most professional documents combine multiple content types. A product specification has images showing physical dimensions. A medical record includes diagnostic scans. A technical manual contains flowcharts and schematics. Each modality carries information that text alone cannot capture.
Consider what happens when you ask: "What is the company's growth trajectory?"
A text-only system retrieves a paragraph that says "See Chart 3 for quarterly performance details." It found text that mentions the chart, but it cannot actually look at the chart. RAG-Anything processes the line graph itself, sees the trend pointing upward at a steep angle, and answers: "Revenue grew 47% quarter-over-quarter, with acceleration in Q3." The difference is between finding a reference to information versus understanding the information itself.
RAG-Anything solves this by:
- Parsing documents holistically. Images, tables, and equations are extracted alongside text, each processed by specialized analyzers.
- Building knowledge graphs. Entities from all modalities become nodes. A chart becomes connected to the text that references it, the data points it contains, and the conclusions drawn from it.
- Retrieving across modalities. Queries can match text, images, or relationships between them. The system returns coherent chunks that include all relevant content types.
Why existing approaches fall short
| Approach | Limitation |
|---|---|
| Text-only RAG | Ignores visual content entirely |
| Image captioning | Loses structured data from tables and charts |
| Separate modality pipelines | Misses cross-modal relationships |
| Simple multimodal embeddings | Cannot capture complex document structure |
The problem compounds with document length. A 200-page technical manual might reference the same diagram dozens of times across different sections. Traditional chunking treats each reference independently, losing the unified understanding that a human reader builds.
Architecture
RAG-Anything operates through four stages: document parsing, content understanding, knowledge graph construction, and modality-aware retrieval.
[Figure: RAG-Anything pipeline architecture. Four stages transform multimodal documents into queryable knowledge.]
Documents enter the system and get decomposed into constituent elements. Each element (text block, image, table, equation) is processed by a specialized analyzer. The results feed into a knowledge graph that captures both the content and its relationships. Queries traverse this graph to retrieve coherent, multimodal responses.
Stage 1: Document parsing
The framework uses MinerU, an open-source document extraction tool, for structure parsing. This handles PDFs, Office documents, and images with high fidelity. The parser identifies:
- Text blocks with their hierarchical position (section, subsection, paragraph)
- Images with bounding boxes and associated captions
- Tables with row/column structure preserved
- Mathematical equations in LaTeX format (the standard markup for mathematical notation)
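To make the parser output concrete, here is a minimal sketch of the kind of structured elements Stage 1 produces. The class and field names (`bbox`, `section_path`, and so on) are illustrative assumptions, not MinerU's actual schema.

```python
# Illustrative element types for parsed output; field names are assumptions,
# not MinerU's actual schema.
from dataclasses import dataclass, field

@dataclass
class ParsedElement:
    kind: str                                 # "text" | "image" | "table" | "equation"
    page: int                                 # 1-based page number
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1) on the page

@dataclass
class TextBlock(ParsedElement):
    content: str = ""
    section_path: list[str] = field(default_factory=list)   # e.g. ["3 Results", "3.1 DocBench"]

@dataclass
class ImageBlock(ParsedElement):
    file_path: str = ""
    caption: str = ""

@dataclass
class TableBlock(ParsedElement):
    rows: list[list[str]] = field(default_factory=list)     # row/column structure preserved

@dataclass
class EquationBlock(ParsedElement):
    latex: str = ""                                          # equation kept in LaTeX form
```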
Stage 2: Content understanding
Each content type routes to a specialized pipeline:
Text processing: Standard NLP extraction for entities, relationships, and semantic content.
Image analysis: Vision models generate descriptions, identify objects, and extract any embedded text or data.
Table interpretation: Statistical algorithms identify patterns, trends, and structural relationships between cells.
Equation parsing: Mathematical expressions map to conceptual meanings and connect to related textual explanations.
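A hedged sketch of that routing step: one analyzer stub per content type, dispatched on the element's kind. The stubs are placeholders, not RAG-Anything's actual pipelines.

```python
# Dispatch each parsed element to a modality-specific analyzer (placeholder stubs).
def analyze_text(el):     return {"entities": [], "relations": []}        # NLP extraction
def analyze_image(el):    return {"description": "vision-model caption"}  # VLM description
def analyze_table(el):    return {"summary": "trends across cells"}       # table patterns
def analyze_equation(el): return {"meaning": "concept linked to text"}    # LaTeX -> meaning

ANALYZERS = {
    "text": analyze_text,
    "image": analyze_image,
    "table": analyze_table,
    "equation": analyze_equation,
}

def analyze(element):
    # element.kind matches the parsed-element sketch from Stage 1
    return ANALYZERS[element.kind](element)
```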
Stage 3: Knowledge graph construction
This is where RAG-Anything diverges from standard approaches. Instead of isolated chunks, the system builds a graph where:
- Nodes represent entities from any modality (concepts, figures, table cells, equations)
- Edges capture relationships (references, contains, explains, derives-from)
- Weights indicate relationship strength based on semantic proximity
The dual-graph structure maintains two complementary views:
- Cross-modal graph: Connects entities across different content types
- Textual semantics graph: Preserves fine-grained relationships within text
Stage 4: Modality-aware retrieval
Queries activate both retrieval mechanisms:
- Vector similarity: Finds semantically related content across the corpus
- Graph traversal: Follows relationship edges to discover connected context
The results merge based on content-type relevance, ensuring that retrieved chunks maintain relational coherence.
Dual-Graph Construction
The knowledge graph is the core innovation. RAG-Anything builds two complementary structures that capture different aspects of document semantics.
[Figure: Dual-graph architecture. Two complementary graphs for cross-modal and textual relationships.]
Think of a document like a city. Traditional RAG is like having a list of addresses. You can find individual buildings, but you don't know how to get between them. RAG-Anything is like a GPS map that knows the roads, bridges, and transit lines connecting residential houses (text paragraphs) to commercial buildings (charts and tables). When you ask for directions, the GPS doesn't just give you one address. It shows you the route through connected streets.
A single graph struggles to balance granularity. Fine-grained textual relationships (word-level semantics) operate at a different scale than cross-modal relationships (figure-to-section connections). The dual structure lets each graph optimize for its specific purpose.
Cross-modal relationship graph
This graph connects entities across content types. When a paragraph references "Figure 3," an edge links the text node to the figure node. The edge carries metadata:
- Relationship type (references, explains, contradicts, supports)
- Confidence score from the extraction model
- Context window (surrounding text that establishes the relationship)
These edges enable queries like "What does Figure 3 show?" to retrieve both the figure and the explanatory text that accompanies it.
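As a sketch, a cross-modal edge can be represented as a small record carrying exactly that metadata; the field names here are assumptions, not the framework's schema.

```python
# Sketch of the metadata carried by one cross-modal edge (field names are assumptions).
from dataclasses import dataclass

@dataclass
class CrossModalEdge:
    source_id: str      # e.g. the text chunk that says "see Figure 3"
    target_id: str      # e.g. the Figure 3 image node
    relation: str       # "references" | "explains" | "contradicts" | "supports"
    confidence: float   # extraction-model confidence in [0, 1]
    context: str        # surrounding text that establishes the relationship

edge = CrossModalEdge(
    source_id="chunk_042",
    target_id="figure_3",
    relation="references",
    confidence=0.92,
    context="As shown in Figure 3, revenue accelerated sharply in Q3.",
)
```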
Textual semantics graph
Within text, the system maintains traditional knowledge graph structures:
- Entity nodes (people, organizations, concepts, quantities)
- Relationship edges (employs, located-in, causes, correlates-with)
- Hierarchical edges (section contains subsection contains paragraph)
This preserves the fine-grained reasoning capabilities of text-focused systems while adding multimodal awareness.
Graph construction process
- Entity extraction: Each content analyzer identifies entities in its modality
- Relationship detection: Cross-references are parsed from text; spatial proximity identifies visual relationships
- Edge weighting: Semantic similarity and contextual relevance determine edge strengths
- Graph merging: Modal-specific subgraphs combine into unified structures
The construction runs incrementally as documents are ingested, allowing the system to scale to large corpora.
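A minimal sketch of the four steps using networkx, with two separate graph objects for the dual structure. Node and edge attributes are illustrative assumptions, not RAG-Anything's actual schema.

```python
# Incremental dual-graph construction sketch (attribute names are assumptions).
import networkx as nx

cross_modal = nx.MultiDiGraph()   # entities across content types
textual = nx.MultiDiGraph()       # fine-grained relationships within text

# 1. Entity extraction: analyzed elements become nodes
cross_modal.add_node("chunk_042", kind="text", page=12,
                     content="As shown in Figure 3, revenue accelerated in Q3.")
cross_modal.add_node("figure_3", kind="image", page=12,
                     description="Line chart of quarterly revenue; steepest growth in Q3.")

# 2. Relationship detection + 3. edge weighting: an explicit reference becomes a weighted edge
cross_modal.add_edge("chunk_042", "figure_3", relation="references", weight=0.92)

# Textual semantics graph: entities and relations extracted from the same text block
textual.add_edge("revenue", "Q3 2024", relation="accelerated-in")

# 4. Graph merging: the two views share node identifiers, so retrieval can hop between them
cross_modal.add_edge("chunk_042", "revenue", relation="mentions", weight=0.80)
```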
Hybrid Retrieval
Retrieval combines vector similarity search with graph traversal. Neither approach alone captures the full picture.
AI systems turn text into lists of numbers called "embeddings." Each number represents some aspect of the meaning. Similar meanings produce similar numbers. When you search, the system compares your query's numbers to every document's numbers and returns the closest matches. It's like converting words into GPS coordinates, then finding locations near your search point.
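A toy numpy example of that idea: embeddings are just vectors, and "nearby" means high cosine similarity.

```python
# Toy example: rank documents by cosine similarity of their embeddings to the query.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.9, 0.1, 0.3])   # embedding of the query (3-d for illustration)
doc_vecs = {
    "revenue paragraph": np.array([0.8, 0.2, 0.4]),
    "legal boilerplate": np.array([0.1, 0.9, 0.2]),
}

ranked = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
print(ranked)   # ['revenue paragraph', 'legal boilerplate']
```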
Vector search finds content with similar meaning to the query. Graph traversal finds content connected to relevant nodes through relationships. A query about "Q3 revenue trends" might vector-match a paragraph mentioning revenue, then graph-traverse to find the associated chart that visualizes the actual numbers.
Vector-graph fusion
The retrieval algorithm:
- Encode the query into the same embedding space as document content
- Vector search identifies top-k semantically similar nodes
- Graph expansion follows edges from matched nodes to discover related content
- Modality scoring adjusts rankings based on content-type relevance to the query
- Coherence filtering ensures returned chunks maintain relational integrity
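A hedged sketch of how the fused score might be computed after graph expansion; the 70/30 weighting is an assumption, not the paper's tuned values.

```python
# Fuse vector-similarity scores with graph-expansion scores (weights are assumptions).
def fuse(vector_hits, graph_hits, alpha=0.7):
    """Both arguments map node_id -> score in [0, 1]; returns combined scores."""
    nodes = set(vector_hits) | set(graph_hits)
    return {
        n: alpha * vector_hits.get(n, 0.0) + (1 - alpha) * graph_hits.get(n, 0.0)
        for n in nodes
    }

scores = fuse({"chunk_042": 0.91}, {"chunk_042": 0.60, "figure_3": 0.80})
print(sorted(scores, key=scores.get, reverse=True))   # ['chunk_042', 'figure_3']
```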
Modality-aware ranking
Not all content types are equally relevant for every query. A question asking "describe the methodology" likely wants text. A question asking "show me the results" might prioritize figures and tables.
The ranking function learns these preferences from training data, adjusting scores based on:
- Query phrasing (action verbs like "show" or "visualize" boost visual content)
- Domain patterns (financial queries often need table data)
- Relationship signals (content connected to highly-scored nodes gets boosted)
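The paper describes a learned ranker; as a rule-based stand-in, here is a hedged sketch of the query-phrasing signal (keyword lists and boost values are assumptions).

```python
# Rule-based stand-in for modality-aware ranking (the paper learns this from data).
VISUAL_CUES = {"show", "plot", "chart", "visualize", "figure", "table"}

def modality_boost(query: str, node_kind: str) -> float:
    wants_visual = any(word in VISUAL_CUES for word in query.lower().split())
    if wants_visual and node_kind in {"image", "table"}:
        return 1.2   # up-weight visual content for visual-intent queries
    return 1.0

print(modality_boost("show me the results", "image"))       # 1.2
print(modality_boost("describe the methodology", "image"))  # 1.0
```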
Handling long documents
For documents exceeding 100 pages, the graph structure provides significant advantages:
| Challenge | Chunking approach | Graph approach |
|---|---|---|
| Distant references | Lost across chunks | Preserved as edges |
| Repeated concepts | Duplicated embeddings | Single node, multiple edges |
| Cross-section relationships | Missed entirely | Explicit connections |
| Context windows | Fixed size limits | Variable based on relevance |
The 13+ point improvement on long documents stems directly from these structural advantages.
Benchmark Results
RAG-Anything was evaluated on DocBench and MMLongBench, two benchmarks designed to test multimodal document understanding.
DocBench performance
DocBench covers five domains: academia, finance, government, legal, and news. Documents mix text and images, and the benchmark includes unanswerable queries (testing the system's ability to recognize when information is missing).
| Configuration | Accuracy |
|---|---|
| RAG-Anything (full) | 63.4% |
| Chunk-only variant | 60.0% |
| GPT-4o-mini baseline | ~55% |
| LightRAG | ~54% |
The chunk-only ablation confirms that graph structure is essential. Removing the knowledge graph and using traditional chunking drops accuracy by 3.4 points.
Performance by document length
The gap between RAG-Anything and baselines grows with document length:
[Figure: DocBench performance by document length. The accuracy gap grows as documents get longer.]
The chart reveals a clear trend: while both systems start close together on short documents, traditional RAG "falls off a cliff" once documents exceed 100 pages. RAG-Anything maintains its accuracy because the graph structure preserves connections across hundreds of pages. Short documents have less need for long-range relationship tracking, which explains why the gap starts small.
Component testing (ablation studies)
To understand which parts of the system matter most, the authors removed components one at a time and measured the impact:
| Variant | Change | Accuracy impact |
|---|---|---|
| No cross-modal graph | Remove cross-modality edges | -2.1% |
| No textual graph | Remove fine-grained text relationships | -1.8% |
| Vector-only retrieval | Disable graph traversal | -2.5% |
| Graph-only retrieval | Disable vector search | -1.9% |
Both graphs contribute, and both retrieval mechanisms are necessary. The hybrid approach outperforms either component in isolation.
Practical Applications
RAG-Anything addresses real pain points in document-heavy workflows.
Enterprise search
Corporate knowledge bases contain mixed content: slide decks with charts, reports with tables, specifications with diagrams. Traditional search finds documents but cannot answer questions that require understanding visual content.
With RAG-Anything:
- "What were our top 3 markets last quarter?" retrieves the relevant chart and extracts the data
- "Compare the architecture in proposal A vs proposal B" finds and analyzes diagrams from both documents
- "Summarize the budget breakdown" locates tables and synthesizes the numbers
Academic research
Research papers are inherently multimodal. Methods sections reference figures. Results depend on tables. Equations define the mathematical framework.
Researchers can query:
- "How does this paper's approach differ from [citation]?" with system understanding of architectural diagrams
- "What datasets were used and what were the results?" pulling both methodology text and results tables
- "Explain the loss function" connecting equations to their textual explanations
Financial analysis
Financial documents combine narrative with data-heavy exhibits. Analyst reports include charts. SEC filings have extensive tables.
The system handles:
- "What's driving the margin compression mentioned on page 47?" following references to supporting exhibits
- "Compare revenue growth across segments" finding and interpreting the relevant breakdowns
- "Summarize risk factors related to supply chain" aggregating text and any supporting visualizations
Legal document review
Contracts and legal filings often contain schedules, exhibits, and referenced documents. Understanding requires following these cross-references.
Applications include:
- "What are the payment terms?" finding the relevant schedule and interpreting the table structure
- "Summarize all indemnification clauses" aggregating provisions across document sections
- "What exhibits are referenced in Section 4.2?" following explicit cross-references
Implementation blueprint
Research papers describe what works. They rarely explain how to build it. Here is the practical stack and workflow for implementing RAG-Anything in production.
The tech stack
| Layer | Options | Why |
|---|---|---|
| Document Parser | MinerU, Unstructured.io | Must return bounding boxes for images, not just extracted text. Position data is critical for cross-modal linking. |
| Graph Database | Neo4j, Amazon Neptune | Need a database that handles structured nodes and edges and can integrate with vector indices. Neo4j has native vector search as of 2024. |
| Vector Database | Milvus, Qdrant, Pinecone | For the dense retrieval component. Can be separate or integrated with the graph DB. |
| Vision Model | GPT-4o, Gemini 1.5 Pro, Claude | Converts images to text descriptions. Critical for making visual content searchable. |
| Orchestration | LlamaIndex, LangChain | LlamaIndex has mature PropertyGraph and GraphRAG implementations. Handles the retrieval fusion logic. |
Most document parsers strip images and return plain text. You need one that preserves spatial coordinates. When the parser tells you "Figure 1 is at coordinates (x: 450, y: 1200)" and "the text 'see Figure 1' is at (x: 100, y: 1180)", you can infer they're related. Without coordinates, you're guessing.
The build workflow
Step 1: Document ingestion
```
PDF → Parser → Structured Output
  ├── Text chunks (with coordinates)
  ├── Images (extracted as files)
  ├── Tables (as structured data)
  └── Equations (as LaTeX)
```
Run each document through MinerU or Unstructured. The output should include:
- Text blocks with page number and bounding box
- Images saved as separate files with their coordinates
- Tables preserved as structured data, not flattened text
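For reference, a hedged sketch of this step using the unstructured library; parameter names and metadata fields differ between versions, so treat this as a starting point rather than the exact incantation (MinerU works similarly but with its own API).

```python
# Hedged sketch: layout-aware PDF parsing with unstructured (API details vary by version).
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",            # layout-aware parsing so elements carry coordinates
    infer_table_structure=True,   # keep tables structured instead of flattened text
)

for el in elements:
    coords = getattr(el.metadata, "coordinates", None)   # bounding-box info when available
    print(el.category, el.metadata.page_number, bool(coords), el.text[:60])
```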
Step 2: Visual content processing
Pass every extracted image to a vision model:
```
Image → Vision Model → Text Description

"A line chart showing quarterly revenue from Q1 2023
 to Q4 2024. Revenue increases from $2.3M to $4.1M
 with steepest growth in Q3 2024."
```
This description becomes searchable. When someone asks about "revenue growth," the vector search can match this text even though the original was an image.
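A hedged sketch of this step with the OpenAI Python SDK; any vision-capable model works, and the model name and prompt wording are assumptions.

```python
# Convert an extracted image into a searchable text description (model/prompt are assumptions).
import base64
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

with open("figure_3.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this figure for search: chart type, axes, units, and the key trend."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
description = response.choices[0].message.content   # stored on the Image node, embedded for search
```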
Step 3: Graph construction (the hard part)
Create nodes for each content type:
```
(:TextChunk {id, content, embedding, page, coordinates})
(:Image {id, description, embedding, file_path, page, coordinates})
(:Table {id, structured_data, summary, embedding, page})
(:Document {id, title, metadata})
```
The critical step is creating relationship edges. Use two signals:
- Explicit references. Parse text for patterns like "Figure 1", "Table 3", "see chart below". Create edges: `(:TextChunk)-[:REFERENCES]->(:Image)`
- Spatial proximity. If a text chunk and image are on the same page within 200 pixels, they're likely related. Create edges: `(:TextChunk)-[:NEAR]->(:Image)`
```cypher
// Cypher example for Neo4j
MATCH (t:TextChunk), (i:Image)
WHERE t.page = i.page
  AND abs(t.y_coord - i.y_coord) < 200
CREATE (t)-[:NEAR]->(i)
```
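And a hedged sketch of the explicit-reference signal: regex-match figure and table mentions in each chunk, then create REFERENCES edges with the neo4j Python driver. The `ref_number` property and connection details are assumptions about the schema sketched above.

```python
# Create REFERENCES edges from "Figure N" / "Table N" mentions (schema details are assumptions).
import re
from neo4j import GraphDatabase

REF_PATTERN = re.compile(r"\b(Figure|Table)\s+(\d+)\b", re.IGNORECASE)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_explicit_references(chunk_id: str, chunk_text: str) -> None:
    with driver.session() as session:
        for kind, number in REF_PATTERN.findall(chunk_text):
            label = "Image" if kind.lower() == "figure" else "Table"
            session.run(
                f"""
                MATCH (t:TextChunk {{id: $chunk_id}})
                MATCH (x:{label} {{ref_number: $number}})
                MERGE (t)-[:REFERENCES]->(x)
                """,
                chunk_id=chunk_id,
                number=int(number),
            )

link_explicit_references("chunk_042", "Revenue trends are shown in Figure 3 and Table 2.")
```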
Step 4: Hybrid retrieval
When a query arrives:
- Vector search. Find top-k text chunks by embedding similarity
- Graph expansion. For each matched chunk, traverse edges to find connected images, tables, equations
- Relevance scoring. Rank the expanded set by combined vector + graph signal
- Context assembly. Build the prompt with text chunks AND their connected visual descriptions
```python
# Pseudocode
matches = vector_search(query, top_k=5)
expanded = []
for chunk in matches:
    connected = graph.traverse(chunk, edge_types=['REFERENCES', 'NEAR'])
    expanded.extend(connected)
context = deduplicate_and_rank(matches + expanded)
response = llm.generate(query, context)
```

Where teams get stuck
Problem 1: Parser quality. Most parsers mangle tables and miss image coordinates. Budget time for parser evaluation.
Problem 2: Edge creation heuristics. The "200 pixels" threshold is illustrative. The actual value depends on your parser's output resolution (DPI) and document type. Legal documents have different layouts than research papers. Expect to tune this.
Problem 3: Graph query performance. Naive traversal gets slow. Index your edges. Limit traversal depth. Cache frequent patterns.
Problem 4: Context window limits. Pulling in every connected image description bloats the context. Implement relevance filtering on the expanded set.
Don't build the full dual-graph architecture on day one. Start with explicit reference parsing ("Figure 1", "Table 2"). Add spatial proximity once that works. Add the textual semantics graph last. Each layer adds value independently.
Business Implications
For RAG product teams
Architecture investment priority. If your documents contain any non-text content, single-modality RAG will systematically fail on queries that require understanding that content. RAG-Anything's approach (or similar multimodal frameworks) becomes necessary, not optional.
Quality ceiling recognition. Text-only systems hit a hard ceiling on document understanding. No amount of prompt engineering or model upgrades fixes the fundamental issue of missing visual information.
Evaluation methodology. Standard RAG benchmarks often use text-heavy datasets. Test on actual customer documents that include charts, tables, and images to understand real-world performance.
For enterprise search teams
Content audit. Survey your document corpus. What percentage contains meaningful non-text content? That percentage represents queries where current text-only systems will fail.
User expectation management. Users expect AI systems to "see" what they see in documents. When the system cannot answer questions about visual content, trust erodes.
Incremental adoption path. RAG-Anything is open source and can run alongside existing systems. Pilot on document types with high multimodal content before full rollout.
For document management vendors
Feature differentiation. Multimodal RAG moves from nice-to-have to table stakes as users encounter the limitations of text-only systems.
Integration complexity. The framework requires vision models, knowledge graph infrastructure, and hybrid retrieval. This is significantly more complex than text-only RAG pipelines.
Compute requirements. Processing images and building knowledge graphs adds computational overhead. Pricing models may need adjustment.
For AI infrastructure teams
Pipeline orchestration. RAG-Anything coordinates multiple specialized models (text, vision, graph). Infrastructure must support this heterogeneous workload.
Storage considerations. Knowledge graphs require graph database infrastructure. Vector stores alone are insufficient.
Latency budgets. Graph traversal adds retrieval time compared to pure vector search. Benchmark actual performance against user expectations.
Limitations
Computational overhead
Processing visual content and building knowledge graphs requires more compute than text-only approaches. For high-volume applications, this overhead may be significant.
The framework requires:
- Vision models for image analysis
- Graph database for relationship storage
- Multiple embedding models for different content types
Domain-specific tuning
The content analyzers work best when tuned for specific document types. A financial analyst system needs different table interpretation than a medical records system.
Out-of-the-box performance may require customization for specialized domains.
Parsing quality dependency
The system relies on its document parser for structure extraction. Poorly formatted PDFs, scanned documents with OCR errors, or unusual layouts can degrade downstream performance.
Garbage in, garbage out still applies. Document quality matters.
Graph construction latency
Building the knowledge graph takes time during document ingestion. For real-time applications where documents arrive and must be immediately queryable, this latency may be problematic.
The authors don't provide detailed latency benchmarks, which would be valuable for production planning.
Complex query handling
While the system handles straightforward multimodal queries well, highly complex reasoning chains (multi-hop questions requiring synthesis across many document sections) remain challenging.
The graph structure helps but doesn't fully solve multi-step reasoning.
Original paper: arXiv ・ PDF ・ HTML
Code: GitHub
Authors: Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang (Data Intelligence Lab, The University of Hong Kong)
Cite this paper
Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang (2025). RAG-Anything: Unified Multimodal Retrieval for Real-World Documents. arXiv 2025.