- Problem. RAG systems store retrieved facts as isolated items in a flat list, losing the relationships between them and limiting complex reasoning.
- Solution. HGMem represents memory as a hypergraph where “hyperedges” connect multiple related entities, with operations to update, insert, and merge memory points as reasoning progresses.
- Results. On sense-making tasks over long documents (100K-280K tokens), HGMem achieves 55% on NarrativeQA vs 45% for DeepRAG, with consistent gains across all narrative benchmarks.
Research overview
You’ve seen it happen: you ask your RAG system a complex question, and it returns three perfectly correct, completely unrelated facts that fail to answer the actual “why.” The revenue dropped. The CEO changed. A new competitor entered. All true. None connected. The system retrieved the pieces but couldn’t assemble the puzzle.
This is the scattered evidence problem, and it’s why multi-step RAG exists.
Multi-step RAG systems don’t retrieve once and answer. They iterate: retrieve some facts, reason about what’s missing, retrieve more, and repeat until they have enough information. This is essential for complex questions where no single passage contains the full answer.
The problem is how these systems store their “working memory.” Most implementations use a flat list: retrieve passage A, add it to memory, retrieve passage B, add it to memory. The passages sit side by side without any structure connecting them.
HGMem takes a different approach. It represents memory as a hypergraph, a structure where edges can connect more than two nodes. Instead of storing isolated facts, HGMem creates “memory points” that capture relationships between multiple entities. As reasoning progresses, these memory points can merge into higher-order units that synthesize evidence from different parts of the document.
Think of a detective’s corkboard versus a notebook. A notebook stores facts linearly: “Suspect A was at the bank. The money disappeared. Suspect B owns a boat.” A corkboard connects clues with red string: the bank receipt, the boat registration, and the harbor photo all pinned together because they form one story. Standard RAG is the notebook. HGMem is the corkboard.
HGMem treats memory as a living structure that evolves, not a passive container that accumulates facts.
The flat memory problem
Consider a question about a fantasy novel: “What is the relationship between the protagonist, the antagonist, and the magical artifact?”
A flat memory system might retrieve three separate facts:
- “The protagonist seeks the Crown of Shadows”
- “The antagonist once wielded the Crown”
- “The Crown corrupts its bearer”
These facts sit in memory as isolated items. In HGMem, these three facts merge into a single hyperedge:
Hyperedge ID: HP-001
Entities: {Protagonist, Antagonist, Crown}
Description: "The protagonist seeks the Crown of Shadows, which the antagonist once wielded and which corrupts its bearer."
Now the LLM reads one structured unit that captures the full relationship, instead of piecing together three unrelated snippets.
FLAT MEMORY          HYPERGRAPH MEMORY

[Fact 1]             [Fact 1]
[Fact 2]                      ↘
[Fact 3]             [Composite Unit]
[Fact 4]                ↗         ↖
                     [Fact 2]   [Fact 3]

(disconnected)       (connected)
When facts are disconnected, the LLM must reconstruct relationships during response generation. This works for simple queries but breaks down when the answer requires integrating many scattered facts. The model can’t “see” the structure of how evidence connects.
The paper identifies two categories of queries with different requirements:
| Query Type | Example | What’s Needed |
|---|---|---|
| Primitive | “What was Q3 revenue?” | Find one fact |
| Sense-making | “How did the revenue decline impact R&D?” | Connect facts across sections |
Look at your own query logs. If 80% are primitive, you probably don’t need HGMem. If you’re drowning in sense-making questions that your current RAG can’t answer, keep reading.
Flat memory handles primitive queries adequately. For sense-making queries, it forces the LLM to do heavy lifting at response time with no structural guidance.
Hypergraph memory architecture
HGMem represents memory as M = (V, E) where V contains entity nodes and E contains hyperedges that function as “memory points.”
[Figure: Memory evolution, showing how isolated facts merge into connected knowledge across reasoning steps]
Each entity node stores:
- Entity information (name, type)
- Associated text chunks from the document
Each hyperedge (memory point) stores:
- A relationship description
- References to subordinate entities (2 or more)
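To make that concrete, here is roughly what the structure could look like in Python. This is a simplified sketch for illustration; the class and field names are assumptions, not the paper's implementation (the paper's own schema appears in the implementation blueprint below).

from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    etype: str                                        # entity type, e.g. "person", "artifact"
    chunks: list[str] = field(default_factory=list)   # associated text chunks from the document

@dataclass
class MemoryPoint:                                    # a hyperedge
    entities: set[str]                                # names of 2+ subordinate entities
    description: str                                  # relationship description

# M = (V, E): entity nodes plus hyperedges acting as memory points
@dataclass
class HypergraphMemory:
    nodes: dict[str, Entity] = field(default_factory=dict)           # V
    memory_points: list[MemoryPoint] = field(default_factory=list)   # E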
Why hyperedges matter
Standard knowledge graphs use binary edges: “Character A” → “defeated” → “Character B”. Complex relationships have to be split across several such edges, which loses their unity.
Binary relationships connect two things. N-ary relationships connect N things simultaneously.
Graph: A → B (one relationship)
Hypergraph: (A, B, C, D) → “transaction” (one unit)
“Alice sold the house to Bob for $500K through agent Carol” involves four entities in one relationship. Breaking this into binary edges loses the unified transaction context.
Hyperedges capture n-ary relationships directly. “Character A defeats Character B, who was enslaved to Character C, using the artifact from Character D” becomes a single memory unit that preserves the full context.
Traditional RAG stores retrieved text passages. HGMem stores structured “memory points” that capture relationships between entities. A memory point is a semantic unit with explicit connections to the entities it involves, not raw text.
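To make the contrast concrete, here is the house-sale example above expressed both ways, using the simplified MemoryPoint sketch from earlier. This is illustrative pseudostructure, not the paper's storage format.

# One n-ary memory point keeps the transaction intact
sale = MemoryPoint(
    entities={"Alice", "Bob", "Carol", "house"},
    description="Alice sold the house to Bob for $500K through agent Carol.",
)

# The same fact forced into binary edges loses the unified transaction context
binary_edges = [
    ("Alice", "sold", "house"),
    ("Bob", "bought", "house"),
    ("Carol", "brokered_for", "Alice"),   # the price and the parties' roles scatter across edges
]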
Building from knowledge graphs
HGMem builds on top of a knowledge graph extracted from the document. The paper uses LightRAG’s extraction tool with GPT-4o to identify entities and relationships. This pre-processing creates the initial graph structure that HGMem’s memory evolves from.
Memory evolution operations
The hypergraph memory isn’t static. It evolves through three operations as reasoning progresses:
Update
Modifies the description of an existing memory point without changing which entities it connects. Use case: refining understanding based on new evidence.
Insertion
Creates new memory points for newly retrieved information. When retrieval finds relevant entities and relationships not yet in memory, insertion adds them as new hyperedges.
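A minimal sketch of what these first two operations might look like against the simplified structures above. The function names, signatures, and the llm helper are illustrative assumptions, not the paper's API.

def update(point: MemoryPoint, new_evidence: str, query: str) -> None:
    # Refine the description in place; the entity set is unchanged
    point.description = llm(
        f"Revise this memory for the query '{query}'.\n"
        f"Current: {point.description}\nNew evidence: {new_evidence}"
    )

def insert(memory: HypergraphMemory, entities: set[str], description: str) -> MemoryPoint:
    # Add a new hyperedge for newly retrieved information not yet in memory
    point = MemoryPoint(entities=entities, description=description)
    memory.memory_points.append(point)
    return point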
Merging
Combines existing memory points into higher-order composite units. If memory point A connects one set of entities and memory point B connects another set that overlaps with it, merging creates a new memory point over the union of both entity sets, with a synthesized description.
Picture a master weaver with two strips of silk: one showing a dragon, another showing a phoenix. By interlacing the threads where the patterns overlap, the weaver creates a single tapestry where dragon and phoenix soar together. The overlapping entities are where the threads interlock; the new synthesized description is the unified scene that emerges.
Merging transforms a collection of facts into integrated knowledge. The ablation study shows removing merging causes the largest performance drops: 10% on sense-making queries where integration is essential, but near-zero impact on simple fact lookups.
The LLM decides when to merge based on the current query and memory state:
def should_merge(a, b, query):
    # Check entity overlap
    shared = a.entities & b.entities
    if not shared:
        return False
    # Check semantic coherence
    prompt = f"""
    Memory A: {a.description}
    Memory B: {b.description}
    Query: {query}
    Coherent unit? (yes/no)
    """
    return llm(prompt).strip().lower() == "yes"

def merge(a, b, query):
    entities = a.entities | b.entities
    desc = llm(
        f"Synthesize: {a.description}"
        f" + {b.description}"
        f" for: {query}"
    )
    return MemoryPoint(entities, desc)
This creates “higher-order” memory points that capture complex relationships across multiple evidence sources.
Adaptive retrieval strategy
HGMem uses two complementary retrieval modes:
Local investigation
When memory already contains relevant information, local investigation explores the neighborhood of known entities. If the memory includes “Character A” connected to several relationships, local investigation retrieves other entities and facts connected to Character A in the underlying knowledge graph.
This is efficient for deepening understanding of areas already in memory.
Global exploration
When local investigation isn’t finding new relevant information, global exploration searches regions of the knowledge graph not yet represented in memory. This prevents the system from getting stuck in local optima.
The system alternates between these modes based on what the current query requires. If the subquery asks about something already in memory, local investigation runs first. If memory lacks relevant starting points, global exploration searches unexplored regions.
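As a rough sketch of that alternation: the selection rule below (overlap between the subquery and entities already in memory) is an assumption based on the description above, not the paper's exact criterion, and kg stands in for your knowledge-graph store with hypothetical neighbors_of and search helpers.

def retrieve(memory: HypergraphMemory, subquery: str, kg) -> list[str]:
    # Entities mentioned in the subquery that memory already knows about
    known = {name for name in memory.nodes if name.lower() in subquery.lower()}
    if known:
        # Local investigation: expand the neighborhood of known entities in the KG
        return kg.neighbors_of(known)
    # Global exploration: search KG regions not yet represented in memory
    return kg.search(subquery, exclude=set(memory.nodes))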
Benchmark results
The researchers tested HGMem against five strong baselines across six datasets: financial reports, government documents, legal texts, and long narratives.
Single-step RAG: GraphRAG and LightRAG build knowledge graphs from documents and retrieve from them. HippoRAG v2 uses knowledge triples for retrieval. Multi-step RAG: DeepRAG and ComoRAG iterate between retrieval and reasoning but use flat memory without hypergraph structure.
[Chart: HGMem performance vs. baselines; accuracy on long narrative understanding benchmarks]
Long document sense-making
[Chart: Longbench generative QA response quality; GPT-4o evaluation on 200K+ token documents]
HGMem provides more complete and diverse responses on 200K+ token documents. Both metrics are GPT-4o evaluations of response quality.
Narrative understanding
The benchmark chart above shows accuracy on three narrative understanding datasets. The largest gains appear on NarrativeQA (+10%), which requires integrating information across long novels. NoCha and Prelude show smaller but consistent improvements (+2-3%).
Primitive vs sense-making queries
An ablation study removes or disables components of a system to measure their individual contribution. Here, the researchers disabled the merging operation to see how much it matters. This helps identify which parts of a system are essential vs. optional.
The paper breaks down results by query type:
[Chart: Where HGMem shines, sense-making vs. simple queries; the 10% gain appears only on complex queries that require connecting facts]
Finding: HGMem’s advantage concentrates on sense-making queries. On primitive queries (simple fact lookup), performance is comparable or slightly lower than the version without merging. On sense-making queries requiring integration, HGMem shows 10%+ gains.
The paper also measured entities per hyperedge: sense-making queries produce hyperedges with ~2x more entities (7.07 vs 3.35 on NarrativeQA). This confirms that the merging operation creates higher-order structures specifically when complex integration is required.
This validates the paper’s thesis: flat memory suffices for simple retrieval, but structured memory with high-order correlations matters when reasoning requires connecting scattered evidence.
Step-by-step performance
[Chart: Performance by reasoning step; accuracy peaks at step 3, then shows diminishing returns]
The chart shows HGMem vs baseline accuracy across reasoning steps. Key findings:
- Steps 1-2 show steady improvement as memory accumulates relevant facts
- Steps 4+ show flat or declining accuracy, likely due to noise accumulation
- HGMem outperforms baselines at every step
Stop at Step 3. Performance peaks here across all benchmarks. Steps 4 and 5 waste tokens without improving accuracy. The paper notes “more steps bring no further improvements at a higher cost.” Set max_steps=3 and move on.
Cost analysis
HGMem adds 15-20 seconds and 3-4x the token cost of a standard RAG call due to the iterative loop. If you’re building a real-time chatbot, this is an eternity. HGMem is for batch processing, async workflows, or use cases where accuracy matters more than speed. Budget accordingly before deploying at scale.
HGMem adds computational overhead for memory evolution:
| Dataset | Tokens/Query | Latency |
|---|---|---|
| NarrativeQA | 4,436 | 15.8s |
| NoCha | 5,253 | 18.8s |
| Prelude | 5,422 | 19.4s |
The merging operation adds approximately 6% token overhead and less than 5% latency compared to the version without merging. For complex queries where accuracy matters, this trade-off is favorable.
Implementation blueprint
The paper gives you the architecture. This section gives you the shipping path. Below you’ll find the exact stack the researchers used, the parameters that worked, the prompts that drive the merge logic, and the cost estimates that determine whether HGMem makes sense for your workload. If you’re ready to stop reading and start building, start here.
Recommended stack
The paper’s codebase reveals the tools that actually work. These are the exact choices the authors used to get published results.
| Component | Recommendation | Notes |
|---|---|---|
| KG Extraction | LightRAG tool | Paper uses this with GPT-4o |
| Embeddings | bge-m3 | Paper’s choice for entity/relation vectors |
| Vector DB | nano-vectordb | Lightweight option from paper |
| Hypergraph | hypergraph-db | Python package for hypergraph operations |
| LLM | GPT-4o or Qwen2.5-32B | Both evaluated in paper |
Key parameters
These are the hyperparameters that produced the benchmark numbers. Start here; tune later.
| Parameter | Value | Context |
|---|---|---|
| Chunk size | 200 tokens | With 50-token overlap |
| Max steps | 3 | Optimal; more steps show diminishing returns |
| Temperature | 0.8 | For memory operations |
| Max tokens | 2,048 | For response generation |
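If you want these as a starting configuration, something like the following captures the table. The key names are illustrative, not the paper's config schema.

HGMEM_CONFIG = {
    "chunk_size": 200,       # tokens per chunk
    "chunk_overlap": 50,     # tokens of overlap between chunks
    "max_steps": 3,          # reasoning iterations; more shows diminishing returns
    "temperature": 0.8,      # for memory operations
    "max_tokens": 2048,      # for response generation
}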
Data structures
Everything in HGMem flows through two core types: the memory point (a hyperedge) and the hypergraph that holds them. Get these right and the rest follows.
The hyperedge (memory point) schema:
interface MemoryPoint {
  id: string;
  entities: string[];        // 2+ entity IDs
  description: string;       // Synthesized relationship
  source_chunks: string[];   // Original text refs
  created_at_step: number;
  merged_from?: string[];    // Parent memory IDs
}

interface HypergraphMemory {
  nodes: Map<string, Entity>;
  hyperedges: Map<string, MemoryPoint>;
}
Core workflow
The system operates in two phases: an offline indexing step where you build the knowledge graph, and an online query loop where the hypergraph memory evolves through retrieve-reason-evolve cycles.
[Diagram: HGMem pipeline, from raw documents to evolved memory in 3 iterations]
Implementation steps
Here’s the sequence from raw documents to working system. Each step builds on the previous one.
1. Pre-process documents into a knowledge graph using entity/relation extraction. This is done offline before queries. (Warning: this is slow and expensive. Extracting a KG from a 200-page document with GPT-4o can take minutes and cost dollars per document. Budget for this upfront investment.)
2. Embed entities and relationships using bge-m3 or similar. Store them in a vector database for retrieval.
3. For each query, initialize an empty hypergraph memory and run the multi-step loop (see the sketch after this list):
   - Generate subqueries based on current memory
   - Retrieve relevant entities via the local/global strategy
   - Evolve memory through update/insert/merge
   - Check if memory suffices; if not, continue
4. Generate a response using the final memory state as context.
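Here is a compact sketch of that online loop. It reuses the simplified HypergraphMemory and retrieve sketches from earlier in this post; the llm, summarize, and evolve_memory helpers are hypothetical placeholders, not the released code.

def answer_query(query: str, kg, max_steps: int = 3) -> str:
    memory = HypergraphMemory()
    for step in range(max_steps):
        # 1. Decide what is still missing given the current memory state
        subquery = llm(
            f"Query: {query}\nMemory: {summarize(memory)}\n"
            "What should we look up next?"
        )
        # 2. Retrieve via local investigation or global exploration
        evidence = retrieve(memory, subquery, kg)
        # 3. Evolve memory: update / insert / merge
        evolve_memory(memory, evidence, query)
        # 4. Stop early if memory already suffices
        done = llm(f"Can this memory answer '{query}'? (yes/no)\n{summarize(memory)}")
        if done.strip().lower() == "yes":
            break
    # Generate the final response from the evolved memory state
    return llm(f"Answer '{query}' using only this memory:\n{summarize(memory)}")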
The merge prompt
The merge operation is where HGMem’s performance gains come from. Get this prompt wrong and you’ll either over-merge (creating noise) or under-merge (missing connections). Here’s a template that balances both:
SYSTEM: You are a memory manager for a RAG system.
Your job is to decide if two memory points should merge.

RULES:
- Merge if they describe the same event from different angles
- Merge if combining them answers the query better than either alone
- Do NOT merge if they're unrelated facts that happen to share an entity

INPUT:
Memory A: {mem_a.description}
Entities: {mem_a.entities}
Memory B: {mem_b.description}
Entities: {mem_b.entities}
Query: {current_query}

OUTPUT (JSON):
{
  "should_merge": true/false,
  "reason": "...",
  "merged_description": "..."   // if true
}
The key insight: merge when facts explain each other, not just when they share entities.
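Wiring that decision into code might look like the sketch below. It assumes the same hypothetical llm wrapper with JSON mode used in the router snippet later in this post, and inlines the prompt rather than reusing the template verbatim.

import json

def decide_merge(mem_a, mem_b, current_query: str) -> dict:
    # If you store the template above and call str.format on it, escape its literal JSON braces as {{ }}
    prompt = (
        "Decide if two memory points should merge.\n"
        f"Memory A: {mem_a.description}\nEntities: {mem_a.entities}\n"
        f"Memory B: {mem_b.description}\nEntities: {mem_b.entities}\n"
        f"Query: {current_query}\n"
        'Respond in JSON: {"should_merge": true/false, "reason": "...", "merged_description": "..."}'
    )
    return json.loads(llm(prompt, response_format={"type": "json_object"}))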
Common pitfalls
These are the mistakes that won’t show up in your unit tests but will ruin production. Learn from teams that shipped before you.
1. Knowledge graph quality
If your entity extraction is bad, HGMem will amplify the errors, confidently building and merging relationships on top of them. Garbage in, garbage out.
Think of it like carving a marble statue from a cracked block of stone. Every fissure becomes a flaw in the final sculpture. Only a flawless slab produces a statue that holds together under the chisel’s pressure. Your KG is the raw marble.
Before debugging your memory system, check your extraction pipeline. Run a sample of 50 documents through your KG builder and manually verify the output. If entities are missing or relationships are wrong, fix that first.
2. Over-merging
The LLM decides when to merge memory points. Without constraints, it might merge too aggressively, creating memory points that are too broad. The paper’s prompts (Appendix B) include guidance on when merging is appropriate.
3. Step limits
Performance plateaus around step 3. Running more steps increases cost without improving accuracy. Implement hard limits.
4. Primitive query overhead
For simple fact-lookup queries, HGMem adds unnecessary overhead. Consider routing simple queries to a faster baseline system.
5. Query routing (make this production-ready)
Nobody should run a 15-second query for “What was Q3 revenue?” Build a router:
import json

def route_query(query: str) -> str:
    # Pro-tip: use JSON mode for reliability over text parsing
    prompt = f"""
    Classify: "{query}"
    SIMPLE: one fact needed
    COMPLEX: connect multiple facts
    JSON: {{"type": "SIMPLE/COMPLEX"}}
    """
    result = llm(
        prompt,
        response_format={"type": "json_object"}
    )
    qtype = json.loads(result)["type"]
    if qtype == "SIMPLE":
        return vector_rag(query)   # ~1s
    else:
        return hgmem_rag(query)    # ~15s
This hybrid approach captures HGMem’s gains where they matter while keeping simple lookups fast.
Before you build
Not every RAG system needs hypergraph memory. Run through this checklist before investing engineering time:
- Long documents? Are your docs 50+ pages where context windows struggle?
- “Why” questions? Do users ask questions requiring multi-fact synthesis?
- Can you wait 15s? Is this batch/async, not real-time chat?
If you checked all three, HGMem is worth the investment. If not, standard RAG or GraphRAG may be simpler paths.
Bill of materials
What will this cost to run? Here’s a realistic breakdown using GPT-4o pricing. Your mileage will vary based on document density and query complexity.
| Task | Cost (GPT-4o) | Time |
|---|---|---|
| Index 100-page PDF | ~$1.50 | 3-5 min |
| Index 1000 pages | ~$15 | 30-50 min |
| Single query | ~$0.10-0.15 | 15-20s |
| 100 queries/day | ~$10-15/day | — |
These are rough estimates. Actual costs depend on document density, query complexity, and how often the merge operation triggers.
MVP in an afternoon
Want to test the concept without setting up Neo4j, Docker, or cloud infrastructure? You can build a working prototype in 2-3 hours with Python’s standard library and NetworkX:
# No Docker, no Neo4j, no cloud
import json
import networkx as nx

# In-memory hypergraph
G = nx.Graph()

# Store as JSON
def save_memory(G, path="memory.json"):
    data = nx.node_link_data(G)
    with open(path, "w") as f:
        json.dump(data, f)

def load_memory(path="memory.json"):
    with open(path) as f:
        data = json.load(f)
    return nx.node_link_graph(data)
NetworkX handles graph operations. JSON handles persistence. Add OpenAI calls for the merge logic. You can have a working prototype in 2-3 hours without spinning up infrastructure.
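One caveat: nx.Graph only stores binary edges, so a common workaround is to model each hyperedge as its own node linked to the entities it connects (a bipartite encoding). This is an assumption for the prototype, not something the paper prescribes.

def add_memory_point(G, point_id: str, entities: list[str], description: str):
    # Represent the hyperedge as a dedicated node carrying the description...
    G.add_node(point_id, kind="memory_point", description=description)
    # ...connected to every entity it involves
    for entity in entities:
        G.add_node(entity, kind="entity")
        G.add_edge(point_id, entity)

add_memory_point(
    G, "HP-001",
    ["Protagonist", "Antagonist", "Crown"],
    "The protagonist seeks the Crown, which the antagonist once wielded and which corrupts its bearer.",
)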
Test case: the conflicting facts test
Your system is running. How do you verify the merge operation actually synthesizes information instead of just concatenating? Here’s a validation test that separates working implementations from broken ones.
Feed it a document where:
- Page 1 says: “Q3 revenue increased 12%”
- Page 50 says: “After adjustments, Q3 revenue decreased 3%”
Pass: HGMem merges these into: “Q3 revenue initially appeared to increase 12% but after accounting adjustments showed a 3% decrease.”
Fail: System returns both facts as isolated statements, leaving the user confused about which is true.
This test validates that the merge operation is synthesizing, not just concatenating.
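A minimal version of this check, assuming hypothetical index_document and hgmem_rag entry points and using simple keyword assertions rather than an LLM judge:

def test_conflicting_revenue_facts():
    document = (
        "Page 1: Q3 revenue increased 12%.\n"
        "Page 50: After adjustments, Q3 revenue decreased 3%."
    )
    index_document(document)                       # hypothetical offline KG-building step
    answer = hgmem_rag("What happened to Q3 revenue?")
    # A synthesized answer mentions both the headline increase and the adjusted decrease
    assert "12%" in answer and "3%" in answer
    assert "adjust" in answer.lower()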
Practical applications
The benchmarks suggest several use cases that align with HGMem’s strengths.
Financial report analysis
Aligns with: Longbench Financial (266K tokens, comprehensiveness/diversity metrics)
Questions like “How did the company’s risk profile change across divisions?” require integrating information from multiple sections. HGMem’s memory can connect facts about different business units into coherent higher-order structures.
Legal document review
Aligns with: Longbench Legal (194K tokens)
Legal analysis often requires connecting precedents, statutes, and case facts. The hypergraph structure can represent multi-party relationships and evolving legal positions.
Narrative understanding
Aligns with: NarrativeQA, NoCha, Prelude (novels and long-form fiction)
Character relationships in novels involve complex n-ary relationships. “Character A defeats Character B using the artifact that Character C forged for Character D” fits naturally into a hyperedge.
Research synthesis
Not directly evaluated, but the architecture suits: Multi-document QA where evidence must be integrated across papers.
When to use HGMem
Use HGMem when:
- Queries require integration. If your use case involves sense-making queries that connect scattered evidence, HGMem’s structured memory provides meaningful gains.
- Documents are long. The evaluated datasets range from 140K to 280K tokens. HGMem is designed for contexts where flat memory becomes unwieldy.
- Relationships are complex. If your domain involves n-ary relationships (legal cases with multiple parties, organizational structures, narrative plots), hypergraph representation captures structure that binary graphs lose.
- You have good KG extraction. HGMem builds on a pre-extracted knowledge graph. If your domain supports high-quality entity/relation extraction, HGMem can leverage that structure.
Consider alternatives when:
- Queries are mostly primitive. For simple fact lookup, the overhead of hypergraph memory isn’t justified. Standard RAG or simpler multi-step systems will perform comparably at lower cost.
- Latency is critical. HGMem adds 15-20 seconds per query for the evaluated configurations. Real-time applications may need faster approaches.
- KG extraction is unreliable. If entity extraction in your domain produces poor results, HGMem’s foundation will be weak. The system amplifies KG quality, good or bad.
- Documents are short. For documents that fit comfortably in context, direct LLM processing may outperform retrieval-based approaches.
Limitations
Offline KG dependency
HGMem requires a pre-built knowledge graph. The quality of entity/relation extraction directly impacts performance. For domains where extraction is difficult (technical jargon, implicit relationships), this dependency is a constraint.
Sequential processing
Memory evolution happens step by step. Each step requires LLM calls for subquery generation, retrieval decisions, and memory operations. Parallelization opportunities are limited.
Diminishing returns
Performance plateaus at step 3 across benchmarks. The system doesn’t automatically detect when additional steps won’t help, so implementers must set hard limits.
Primitive query overhead
The paper acknowledges that for simple queries, HGMem “can introduce redundancy by incorporating higher-order correlations even when primitive facts suffice.” The system doesn’t adapt its complexity to query difficulty.
Model dependence
Results are demonstrated with GPT-4o and Qwen2.5-32B. The memory evolution prompts are tuned for these models. Adaptation to other models may require prompt engineering.
Paper: arXiv:2512.23959
Authors: Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu (WeChat AI, CUHK)
Code: github.com/Encyclomen/HGMem
Cite this paper
Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu (2025). HGMem: Hypergraph Memory for Multi-step RAG. arXiv 2025.