- Problem. RAG systems store retrieved facts as isolated items in a flat list, losing the relationships between them and limiting complex reasoning.
- Solution. HGMem represents memory as a hypergraph where “hyperedges” connect multiple related entities, with operations to update, insert, and merge memory points as reasoning progresses.
- Results. On sense-making tasks over long documents (100K-280K tokens), HGMem achieves 55% on NarrativeQA vs 45% for DeepRAG, with consistent gains across all narrative benchmarks.
Research overview
You’ve seen it happen: you ask your RAG system a complex question, and it returns three perfectly correct, completely unrelated facts that fail to answer the actual “why.” The revenue dropped. The CEO changed. A new competitor entered. All true. None connected. The system retrieved the pieces but couldn’t assemble the puzzle.
This is the scattered evidence problem, and it’s why multi-step RAG exists.
Multi-step RAG systems don’t retrieve once and answer. They iterate: retrieve some facts, reason about what’s missing, retrieve more, and repeat until they have enough information. This is essential for complex questions where no single passage contains the full answer.
The problem is how these systems store their “working memory.” Most implementations use a flat list: retrieve passage A, add it to memory, retrieve passage B, add it to memory. The passages sit side by side without any structure connecting them.
HGMem takes a different approach. It represents memory as a hypergraph, a structure where edges can connect more than two nodes. Instead of storing isolated facts, HGMem creates “memory points” that capture relationships between multiple entities. As reasoning progresses, these memory points can merge into higher-order units that synthesize evidence from different parts of the document.
Think of a detective’s corkboard versus a notebook. A notebook stores facts linearly: “Suspect A was at the bank. The money disappeared. Suspect B owns a boat.” A corkboard connects clues with red string: the bank receipt, the boat registration, and the harbor photo all pinned together because they form one story. Standard RAG is the notebook. HGMem is the corkboard.
HGMem treats memory as a living structure that evolves, not a passive container that accumulates facts.
The flat memory problem
Consider a question about a fantasy novel: “What is the relationship between the protagonist, the antagonist, and the magical artifact?”
A flat memory system might retrieve three separate facts:
- “The protagonist seeks the Crown of Shadows”
- “The antagonist once wielded the Crown”
- “The Crown corrupts its bearer”
These facts sit in memory as isolated items. In HGMem, these three facts merge into a single hyperedge:
Hyperedge ID: HP-001
Entities: {Protagonist, Antagonist, Crown}
Description: "The protagonist seeks the Crown of Shadows, which the antagonist once wielded and which corrupts its bearer."
Now the LLM reads one structured unit that captures the full relationship, instead of piecing together three unrelated snippets.
FLAT MEMORY          HYPERGRAPH MEMORY

[Fact 1]             [Fact 1]
[Fact 2]                      ↘
[Fact 3]             [Composite Unit]
[Fact 4]                ↗         ↖
                     [Fact 2]   [Fact 3]

(disconnected)       (connected)
When facts are disconnected, the LLM must reconstruct relationships during response generation. This works for simple queries but breaks down when the answer requires integrating many scattered facts. The model can’t “see” the structure of how evidence connects.
The paper identifies two categories of queries with different requirements:
| Query Type | Example | What’s Needed |
|---|---|---|
| Primitive | “What was Q3 revenue?” | Find one fact |
| Sense-making | “How did the revenue decline impact R&D?” | Connect facts across sections |
Look at your own query logs. If 80% are primitive, you probably don’t need HGMem. If you’re drowning in sense-making questions that your current RAG can’t answer, keep reading.
Flat memory handles primitive queries adequately. For sense-making queries, it forces the LLM to do heavy lifting at response time with no structural guidance.
Hypergraph memory architecture
HGMem represents memory as M = (V, E) where V contains entity nodes and E contains hyperedges that function as “memory points.”
[Figure: Memory evolution, showing how isolated facts merge into connected knowledge across reasoning steps]
Each entity node stores:
- Entity information (name, type)
- Associated text chunks from the document
Each hyperedge (memory point) stores:
- A relationship description
- References to subordinate entities (2 or more)
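To make that concrete, here is roughly what the structure could look like in Python. This is a simplified sketch for illustration; the class and field names are assumptions, not the paper's implementation (the paper's own schema appears in the implementation blueprint below).

from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    etype: str                                        # entity type, e.g. "person", "artifact"
    chunks: list[str] = field(default_factory=list)   # associated text chunks from the document

@dataclass
class MemoryPoint:                                    # a hyperedge
    entities: set[str]                                # names of 2+ subordinate entities
    description: str                                  # relationship description

# M = (V, E): entity nodes plus hyperedges acting as memory points
@dataclass
class HypergraphMemory:
    nodes: dict[str, Entity] = field(default_factory=dict)           # V
    memory_points: list[MemoryPoint] = field(default_factory=list)   # E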
Why hyperedges matter
Standard knowledge graphs use binary edges: “Character A” → “defeated” → “Character B”. Complex relationships have to be split across several such edges, which loses their unity.
Binary relationships connect two things. N-ary relationships connect N things simultaneously.
Graph: A → B (one relationship)
Hypergraph: (A, B, C, D) → “transaction” (one unit)
“Alice sold the house to Bob for $500K through agent Carol” involves four entities in one relationship. Breaking this into binary edges loses the unified transaction context.
Hyperedges capture n-ary relationships directly. “Character A defeats Character B, who was enslaved to Character C, using the artifact from Character D” becomes a single memory unit that preserves the full context.
Traditional RAG stores retrieved text passages. HGMem stores structured “memory points” that capture relationships between entities. A memory point is a semantic unit with explicit connections to the entities it involves, not raw text.
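To make the contrast concrete, here is the house-sale example above expressed both ways, using the simplified MemoryPoint sketch from earlier. This is illustrative pseudostructure, not the paper's storage format.

# One n-ary memory point keeps the transaction intact
sale = MemoryPoint(
    entities={"Alice", "Bob", "Carol", "house"},
    description="Alice sold the house to Bob for $500K through agent Carol.",
)

# The same fact forced into binary edges loses the unified transaction context
binary_edges = [
    ("Alice", "sold", "house"),
    ("Bob", "bought", "house"),
    ("Carol", "brokered_for", "Alice"),   # the price and the parties' roles scatter across edges
]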
Building from knowledge graphs
HGMem builds on top of a knowledge graph extracted from the document. The paper uses LightRAG’s extraction tool with GPT-4o to identify entities and relationships. This pre-processing creates the initial graph structure that HGMem’s memory evolves from.
Memory evolution operations
The hypergraph memory isn’t static. It evolves through three operations as reasoning progresses:
Update
Modifies the description of an existing memory point without changing which entities it connects. Use case: refining understanding based on new evidence.
Insertion
Creates new memory points for newly retrieved information. When retrieval finds relevant entities and relationships not yet in memory, insertion adds them as new hyperedges.
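A minimal sketch of what these first two operations might look like against the simplified structures above. The function names, signatures, and the llm helper are illustrative assumptions, not the paper's API.

def update(point: MemoryPoint, new_evidence: str, query: str) -> None:
    # Refine the description in place; the entity set is unchanged
    point.description = llm(
        f"Revise this memory for the query '{query}'.\n"
        f"Current: {point.description}\nNew evidence: {new_evidence}"
    )

def insert(memory: HypergraphMemory, entities: set[str], description: str) -> MemoryPoint:
    # Add a new hyperedge for newly retrieved information not yet in memory
    point = MemoryPoint(entities=entities, description=description)
    memory.memory_points.append(point)
    return point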
Merging
Combines existing memory points into higher-order composite units. If memory point A connects one set of entities and memory point B connects another set that overlaps with it, merging creates a new memory point over the union of both entity sets, with a synthesized description.
Picture a master weaver with two strips of silk: one showing a dragon, another showing a phoenix. By interlacing the threads where the patterns overlap, the weaver creates a single tapestry where dragon and phoenix soar together. The overlapping entities are where the threads interlock; the new synthesized description is the unified scene that emerges.
Merging transforms a collection of facts into integrated knowledge. The ablation study shows removing merging causes the largest performance drops: 10% on sense-making queries where integration is essential, but near-zero impact on simple fact lookups.
The LLM decides when to merge based on the current query and memory state:
def should_merge(a, b, query):
    # Check entity overlap
    shared = a.entities & b.entities
    if not shared:
        return False
    # Check semantic coherence
    prompt = f"""
    Memory A: {a.description}
    Memory B: {b.description}
    Query: {query}
    Coherent unit? (yes/no)
    """
    return llm(prompt).strip().lower() == "yes"

def merge(a, b, query):
    entities = a.entities | b.entities
    desc = llm(
        f"Synthesize: {a.description}"
        f" + {b.description}"
        f" for: {query}"
    )
    return MemoryPoint(entities, desc)
This creates “higher-order” memory points that capture complex relationships across multiple evidence sources.
Adaptive retrieval strategy
HGMem uses two complementary retrieval modes:
Local investigation
When memory already contains relevant information, local investigation explores the neighborhood of known entities. If the memory includes “Character A” connected to several relationships, local investigation retrieves other entities and facts connected to Character A in the underlying knowledge graph.
This is efficient for deepening understanding of areas already in memory.
Global exploration
When local investigation isn’t finding new relevant information, global exploration searches regions of the knowledge graph not yet represented in memory. This prevents the system from getting stuck in local optima.
The system alternates between these modes based on what the current query requires. If the subquery asks about something already in memory, local investigation runs first. If memory lacks relevant starting points, global exploration searches unexplored regions.
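As a rough sketch of that alternation: the selection rule below (overlap between the subquery and entities already in memory) is an assumption based on the description above, not the paper's exact criterion, and kg stands in for your knowledge-graph store with hypothetical neighbors_of and search helpers.

def retrieve(memory: HypergraphMemory, subquery: str, kg) -> list[str]:
    # Entities mentioned in the subquery that memory already knows about
    known = {name for name in memory.nodes if name.lower() in subquery.lower()}
    if known:
        # Local investigation: expand the neighborhood of known entities in the KG
        return kg.neighbors_of(known)
    # Global exploration: search KG regions not yet represented in memory
    return kg.search(subquery, exclude=set(memory.nodes))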
Benchmark results
The researchers tested HGMem against five strong baselines across six datasets: financial reports, government documents, legal texts, and long narratives.
Single-step RAG: GraphRAG and LightRAG build knowledge graphs from documents and retrieve from them. HippoRAG v2 uses knowledge triples for retrieval. Multi-step RAG: DeepRAG and ComoRAG iterate between retrieval and reasoning but use flat memory without hypergraph structure.
[Chart: HGMem performance vs. baselines; accuracy on long narrative understanding benchmarks]
Long document sense-making
[Chart: Longbench generative QA response quality; GPT-4o evaluation on 200K+ token documents]
HGMem provides more complete and diverse responses on 200K+ token documents. Both metrics are GPT-4o evaluations of response quality.
Narrative understanding
The benchmark chart above shows accuracy on three narrative understanding datasets. The largest gains appear on NarrativeQA (+10%), which requires integrating information across long novels. NoCha and Prelude show smaller but consistent improvements (+2-3%).
Primitive vs sense-making queries
An ablation study removes or disables components of a system to measure their individual contribution. Here, the researchers disabled the merging operation to see how much it matters. This helps identify which parts of a system are essential vs. optional.
The paper breaks down results by query type:
[Chart: Where HGMem shines, sense-making vs. simple queries; the 10% gain appears only on complex queries that require connecting facts]
Finding: HGMem’s advantage concentrates on sense-making queries. On primitive queries (simple fact lookup), performance is comparable or slightly lower than the version without merging. On sense-making queries requiring integration, HGMem shows 10%+ gains.
The paper also measured entities per hyperedge: sense-making queries produce hyperedges with ~2x more entities (7.07 vs 3.35 on NarrativeQA). This confirms that the merging operation creates higher-order structures specifically when complex integration is required.
This validates the paper’s thesis: flat memory suffices for simple retrieval, but structured memory with high-order correlations matters when reasoning requires connecting scattered evidence.
Step-by-step performance
[Chart: Performance by reasoning step; accuracy peaks at step 3, then shows diminishing returns]
The chart shows HGMem vs baseline accuracy across reasoning steps. Key findings:
- Steps 1-2 show steady improvement as memory accumulates relevant facts
- Steps 4+ show flat or declining accuracy, likely due to noise accumulation
- HGMem outperforms baselines at every step
Stop at Step 3. Performance peaks here across all benchmarks. Steps 4 and 5 waste tokens without improving accuracy. The paper notes “more steps bring no further improvements at a higher cost.” Set max_steps=3 and move on.
Cost analysis
HGMem adds 15-20 seconds and 3-4x the token cost of a standard RAG call due to the iterative loop. If you’re building a real-time chatbot, this is an eternity. HGMem is for batch processing, async workflows, or use cases where accuracy matters more than speed. Budget accordingly before deploying at scale.
HGMem adds computational overhead for memory evolution:
| Dataset | Tokens/Query | Latency |
|---|---|---|
| NarrativeQA | 4,436 | 15.8s |
| NoCha | 5,253 | 18.8s |
| Prelude | 5,422 | 19.4s |
The merging operation adds approximately 6% token overhead and less than 5% latency compared to the version without merging. For complex queries where accuracy matters, this trade-off is favorable.
Implementation blueprint
The paper gives you the architecture. This section gives you the shipping path. Below you’ll find the exact stack the researchers used, the parameters that worked, the prompts that drive the merge logic, and the cost estimates that determine whether HGMem makes sense for your workload. If you’re ready to stop reading and start building, start here.
Recommended stack
The paper’s codebase reveals the tools that actually work. These are the exact choices the authors used to get published results.
| Component | Recommendation | Notes |
|---|---|---|
| KG Extraction | LightRAG tool | Paper uses this with GPT-4o |
| Embeddings | bge-m3 | Paper’s choice for entity/relation vectors |
| Vector DB | nano-vectordb | Lightweight option from paper |
| Hypergraph | hypergraph-db | Python package for hypergraph operations |
| LLM | GPT-4o or Qwen2.5-32B | Both evaluated in paper |
Key parameters
These are the hyperparameters that produced the benchmark numbers. Start here; tune later.
| Parameter | Value | Context |
|---|---|---|
| Chunk size | 200 tokens | With 50-token overlap |
| Max steps | 3 | Optimal; more steps show diminishing returns |
| Temperature | 0.8 | For memory operations |
| Max tokens | 2,048 | For response generation |
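If you want these as a starting configuration, something like the following captures the table. The key names are illustrative, not the paper's config schema.

HGMEM_CONFIG = {
    "chunk_size": 200,       # tokens per chunk
    "chunk_overlap": 50,     # tokens of overlap between chunks
    "max_steps": 3,          # reasoning iterations; more shows diminishing returns
    "temperature": 0.8,      # for memory operations
    "max_tokens": 2048,      # for response generation
}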
Data structures
Everything in HGMem flows through two core types: the memory point (a hyperedge) and the hypergraph that holds them. Get these right and the rest follows.
The hyperedge (memory point) schema:
interface MemoryPoint {
  id: string;
  entities: string[];        // 2+ entity IDs
  description: string;       // Synthesized relationship
  source_chunks: string[];   // Original text refs
  created_at_step: number;
  merged_from?: string[];    // Parent memory IDs
}

interface HypergraphMemory {
  nodes: Map<string, Entity>;
  hyperedges: Map<string, MemoryPoint>;
}
Core workflow
The system operates in two phases: an offline indexing step where you build the knowledge graph, and an online query loop where the hypergraph memory evolves through retrieve-reason-evolve cycles.
[Diagram: HGMem pipeline, from raw documents to evolved memory in 3 iterations]
Implementation steps
Here’s the sequence from raw documents to working system. Each step builds on the previous one.
1. Pre-process documents into a knowledge graph using entity/relation extraction. This is done offline before queries. (Warning: this is slow and expensive. Extracting a KG from a 200-page document with GPT-4o can take minutes and cost dollars per document. Budget for this upfront investment.)
2. Embed entities and relationships using bge-m3 or similar. Store them in a vector database for retrieval.
3. For each query, initialize an empty hypergraph memory and run the multi-step loop (see the sketch after this list):
   - Generate subqueries based on current memory
   - Retrieve relevant entities via the local/global strategy
   - Evolve memory through update/insert/merge
   - Check if memory suffices; if not, continue
4. Generate a response using the final memory state as context.
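Here is a compact sketch of that online loop. It reuses the simplified HypergraphMemory and retrieve sketches from earlier in this post; the llm, summarize, and evolve_memory helpers are hypothetical placeholders, not the released code.

def answer_query(query: str, kg, max_steps: int = 3) -> str:
    memory = HypergraphMemory()
    for step in range(max_steps):
        # 1. Decide what is still missing given the current memory state
        subquery = llm(
            f"Query: {query}\nMemory: {summarize(memory)}\n"
            "What should we look up next?"
        )
        # 2. Retrieve via local investigation or global exploration
        evidence = retrieve(memory, subquery, kg)
        # 3. Evolve memory: update / insert / merge
        evolve_memory(memory, evidence, query)
        # 4. Stop early if memory already suffices
        done = llm(f"Can this memory answer '{query}'? (yes/no)\n{summarize(memory)}")
        if done.strip().lower() == "yes":
            break
    # Generate the final response from the evolved memory state
    return llm(f"Answer '{query}' using only this memory:\n{summarize(memory)}")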
The merge prompt
The merge operation is where HGMem’s performance gains come from. Get this prompt wrong and you’ll either over-merge (creating noise) or under-merge (missing connections). Here’s a template that balances both:
SYSTEM: You are a memory manager for a RAG system.
Your job is to decide if two memory points should merge.

RULES:
- Merge if they describe the same event from different angles
- Merge if combining them answers the query better than either alone
- Do NOT merge if they're unrelated facts that happen to share an entity

INPUT:
Memory A: {mem_a.description}
Entities: {mem_a.entities}
Memory B: {mem_b.description}
Entities: {mem_b.entities}
Query: {current_query}

OUTPUT (JSON):
{
  "should_merge": true/false,
  "reason": "...",
  "merged_description": "..."   // if true
}
The key insight: merge when facts explain each other, not just when they share entities.
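Wiring that decision into code might look like the sketch below. It assumes the same hypothetical llm wrapper with JSON mode used in the router snippet later in this post, and inlines the prompt rather than reusing the template verbatim.

import json

def decide_merge(mem_a, mem_b, current_query: str) -> dict:
    # If you store the template above and call str.format on it, escape its literal JSON braces as {{ }}
    prompt = (
        "Decide if two memory points should merge.\n"
        f"Memory A: {mem_a.description}\nEntities: {mem_a.entities}\n"
        f"Memory B: {mem_b.description}\nEntities: {mem_b.entities}\n"
        f"Query: {current_query}\n"
        'Respond in JSON: {"should_merge": true/false, "reason": "...", "merged_description": "..."}'
    )
    return json.loads(llm(prompt, response_format={"type": "json_object"}))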
Common pitfalls
These are the mistakes that won’t show up in your unit tests but will ruin production. Learn from teams that shipped before you.
1. Knowledge graph quality
If your entity extraction is bad, HGMem will amplify the errors, confidently building and merging relationships on top of them. Garbage in, garbage out.
Think of it like carving a marble statue from a cracked block of stone. Every fissure becomes a flaw in the final sculpture. Only a flawless slab produces a statue that holds together under the chisel’s pressure. Your KG is the raw marble.
Before debugging your memory system, check your extraction pipeline. Run a sample of 50 documents through your KG builder and manually verify the output. If entities are missing or relationships are wrong, fix that first.
2. Over-merging
The LLM decides when to merge memory points. Without constraints, it might merge too aggressively, creating memory points that are too broad. The paper’s prompts (Appendix B) include guidance on when merging is appropriate.
3. Step limits
Performance plateaus around step 3. Running more steps increases cost without improving accuracy. Implement hard limits.
4. Primitive query overhead
For simple fact-lookup queries, HGMem adds unnecessary overhead. Consider routing simple queries to a faster baseline system.
5. Query routing (make this production-ready)
Nobody should run a 15-second query for “What was Q3 revenue?” Build a router:
import json

def route_query(query: str) -> str:
    # Pro-tip: use JSON mode for reliability over text parsing
    prompt = f"""
    Classify: "{query}"
    SIMPLE: one fact needed
    COMPLEX: connect multiple facts
    JSON: {{"type": "SIMPLE/COMPLEX"}}
    """
    result = llm(
        prompt,
        response_format={"type": "json_object"}
    )
    qtype = json.loads(result)["type"]
    if qtype == "SIMPLE":
        return vector_rag(query)   # ~1s
    else:
        return hgmem_rag(query)    # ~15s
This hybrid approach captures HGMem’s gains where they matter while keeping simple lookups fast.
Before you build
Not every RAG system needs hypergraph memory. Run through this checklist before investing engineering time:
- Long documents? Are your docs 50+ pages where context windows struggle?
- “Why” questions? Do users ask questions requiring multi-fact synthesis?
- Can you wait 15s? Is this batch/async, not real-time chat?
If you checked all three, HGMem is worth the investment. If not, standard RAG or GraphRAG may be simpler paths.
Bill of materials
What will this cost to run? Here’s a realistic breakdown using GPT-4o pricing. Your mileage will vary based on document density and query complexity.
| Task | Cost (GPT-4o) | Time |
|---|---|---|
| Index 100-page PDF | ~$1.50 | 3-5 min |
| Index 1000 pages | ~$15 | 30-50 min |
| Single query | ~$0.10-0.15 | 15-20s |
| 100 queries/day | ~$10-15/day | — |
These are rough estimates. Actual costs depend on document density, query complexity, and how often the merge operation triggers.
MVP in an afternoon
Want to test the concept without setting up Neo4j, Docker, or cloud infrastructure? You can build a working prototype in 2-3 hours with Python’s standard library and NetworkX:
# No Docker, no Neo4j, no cloud
import json
import networkx as nx

# In-memory hypergraph
G = nx.Graph()

# Store as JSON
def save_memory(G, path="memory.json"):
    data = nx.node_link_data(G)
    with open(path, "w") as f:
        json.dump(data, f)

def load_memory(path="memory.json"):
    with open(path) as f:
        data = json.load(f)
    return nx.node_link_graph(data)
NetworkX handles graph operations. JSON handles persistence. Add OpenAI calls for the merge logic. You can have a working prototype in 2-3 hours without spinning up infrastructure.
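One caveat: nx.Graph only stores binary edges, so a common workaround is to model each hyperedge as its own node linked to the entities it connects (a bipartite encoding). This is an assumption for the prototype, not something the paper prescribes.

def add_memory_point(G, point_id: str, entities: list[str], description: str):
    # Represent the hyperedge as a dedicated node carrying the description...
    G.add_node(point_id, kind="memory_point", description=description)
    # ...connected to every entity it involves
    for entity in entities:
        G.add_node(entity, kind="entity")
        G.add_edge(point_id, entity)

add_memory_point(
    G, "HP-001",
    ["Protagonist", "Antagonist", "Crown"],
    "The protagonist seeks the Crown, which the antagonist once wielded and which corrupts its bearer.",
)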
Test case: the conflicting facts test
Your system is running. How do you verify the merge operation actually synthesizes information instead of just concatenating? Here’s a validation test that separates working implementations from broken ones.
Feed it a document where:
- Page 1 says: “Q3 revenue increased 12%”
- Page 50 says: “After adjustments, Q3 revenue decreased 3%”
Pass: HGMem merges these into: “Q3 revenue initially appeared to increase 12% but after accounting adjustments showed a 3% decrease.”
Fail: System returns both facts as isolated statements, leaving the user confused about which is true.
This test validates that the merge operation is synthesizing, not just concatenating.
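A minimal version of this check, assuming hypothetical index_document and hgmem_rag entry points and using simple keyword assertions rather than an LLM judge:

def test_conflicting_revenue_facts():
    document = (
        "Page 1: Q3 revenue increased 12%.\n"
        "Page 50: After adjustments, Q3 revenue decreased 3%."
    )
    index_document(document)                       # hypothetical offline KG-building step
    answer = hgmem_rag("What happened to Q3 revenue?")
    # A synthesized answer mentions both the headline increase and the adjusted decrease
    assert "12%" in answer and "3%" in answer
    assert "adjust" in answer.lower()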
Practical applications
The benchmarks suggest several use cases that align with HGMem’s strengths.
Financial report analysis
Aligns with: Longbench Financial (266K tokens, comprehensiveness/diversity metrics)
Questions like “How did the company’s risk profile change across divisions?” require integrating information from multiple sections. HGMem’s memory can connect facts about different business units into coherent higher-order structures.
Legal document review
Aligns with: Longbench Legal (194K tokens)
Legal analysis often requires connecting precedents, statutes, and case facts. The hypergraph structure can represent multi-party relationships and evolving legal positions.
Narrative understanding
Aligns with: NarrativeQA, NoCha, Prelude (novels and long-form fiction)
Character relationships in novels involve complex n-ary relationships. “Character A defeats Character B using the artifact that Character C forged for Character D” fits naturally into a hyperedge.
Research synthesis
Not directly evaluated, but the architecture suits: Multi-document QA where evidence must be integrated across papers.
When to use HGMem
Use HGMem when:
- Queries require integration. If your use case involves sense-making queries that connect scattered evidence, HGMem’s structured memory provides meaningful gains.
- Documents are long. The evaluated datasets range from 140K to 280K tokens. HGMem is designed for contexts where flat memory becomes unwieldy.
- Relationships are complex. If your domain involves n-ary relationships (legal cases with multiple parties, organizational structures, narrative plots), hypergraph representation captures structure that binary graphs lose.
- You have good KG extraction. HGMem builds on a pre-extracted knowledge graph. If your domain supports high-quality entity/relation extraction, HGMem can leverage that structure.
Consider alternatives when:
- Queries are mostly primitive. For simple fact lookup, the overhead of hypergraph memory isn’t justified. Standard RAG or simpler multi-step systems will perform comparably at lower cost.
- Latency is critical. HGMem adds 15-20 seconds per query for the evaluated configurations. Real-time applications may need faster approaches.
- KG extraction is unreliable. If entity extraction in your domain produces poor results, HGMem’s foundation will be weak. The system amplifies KG quality, good or bad.
- Documents are short. For documents that fit comfortably in context, direct LLM processing may outperform retrieval-based approaches.
Limitations
Offline KG dependency
HGMem requires a pre-built knowledge graph. The quality of entity/relation extraction directly impacts performance. For domains where extraction is difficult (technical jargon, implicit relationships), this dependency is a constraint.
Sequential processing
Memory evolution happens step by step. Each step requires LLM calls for subquery generation, retrieval decisions, and memory operations. Parallelization opportunities are limited.
Diminishing returns
Performance plateaus at step 3 across benchmarks. The system doesn’t automatically detect when additional steps won’t help, so implementers must set hard limits.
Primitive query overhead
The paper acknowledges that for simple queries, HGMem “can introduce redundancy by incorporating higher-order correlations even when primitive facts suffice.” The system doesn’t adapt its complexity to query difficulty.
Model dependence
Results are demonstrated with GPT-4o and Qwen2.5-32B. The memory evolution prompts are tuned for these models. Adaptation to other models may require prompt engineering.
Paper: arXiv:2512.23959
Authors: Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu (WeChat AI, CUHK)
Code: github.com/Encyclomen/HGMem
Cite this paper
Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu (2025). HGMem: Hypergraph Memory for Multi-step RAG. arXiv 2025.