- The Problem. LLM agents either store everything (expensive, slow retrieval) or summarize aggressively (lose critical details). Neither approach handles long-term conversations well.
- The Solution. Two-tier memory inspired by human cognition: abstract "Notes" for distilled knowledge, concrete "Episodes" for raw events. Best-effort retrieval checks Notes first and falls back to Episodes only when needed. Conflict-aware reconsolidation keeps Notes accurate over time.
- The Results. 11.7 pp higher accuracy than the best baseline on LoCoMo (80.71% vs 69.03%), 3x faster retrieval than Mem0, and 53% fewer tokens than A-MEM. Memory reconsolidation alone contributes +5.85 points to overall accuracy.
Research Overview
If you have built a chatbot or AI assistant that needs to remember past conversations, you have faced this dilemma: store everything and drown in tokens, or summarize aggressively and lose the details that matter.
Current memory systems for LLM agents fall into two camps. Systems like MemGPT store raw conversation logs, creating a growing context that becomes expensive to search and process. Systems like A-MEM create compressed summaries, but these summaries lose nuance. When a user says "remember I prefer the blue one" six conversations ago, a summary might retain "user has color preferences" while losing the specific preference.
Memory is what separates a useful assistant from a forgetful one. Without effective long-term memory, every conversation starts from zero. The user has to re-explain their preferences, re-state their goals, and re-establish context. For agents running multi-step tasks over days or weeks, this is not just annoying, it breaks functionality.
HiMem takes inspiration from how human memory actually works. Cognitive scientists distinguish between semantic memory (abstract facts and knowledge) and episodic memory (specific experiences and events). HiMem mirrors this with two tiers: Note Memory for distilled knowledge, and Episode Memory for concrete conversation segments.
The key insight is that these two types complement each other. Notes give you fast answers to common questions ("What does this user prefer?"). Episodes give you the evidence when Notes are insufficient ("What exactly did they say about the blue one?"). HiMem's retrieval mechanism checks Note sufficiency first and only falls back to Episodes when needed.
Key results
| Metric | HiMem | Best Baseline | Improvement |
|---|---|---|---|
| LoCoMo (GPT-Score) | 80.71% | 69.03% (SeCom) | +11.68 pp |
| Single-Hop | 89.22% | 87.02% (SeCom) | +2.20 pp |
| Multi-Hop | 70.92% | 59.10% (SeCom) | +11.82 pp |
| Temporal | 74.77% | 68.54% (Mem0) | +6.23 pp |
pp = percentage points (absolute difference). GPT-Score is the primary metric using GPT-4o-mini as judge.
GPT-Score is an evaluation metric where a capable LLM (GPT-4o-mini) judges the correctness of an agent's answer, producing a percentage score. Unlike exact-match metrics, it reflects human-like judgment of semantic correctness and can handle paraphrased or contextually equivalent answers.
The efficiency numbers are equally striking: HiMem uses 53% fewer tokens than A-MEM while achieving higher accuracy, and retrieves 3x faster than Mem0.
The Memory Problem
To understand why HiMem matters, consider what happens when an agent tries to answer a question about past conversations.
Scenario 1: Raw storage approach. The agent searches through thousands of conversation turns. Most are irrelevant. The search is slow, token-expensive, and often returns fragments that lack context. "User mentioned blue" appears in 47 different conversations about 23 different topics.
Scenario 2: Summary approach. The agent queries a compressed knowledge base. The answer comes back: "User has expressed color preferences." Helpful, but not specific enough. Which color? For what? The summary compressed away the actionable detail.
Aggressive summarization is like taking meeting notes that say "discussed budget" without recording the actual numbers. You know the topic was covered, but you cannot act on it. For AI agents, this means giving vague responses when users expect specifics.
The fundamental tension is between retrieval efficiency and information preservation. Store less, search faster, but lose details. Store more, preserve details, but drown in retrieval costs.
HiMem resolves this by maintaining both layers simultaneously, with smart routing between them.
Architecture
HiMem organizes memory into two interconnected tiers, each serving a distinct purpose.
[Figure: HiMem hierarchical memory architecture. Two-tier memory with best-effort retrieval and conflict-aware self-evolution.]
Note Memory (abstract layer)
Note Memory stores distilled knowledge extracted from conversations. This includes:
- Facts about the user or domain (preferences, constraints, background)
- Preferences that guide future interactions (communication style, priorities)
- Profile information that persists across sessions (role, goals, history summary)
Notes are compact. A typical Note might be: "User prefers detailed technical explanations over high-level summaries. Mentioned working as a backend engineer on January 5th."
Episode Memory (concrete layer)
Episode Memory stores coherent conversation segments with full context. Each Episode contains:
- The actual dialogue turns (not summaries)
- Timestamps for temporal reasoning
- Links to relevant Notes (bidirectional references)
Episodes preserve the evidence that Notes summarize. When a Note says "user prefers blue," the linked Episode contains the full conversation where that preference was expressed.
Semantic linking
Notes and Episodes connect through semantic links. Each Note references the Episodes it was derived from. Each Episode references the Notes it supports. This bidirectional linking enables:
- Evidence retrieval: When a Note seems relevant but insufficient, follow links to source Episodes
- Note updating: When new Episodes contradict existing Notes, trigger reconsolidation
- Coherent responses: Combine abstract knowledge with concrete examples
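To make the linking concrete, here is a hypothetical sketch of a Note and its source Episode referencing each other. The field names mirror the data structures in the Implementation Blueprint below, and the IDs and contents echo the retrieval example later in this article; they are illustrative only.

```python
# Hypothetical records illustrating bidirectional Note <-> Episode links.
note = {
    "id": "N-042",
    "type": "preference",
    "content": "User prefers blue for UI elements.",
    "source_episodes": ["E-342"],   # evidence the Note was distilled from
}

episode = {
    "id": "E-342",
    "turns": [
        {"speaker": "user", "timestamp": "Nov 2, 14:07",
         "content": "Let's make the dashboard buttons blue for the next release."},
    ],
    "note_refs": ["N-042"],         # Notes this Episode supports
}
```

Retrieval that starts from the Note can follow source_episodes straight to the raw evidence; a newly ingested Episode can follow note_refs to find the Notes that may need reconsolidation.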
Dual-Channel Segmentation
Before storing Episodes, the system must decide where one Episode ends and another begins. HiMem uses a cognitively-inspired approach: episodes should group "cognitively coherent" content.
Human memory research shows that we naturally segment experiences at event boundaries: moments when the situation changes meaningfully. HiMem implements this with two parallel detection channels.
Topic channel
The topic channel detects shifts in discourse goals. When the conversation moves from "discussing project timeline" to "planning team lunch," that is an episode boundary. The system tracks:
- Discourse goal changes (what the user is trying to accomplish)
- Subtopic transitions within a broader goal
- Entity shifts (talking about different people, objects, or concepts)
Event-surprise channel
The topic channel alone misses abrupt changes that do not shift the topic cleanly. The event-surprise channel catches these by detecting:
- Sudden intent changes ("Actually, forget that, let me ask something else")
- Emotional salience markers (frustration, excitement, urgency)
- Unexpected information that breaks prediction (the "wait, what?" moments)
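The article does not specify how each channel's score is computed, only that each turn transition must reduce to a number that can be thresholded. Below is a minimal sketch under two assumptions: topic shift as cosine distance between turn embeddings, and surprise as a 0-to-1 rating from an LLM judge. The helper names match the segmentation code in the Implementation Blueprint; `llm` is a hypothetical prompt-in, text-out callable.

```python
import numpy as np

def compute_topic_distance(prev_embedding, curr_embedding):
    """Cosine distance between consecutive turn embeddings (0 = same direction)."""
    a, b = np.asarray(prev_embedding), np.asarray(curr_embedding)
    cosine_sim = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cosine_sim

def compute_surprise_score(prev_content, curr_content, llm=None):
    """Ask an LLM judge how unexpected the new turn is, on a 0.0-1.0 scale."""
    prompt = (
        "On a scale from 0.0 (fully expected) to 1.0 (completely unexpected), "
        "how surprising is the second message given the first? Reply with a number only.\n"
        f"First: {prev_content}\nSecond: {curr_content}"
    )
    try:
        return float(llm(prompt))
    except (TypeError, ValueError):
        return 0.0   # no judge available or unparsable reply: treat as unsurprising
```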
OR fusion rule
An OR-fusion rule triggers a boundary if any detection signal exceeds its threshold. An AND-fusion rule would require all signals to fire simultaneously. OR-fusion is more sensitive, catching boundaries that AND-fusion would miss, better matching how humans naturally segment experiences.
An episode boundary triggers when EITHER channel fires. This is more sensitive than requiring both, which matches how human event segmentation works. A topic shift creates a boundary. A surprise moment creates a boundary. Both create a boundary.
The result is episodes that feel "coherent" to humans reviewing them later. Content within an episode relates logically. Content across episodes addresses different concerns.
Example: Segmentation in action
Conversation excerpt:
- 09:12 User: "Can you draft the project timeline?"
- 09:13 Assistant: "Sure, what's the target launch date?"
- 09:14 User: "We're aiming for Q3."
- 09:15 User: "By the way, I'm allergic to peanuts, so don't schedule the lunch meeting at the downtown café."
What happens:
- Topic channel: No boundary between 09:12-09:14 (discourse goal = project planning stays constant).
- Surprise channel: The allergy mention at 09:15 triggers a high surprise score (0.86 > 0.8 threshold).
- OR fusion: Boundary inserted before the allergy statement.
Resulting episodes:
- Episode A (09:12-09:14): Project planning turns
- Episode B (09:15): Personal preference, isolated for dietary-related retrieval
The surprise channel caught a boundary that topic analysis alone would have missed.
Best-Effort Retrieval
Most memory systems use a single retrieval strategy. HiMem introduces "best-effort" retrieval that adapts based on query complexity.
Step 1: Query Note Memory
Every query first searches Note Memory. For many questions, Notes provide sufficient answers:
- "What does the user prefer?" → Check preference Notes
- "What is the project deadline?" → Check fact Notes
- "Who is the main stakeholder?" → Check profile Notes
Notes are compact and fast to search. If Note Memory returns a confident answer, the system responds immediately.
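The search itself can be an ordinary embedding-similarity ranking over stored Notes. Below is a minimal in-memory sketch of a `search_notes` helper like the one called in the Implementation Blueprint's retrieval flow; `notes` and `embed` are passed explicitly here for clarity, and a production system would typically keep precomputed Note embeddings in the Note store rather than re-embedding per query.

```python
import numpy as np

def search_notes(query, notes, embed, top_k=5):
    """Return the top_k Notes most similar to the query by cosine similarity."""
    q = np.asarray(embed(query))
    scored = []
    for note in notes:
        v = np.asarray(embed(note.content))
        similarity = float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v))
        scored.append((similarity, note))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [note for _, note in scored[:top_k]]
```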
Step 2: Check sufficiency
The system evaluates whether Note results are sufficient. Sufficiency depends on:
- Coverage: Do the Notes address the query's key aspects?
- Specificity: Is the information detailed enough to answer?
- Recency: Are the Notes current or potentially stale?
For questions like "What color does the user prefer?", a Note saying "user prefers blue for UI elements" is sufficient. For questions like "What exactly did the user say about the button color on Tuesday?", the Note is insufficient.
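How the sufficiency score is computed is left open; one simple option is to ask an LLM judge for a single confidence value covering these three criteria and compare it against the 0.85 sufficiency threshold from the Key parameters table. A minimal sketch, with `llm` again a hypothetical prompt-in, text-out callable:

```python
def is_sufficient(query, notes, llm, threshold=0.85):
    """Return True if the retrieved Notes alone answer the query confidently."""
    if not notes:
        return False
    notes_text = "\n".join(f"- {note.content}" for note in notes)
    prompt = (
        "Rate from 0.0 to 1.0 how completely the notes below answer the question, "
        "considering coverage, specificity, and recency. Reply with a number only.\n"
        f"Question: {query}\nNotes:\n{notes_text}"
    )
    try:
        confidence = float(llm(prompt))
    except (TypeError, ValueError):
        return False   # if the judge fails, fall back to Episode retrieval
    return confidence >= threshold
```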
Step 3: Fallback to Episode Memory
When Notes are insufficient, the system queries Episode Memory for raw evidence. The semantic links from Notes guide this search: instead of searching all Episodes, the system follows links from relevant Notes to their source Episodes.
This targeted search is faster than brute-force Episode retrieval and more likely to find relevant content.
Why this matters
Best-effort retrieval prevents two failure modes:
- Over-reliance on summaries: Systems that only use summaries cannot answer detail-oriented questions
- Wasted computation: Systems that always search raw logs spend tokens on questions that Notes could answer
By checking Note sufficiency first, HiMem gets the best of both worlds: fast answers when possible, detailed answers when necessary.
Example: Retrieval in action
User query: "What color did I ask you to use for the dashboard buttons last Tuesday?"
- Note lookup: The system finds Note N-042: "User prefers blue for UI elements (created Nov 2)." Confidence score: 0.78, below the 0.85 sufficiency threshold because the Note lacks the day-specific detail.
- Sufficiency check: The Note does not answer "last Tuesday," so the system deems it insufficient.
- Episode fallback: Following the Note's source_episodes link, the system retrieves Episode E-342: User (Nov 2, 14:07): "Let's make the dashboard buttons blue for the next release."
- Response: "You asked for blue buttons on Tuesday, November 2nd."
The Note search was cheap (milliseconds). The Episode retrieval was targeted (one Episode, not thousands). The answer was precise.
Memory Self-Evolution
Static memory systems accumulate contradictions over time. A user might say "I prefer blue" in January and "Actually, I prefer green" in March. A naive system stores both facts. When queried, it returns confused or contradictory information.
HiMem addresses this with conflict-aware memory reconsolidation, inspired by how human memory updates itself.
In cognitive science, reconsolidation is the process where retrieved memories become malleable and can be updated before being re-stored. HiMem adapts this concept: when new evidence contradicts existing Notes, the system revises them rather than blindly accumulating conflicting facts.
Trigger: Retrieval failure detection
Self-evolution triggers when best-effort retrieval produces insufficient results. This means:
- Note Memory returned no confident answer
- Episode Memory provided new evidence not reflected in Notes
This is exactly when the system has learned something that existing Notes do not capture.
Process: Conflict detection and resolution
When new Episode evidence enters the system, it is compared against existing Notes. Three outcomes are possible:
- Independent: New information does not relate to existing Notes. Action: ADD a new Note.
- Extends: New information adds detail to existing Notes without contradiction. Action: UPDATE the existing Note with additional information.
- Contradicts: New information conflicts with existing Notes. Action: REVISE the existing Note, replacing outdated information.
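How the relation between a candidate and an existing Note is detected is an implementation choice; a common pattern is to have an LLM label the pair directly. Below is a minimal sketch of the `is_extension` / `is_contradiction` helpers used in the reconsolidation code later in this article, with the LLM passed explicitly as a hypothetical prompt-in, text-out callable.

```python
def classify_relation(candidate_content, existing_content, llm):
    """Label a new candidate Note against an existing Note."""
    prompt = (
        "Compare the NEW statement with the EXISTING note and reply with exactly one "
        "word: INDEPENDENT, EXTENDS, or CONTRADICTS.\n"
        f"EXISTING: {existing_content}\nNEW: {candidate_content}"
    )
    answer = llm(prompt).strip().upper()
    return answer if answer in {"INDEPENDENT", "EXTENDS", "CONTRADICTS"} else "INDEPENDENT"

def is_extension(candidate, existing, llm):
    return classify_relation(candidate.content, existing.content, llm) == "EXTENDS"

def is_contradiction(candidate, existing, llm):
    return classify_relation(candidate.content, existing.content, llm) == "CONTRADICTS"
```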
Example: Reconsolidation in action
January 5 (Episode E-101):
User: "I love the blue theme for the dashboard; it feels calm."
System creates Note N-001:
"User prefers blue for dashboard UI (created Jan 5)."
March 12 (Episode E-237):
User: "Since we switched to dark mode, green accents are easier on my eyes."
Conflict detection: The new Episode contains a preference for green that directly opposes the existing blue preference.
Reconsolidation steps:
- Identify conflict — color preference mismatch detected
- Create revised Note N-001′:
"User prefers green accents for dark-mode UI (updated Mar 12). Previously preferred blue (Jan 5)."
- Archive history — keep links to both source Episodes (E-101 and E-237)
Result:
- Query "What color does the user prefer now?" → returns green
- Query "What did the user originally prefer?" → follows history link to Episode E-101 and reports blue
The system automatically resolved the contradiction, updated its abstract knowledge, and retained the original evidence for auditability.
Impact
Ablation studies show reconsolidation contributes +5.85 points to average accuracy, more than any other component. Without it, memory systems degrade as conversations accumulate contradictions.
[Figure: Component ablation study. Memory reconsolidation contributes the most (+5.85 points), followed by hierarchical memory (+4.30).]
Benchmark Results
HiMem was evaluated on two benchmarks designed to test long-term conversational memory.
LoCoMo benchmark
LoCoMo (Long-Context Memory) is a benchmark that evaluates conversational agents on tasks like single-hop, multi-hop, and temporal reasoning across hundreds of dialogue turns. It measures how well a system remembers and reasons over extended interactions, averaging 600 turns (~16K tokens) per conversation.
LoCoMo tests different reasoning types over multi-turn conversations:
[Figure: LoCoMo benchmark performance by reasoning category (single-hop, multi-hop, temporal, open-domain) for HiMem and baselines.]
HiMem shows strong improvement across most reasoning categories:
- Single-hop (+2.20 pp): Direct fact retrieval benefits from Note organization
- Multi-hop (+11.82 pp): Best-effort retrieval enables complex reasoning chains across distant turns
- Temporal (+6.23 pp): Episode timestamps and segmentation support time-based queries
- Open-domain (-5.21 pp): HiMem underperforms SeCom here, possibly due to over-abstraction of external knowledge
The largest gains appear in Multi-Hop reasoning, where HiMem's two-tier structure shines. By first checking Notes and then following semantic links to Episodes, the system can efficiently trace complex reasoning chains that span many conversation turns.
Efficiency comparison
[Figure: Efficiency comparison of latency and token usage. HiMem retrieves 3x faster than Mem0 and uses 53% fewer tokens than A-MEM.]
HiMem is not just more accurate, it is more efficient:
| Metric | HiMem | Mem0 | A-MEM |
|---|---|---|---|
| Latency | 1.53s | 4.53s | 0.93s |
| Tokens/query | 1,272 | 1,583 | 2,700 |
A-MEM is faster but uses 2x the tokens. Mem0 uses similar tokens but is 3x slower. HiMem achieves the best accuracy while balancing both concerns.
Breaking down the numbers:
- Token savings vs A-MEM: 2,700 → 1,272 tokens = 1,428 tokens saved per query (53% reduction). At scale, this translates to significant cost savings.
- Speed vs Mem0: 4.53s → 1.53s = 3 seconds saved per query (3× faster). For real-time applications, this is the difference between "instant" and "noticeable delay."
- The tradeoff: A-MEM achieves 0.93s latency but at 2,700 tokens—over twice HiMem's cost. HiMem finds the sweet spot: fast enough for interactive use, efficient enough for production budgets.
Practical Applications
HiMem's architecture suits several real-world scenarios.
Personal AI assistants
Virtual assistants that remember user preferences across sessions. "Schedule my dentist appointment" should recall that the user prefers morning appointments and works from home on Fridays, without asking every time.
Customer service agents
Support agents handling returning customers. "I am having the same problem again" should retrieve the full history of previous issues, including what was tried and what worked, not just "customer had technical issues."
Multi-turn dialogue systems
Any conversation spanning many turns where context matters. Therapy bots, tutoring systems, project management assistants. Each benefits from memory that preserves both the summary ("student struggles with calculus") and the evidence ("here is where they got confused last Tuesday").
Enterprise knowledge agents
Agents that operate within organizations, accumulating institutional knowledge over time. Meeting summaries, project updates, team preferences. Notes provide fast organizational memory; Episodes preserve the decisions and discussions behind them.
Implementation Blueprint
The HiMem architecture can be implemented with standard components. Here is a practical approach.
Recommended stack
| Layer | Options | Notes |
|---|---|---|
| Note Storage | PostgreSQL, SQLite | Structured storage for searchable facts |
| Episode Storage | Vector DB (Qdrant, Pinecone) | Semantic search over raw segments |
| Embedding Model | text-embedding-3-small | Balance of quality and cost |
| LLM | GPT-4o, Claude 3.5 | For sufficiency checking and reconsolidation |
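As one concrete wiring of the Note layer, the structured fields map directly onto a relational schema. The sketch below uses SQLite from the Python standard library, with Note-to-Episode links kept in a join table; table and column names are illustrative, not prescribed by the paper.

```python
import sqlite3

conn = sqlite3.connect("himem.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS notes (
    id         TEXT PRIMARY KEY,
    content    TEXT NOT NULL,
    type       TEXT CHECK (type IN ('fact', 'preference', 'profile')),
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS note_episode_links (
    note_id    TEXT REFERENCES notes(id),
    episode_id TEXT NOT NULL   -- ID of a segment stored in the Episode vector DB
);
""")
conn.commit()
```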
Core data structures
The system operates on two primary types:
from datetime import datetime

class Turn:                     # Assumed shape of a single dialogue turn
    speaker: str
    content: str
    timestamp: datetime

class Note:
    id: str
    content: str                # Distilled knowledge
    type: str                   # fact, preference, profile
    source_episodes: list[str]
    created_at: datetime
    updated_at: datetime

class Episode:
    id: str
    turns: list[Turn]           # Raw dialogue
    note_refs: list[str]
    start_time: datetime
    end_time: datetime
Segmentation implementation
Episode boundaries use a scoring function combining topic and surprise signals:
def should_segment(prev_turn, curr_turn):
    # Topic channel: semantic distance between consecutive turn embeddings
    topic_shift = compute_topic_distance(
        prev_turn.embedding,
        curr_turn.embedding
    )
    # Event-surprise channel: how unexpected the new turn is
    surprise = compute_surprise_score(
        prev_turn.content,
        curr_turn.content
    )
    # OR fusion: segment if either exceeds threshold
    return topic_shift > 0.7 or surprise > 0.8
Best-effort retrieval flow
def retrieve(query):
    # Step 1: Check Note Memory
    notes = search_notes(query, top_k=5)

    # Step 2: Evaluate sufficiency
    if is_sufficient(query, notes):
        return format_response(notes)

    # Step 3: Fallback to Episodes
    episode_ids = get_linked_episodes(notes)
    episodes = fetch_episodes(episode_ids)
    return format_response(notes, episodes)
Reconsolidation trigger
def process_new_episode(episode):
    # Extract potential Notes from Episode
    candidate_notes = extract_notes(episode)

    for candidate in candidate_notes:
        existing = find_related_notes(candidate)

        if not existing:
            # Independent: ADD
            create_note(candidate, episode.id)
        elif is_extension(candidate, existing):
            # Extends: UPDATE
            update_note(existing, candidate)
        elif is_contradiction(candidate, existing):
            # Contradicts: REVISE
            revise_note(existing, candidate)
Key parameters
These values produced the benchmark results:
| Parameter | Value | Purpose |
|---|---|---|
| Topic threshold | 0.7 | Cosine distance for segmentation |
| Surprise threshold | 0.8 | Surprise score for segmentation |
| Note retrieval top_k | 5 | Notes per query |
| Sufficiency threshold | 0.85 | Confidence for Note-only response |
| Episode link limit | 3 | Max Episodes per Note reference |
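A convenient way to keep these values together is a single configuration object; the class and field names below are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HiMemConfig:
    topic_threshold: float = 0.7         # cosine distance that opens an Episode boundary
    surprise_threshold: float = 0.8      # surprise score that opens an Episode boundary
    note_top_k: int = 5                  # Notes retrieved per query
    sufficiency_threshold: float = 0.85  # confidence required for a Note-only answer
    episode_link_limit: int = 3          # max Episodes followed per Note reference
```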
Pitfalls to avoid
1. Over-aggressive segmentation. Too many short Episodes fragment context. Start with higher thresholds and lower gradually.
2. Stale Note detection. Without reconsolidation, Notes become unreliable. Monitor contradiction rates as a health metric.
3. Circular references. Episode-Note links can create cycles. Use timestamps to determine authoritative source.
4. Token budget management. Episode retrieval can blow up context windows. Set hard limits on retrieved Episode length.
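For the fourth pitfall, the hard limit can be enforced before retrieved Episodes reach the prompt. A minimal sketch, where `count_tokens` is whatever tokenizer the deployment already uses and the 2,000-token budget is an arbitrary illustrative default:

```python
def cap_episode_context(episodes, count_tokens, max_tokens=2000):
    """Keep whole Episodes, newest first, until the token budget is spent."""
    kept, used = [], 0
    for episode in sorted(episodes, key=lambda e: e.end_time, reverse=True):
        cost = sum(count_tokens(turn.content) for turn in episode.turns)
        if used + cost > max_tokens:
            break
        kept.append(episode)
        used += cost
    return kept
```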
Limitations
Computational overhead
The two-tier system requires more processing than single-tier alternatives. Segmentation analysis, sufficiency checking, and reconsolidation all add compute. For high-volume applications, this overhead matters.
Cold start problem
The difficulty a system faces when it has little or no prior data about a user or topic, limiting its ability to generate useful summaries or predictions until enough interactions are collected. Common in recommender systems and now relevant for memory-augmented agents.
HiMem shines with accumulated history. New users or new topics lack the Note layer benefits. Early conversations behave like standard memory systems until Notes accumulate.
Domain sensitivity
The segmentation thresholds work well for general conversation but may need tuning for specialized domains. Technical discussions with frequent topic shifts might over-segment. Emotional conversations might under-segment.
Reconsolidation conflicts
When new evidence contradicts old Notes, the system must decide which is authoritative. The default is "newer wins," but some domains need more nuanced conflict resolution.
Evaluation limitations
The LoCoMo benchmark focuses on English conversational data. Performance on other languages, code-heavy contexts, or highly structured dialogues remains untested.
Authors: Yiming Zhang, Yifan Zeng, Zijian Lu (Zhejiang University); Shaoguang Mao, Furu Wei (Microsoft Research); Yan Xia, Wenshan Wu (Microsoft)
Cite this paper
Yiming Zhang, Yifan Zeng, Zijian Lu, Shaoguang Mao, Yan Xia, Wenshan Wu, Furu Wei (2026). HiMem: Hierarchical Memory That Actually Remembers What Matters. arXiv 2026.