- The Problem. LLM agents either store everything (expensive, slow retrieval) or summarize aggressively (lose critical details). Neither approach handles long-term conversations well.
- The Solution. Two-tier memory inspired by human cognition: abstract "Notes" for distilled knowledge, concrete "Episodes" for raw events. Best-effort retrieval checks Notes first and falls back to Episodes only when needed. Conflict-aware reconsolidation keeps Notes accurate over time.
- The Results. 11.7 pp higher accuracy than the best baseline on LoCoMo (80.71% vs 69.03%), 3x faster retrieval than Mem0, and 53% fewer tokens than A-MEM. Memory reconsolidation alone contributes +5.85 points to overall accuracy.
Research Overview
If you have built a chatbot or AI assistant that needs to remember past conversations, you have faced this dilemma: store everything and drown in tokens, or summarize aggressively and lose the details that matter.
Current memory systems for LLM agents fall into two camps. Systems like MemGPT store raw conversation logs, creating a growing context that becomes expensive to search and process. Systems like A-MEM create compressed summaries, but these summaries lose nuance. When a user says "remember I prefer the blue one" six conversations ago, a summary might retain "user has color preferences" while losing the specific preference.
Memory is what separates a useful assistant from a forgetful one. Without effective long-term memory, every conversation starts from zero. The user has to re-explain their preferences, re-state their goals, and re-establish context. For agents running multi-step tasks over days or weeks, this is not just annoying, it breaks functionality.
HiMem takes inspiration from how human memory actually works. Cognitive scientists distinguish between semantic memory (abstract facts and knowledge) and episodic memory (specific experiences and events). HiMem mirrors this with two tiers: Note Memory for distilled knowledge, and Episode Memory for concrete conversation segments.
The key insight is that these two types complement each other. Notes give you fast answers to common questions ("What does this user prefer?"). Episodes give you the evidence when Notes are insufficient ("What exactly did they say about the blue one?"). HiMem's retrieval mechanism checks Note sufficiency first and only falls back to Episodes when needed.
Key results
| Metric | HiMem | Best Baseline | Improvement |
|---|---|---|---|
| LoCoMo (GPT-Score) | 80.71% | 69.03% (SeCom) | +11.68 pp |
| Single-Hop | 89.22% | 87.02% (SeCom) | +2.20 pp |
| Multi-Hop | 70.92% | 59.10% (SeCom) | +11.82 pp |
| Temporal | 74.77% | 68.54% (Mem0) | +6.23 pp |
pp = percentage points (absolute difference). GPT-Score is the primary metric using GPT-4o-mini as judge.
GPT-Score is an evaluation metric where a capable LLM (GPT-4o-mini) judges the correctness of an agent's answer, producing a percentage score. Unlike exact-match metrics, it reflects human-like judgment of semantic correctness and can handle paraphrased or contextually equivalent answers.
The efficiency numbers are equally striking: HiMem uses 53% fewer tokens than A-MEM while achieving higher accuracy, and retrieves 3x faster than Mem0.
The Memory Problem
To understand why HiMem matters, consider what happens when an agent tries to answer a question about past conversations.
Scenario 1: Raw storage approach. The agent searches through thousands of conversation turns. Most are irrelevant. The search is slow, token-expensive, and often returns fragments that lack context. "User mentioned blue" appears in 47 different conversations about 23 different topics.
Scenario 2: Summary approach. The agent queries a compressed knowledge base. The answer comes back: "User has expressed color preferences." Helpful, but not specific enough. Which color? For what? The summary compressed away the actionable detail.
Aggressive summarization is like taking meeting notes that say "discussed budget" without recording the actual numbers. You know the topic was covered, but you cannot act on it. For AI agents, this means giving vague responses when users expect specifics.
The fundamental tension is between retrieval efficiency and information preservation. Store less, search faster, but lose details. Store more, preserve details, but drown in retrieval costs.
HiMem resolves this by maintaining both layers simultaneously, with smart routing between them.
Architecture
HiMem organizes memory into two interconnected tiers, each serving a distinct purpose.
[Figure: HiMem hierarchical memory architecture. Two-tier memory with best-effort retrieval and conflict-aware self-evolution.]
Note Memory (abstract layer)
Note Memory stores distilled knowledge extracted from conversations. This includes:
- Facts about the user or domain (preferences, constraints, background)
- Preferences that guide future interactions (communication style, priorities)
- Profile information that persists across sessions (role, goals, history summary)
Notes are compact. A typical Note might be: "User prefers detailed technical explanations over high-level summaries. Mentioned working as a backend engineer on January 5th."
Episode Memory (concrete layer)
Episode Memory stores coherent conversation segments with full context. Each Episode contains:
- The actual dialogue turns (not summaries)
- Timestamps for temporal reasoning
- Links to relevant Notes (bidirectional references)
Episodes preserve the evidence that Notes summarize. When a Note says "user prefers blue," the linked Episode contains the full conversation where that preference was expressed.
Semantic linking
Notes and Episodes connect through semantic links. Each Note references the Episodes it was derived from. Each Episode references the Notes it supports. This bidirectional linking enables:
- Evidence retrieval: When a Note seems relevant but insufficient, follow links to source Episodes
- Note updating: When new Episodes contradict existing Notes, trigger reconsolidation
- Coherent responses: Combine abstract knowledge with concrete examples
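To make the linking concrete, here is a hypothetical sketch of a Note and its source Episode referencing each other. The field names mirror the data structures in the Implementation Blueprint below, and the IDs and contents echo the retrieval example later in this article; they are illustrative only.

```python
# Hypothetical records illustrating bidirectional Note <-> Episode links.
note = {
    "id": "N-042",
    "type": "preference",
    "content": "User prefers blue for UI elements.",
    "source_episodes": ["E-342"],   # evidence the Note was distilled from
}

episode = {
    "id": "E-342",
    "turns": [
        {"speaker": "user", "timestamp": "Nov 2, 14:07",
         "content": "Let's make the dashboard buttons blue for the next release."},
    ],
    "note_refs": ["N-042"],         # Notes this Episode supports
}
```

Retrieval that starts from the Note can follow source_episodes straight to the raw evidence; a newly ingested Episode can follow note_refs to find the Notes that may need reconsolidation.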
Dual-Channel Segmentation
Before storing Episodes, the system must decide where one Episode ends and another begins. HiMem uses a cognitively-inspired approach: episodes should group "cognitively coherent" content.
Human memory research shows that we naturally segment experiences at event boundaries: moments when the situation changes meaningfully. HiMem implements this with two parallel detection channels.
Topic channel
The topic channel detects shifts in discourse goals. When the conversation moves from "discussing project timeline" to "planning team lunch," that is an episode boundary. The system tracks:
- Discourse goal changes (what the user is trying to accomplish)
- Subtopic transitions within a broader goal
- Entity shifts (talking about different people, objects, or concepts)
Event-surprise channel
The topic channel alone misses abrupt changes that do not shift the topic cleanly. The event-surprise channel catches these by detecting:
- Sudden intent changes ("Actually, forget that, let me ask something else")
- Emotional salience markers (frustration, excitement, urgency)
- Unexpected information that breaks prediction (the "wait, what?" moments)
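The article does not specify how each channel's score is computed, only that each turn transition must reduce to a number that can be thresholded. Below is a minimal sketch under two assumptions: topic shift as cosine distance between turn embeddings, and surprise as a 0-to-1 rating from an LLM judge. The helper names match the segmentation code in the Implementation Blueprint; `llm` is a hypothetical prompt-in, text-out callable.

```python
import numpy as np

def compute_topic_distance(prev_embedding, curr_embedding):
    """Cosine distance between consecutive turn embeddings (0 = same direction)."""
    a, b = np.asarray(prev_embedding), np.asarray(curr_embedding)
    cosine_sim = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cosine_sim

def compute_surprise_score(prev_content, curr_content, llm=None):
    """Ask an LLM judge how unexpected the new turn is, on a 0.0-1.0 scale."""
    prompt = (
        "On a scale from 0.0 (fully expected) to 1.0 (completely unexpected), "
        "how surprising is the second message given the first? Reply with a number only.\n"
        f"First: {prev_content}\nSecond: {curr_content}"
    )
    try:
        return float(llm(prompt))
    except (TypeError, ValueError):
        return 0.0   # no judge available or unparsable reply: treat as unsurprising
```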
OR fusion rule
An OR-fusion rule triggers a boundary if any detection signal exceeds its threshold. An AND-fusion rule would require all signals to fire simultaneously. OR-fusion is more sensitive, catching boundaries that AND-fusion would miss, better matching how humans naturally segment experiences.
An episode boundary triggers when EITHER channel fires. This is more sensitive than requiring both, which matches how human event segmentation works. A topic shift creates a boundary. A surprise moment creates a boundary. Both create a boundary.
The result is episodes that feel "coherent" to humans reviewing them later. Content within an episode relates logically. Content across episodes addresses different concerns.
Example: Segmentation in action
Conversation excerpt:
- 09:12 User: "Can you draft the project timeline?"
- 09:13 Assistant: "Sure, what's the target launch date?"
- 09:14 User: "We're aiming for Q3."
- 09:15 User: "By the way, I'm allergic to peanuts, so don't schedule the lunch meeting at the downtown café."
What happens:
- Topic channel: No boundary between 09:12-09:14 (discourse goal = project planning stays constant).
- Surprise channel: The allergy mention at 09:15 triggers a high surprise score (0.86 > 0.8 threshold).
- OR fusion: Boundary inserted before the allergy statement.
Resulting episodes:
- Episode A (09:12-09:14): Project planning turns
- Episode B (09:15): Personal preference, isolated for dietary-related retrieval
The surprise channel caught a boundary that topic analysis alone would have missed.
Best-Effort Retrieval
Most memory systems use a single retrieval strategy. HiMem introduces "best-effort" retrieval that adapts based on query complexity.
Step 1: Query Note Memory
Every query first searches Note Memory. For many questions, Notes provide sufficient answers:
- "What does the user prefer?" → Check preference Notes
- "What is the project deadline?" → Check fact Notes
- "Who is the main stakeholder?" → Check profile Notes
Notes are compact and fast to search. If Note Memory returns a confident answer, the system responds immediately.
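The search itself can be an ordinary embedding-similarity ranking over stored Notes. Below is a minimal in-memory sketch of a `search_notes` helper like the one called in the Implementation Blueprint's retrieval flow; `notes` and `embed` are passed explicitly here for clarity, and a production system would typically keep precomputed Note embeddings in the Note store rather than re-embedding per query.

```python
import numpy as np

def search_notes(query, notes, embed, top_k=5):
    """Return the top_k Notes most similar to the query by cosine similarity."""
    q = np.asarray(embed(query))
    scored = []
    for note in notes:
        v = np.asarray(embed(note.content))
        similarity = float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v))
        scored.append((similarity, note))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [note for _, note in scored[:top_k]]
```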
Step 2: Check sufficiency
The system evaluates whether Note results are sufficient. Sufficiency depends on:
- Coverage: Do the Notes address the query's key aspects?
- Specificity: Is the information detailed enough to answer?
- Recency: Are the Notes current or potentially stale?
For questions like "What color does the user prefer?", a Note saying "user prefers blue for UI elements" is sufficient. For questions like "What exactly did the user say about the button color on Tuesday?", the Note is insufficient.
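How the sufficiency score is computed is left open; one simple option is to ask an LLM judge for a single confidence value covering these three criteria and compare it against the 0.85 sufficiency threshold from the Key parameters table. A minimal sketch, with `llm` again a hypothetical prompt-in, text-out callable:

```python
def is_sufficient(query, notes, llm, threshold=0.85):
    """Return True if the retrieved Notes alone answer the query confidently."""
    if not notes:
        return False
    notes_text = "\n".join(f"- {note.content}" for note in notes)
    prompt = (
        "Rate from 0.0 to 1.0 how completely the notes below answer the question, "
        "considering coverage, specificity, and recency. Reply with a number only.\n"
        f"Question: {query}\nNotes:\n{notes_text}"
    )
    try:
        confidence = float(llm(prompt))
    except (TypeError, ValueError):
        return False   # if the judge fails, fall back to Episode retrieval
    return confidence >= threshold
```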
Step 3: Fallback to Episode Memory
When Notes are insufficient, the system queries Episode Memory for raw evidence. The semantic links from Notes guide this search: instead of searching all Episodes, the system follows links from relevant Notes to their source Episodes.
This targeted search is faster than brute-force Episode retrieval and more likely to find relevant content.
Why this matters
Best-effort retrieval prevents two failure modes:
- Over-reliance on summaries: Systems that only use summaries cannot answer detail-oriented questions
- Wasted computation: Systems that always search raw logs spend tokens on questions that Notes could answer
By checking Note sufficiency first, HiMem gets the best of both worlds: fast answers when possible, detailed answers when necessary.
Example: Retrieval in action
User query: "What color did I ask you to use for the dashboard buttons last Tuesday?"
- Note lookup: The system finds Note N-042: "User prefers blue for UI elements (created Nov 2)." Confidence score: 0.78, below the 0.85 sufficiency threshold because the Note lacks the day-specific detail.
- Sufficiency check: The Note does not answer "last Tuesday," so the system deems it insufficient.
- Episode fallback: Following the Note's source_episodes link, the system retrieves Episode E-342: User (Nov 2, 14:07): "Let's make the dashboard buttons blue for the next release."
- Response: "You asked for blue buttons on Tuesday, November 2nd."
The Note search was cheap (milliseconds). The Episode retrieval was targeted (one Episode, not thousands). The answer was precise.
Memory Self-Evolution
Static memory systems accumulate contradictions over time. A user might say "I prefer blue" in January and "Actually, I prefer green" in March. A naive system stores both facts. When queried, it returns confused or contradictory information.
HiMem addresses this with conflict-aware memory reconsolidation, inspired by how human memory updates itself.
In cognitive science, reconsolidation is the process where retrieved memories become malleable and can be updated before being re-stored. HiMem adapts this concept: when new evidence contradicts existing Notes, the system revises them rather than blindly accumulating conflicting facts.
Trigger: Retrieval failure detection
Self-evolution triggers when best-effort retrieval produces insufficient results. This means:
- Note Memory returned no confident answer
- Episode Memory provided new evidence not reflected in Notes
This is exactly when the system has learned something that existing Notes do not capture.
Process: Conflict detection and resolution
When new Episode evidence enters the system, it is compared against existing Notes. Three outcomes are possible:
- Independent: New information does not relate to existing Notes. Action: ADD a new Note.
- Extends: New information adds detail to existing Notes without contradiction. Action: UPDATE the existing Note with additional information.
- Contradicts: New information conflicts with existing Notes. Action: REVISE the existing Note, replacing outdated information.
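How the relation between a candidate and an existing Note is detected is an implementation choice; a common pattern is to have an LLM label the pair directly. Below is a minimal sketch of the `is_extension` / `is_contradiction` helpers used in the reconsolidation code later in this article, with the LLM passed explicitly as a hypothetical prompt-in, text-out callable.

```python
def classify_relation(candidate_content, existing_content, llm):
    """Label a new candidate Note against an existing Note."""
    prompt = (
        "Compare the NEW statement with the EXISTING note and reply with exactly one "
        "word: INDEPENDENT, EXTENDS, or CONTRADICTS.\n"
        f"EXISTING: {existing_content}\nNEW: {candidate_content}"
    )
    answer = llm(prompt).strip().upper()
    return answer if answer in {"INDEPENDENT", "EXTENDS", "CONTRADICTS"} else "INDEPENDENT"

def is_extension(candidate, existing, llm):
    return classify_relation(candidate.content, existing.content, llm) == "EXTENDS"

def is_contradiction(candidate, existing, llm):
    return classify_relation(candidate.content, existing.content, llm) == "CONTRADICTS"
```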
Example: Reconsolidation in action
January 5 (Episode E-101):
User: "I love the blue theme for the dashboard; it feels calm."
System creates Note N-001:
"User prefers blue for dashboard UI (created Jan 5)."
March 12 (Episode E-237):
User: "Since we switched to dark mode, green accents are easier on my eyes."
Conflict detection: The new Episode contains a preference for green that directly opposes the existing blue preference.
Reconsolidation steps:
- Identify conflict — color preference mismatch detected
- Create revised Note N-001′:
"User prefers green accents for dark-mode UI (updated Mar 12). Previously preferred blue (Jan 5)."
- Archive history — keep links to both source Episodes (E-101 and E-237)
Result:
- Query "What color does the user prefer now?" → returns green
- Query "What did the user originally prefer?" → follows history link to Episode E-101 and reports blue
The system automatically resolved the contradiction, updated its abstract knowledge, and retained the original evidence for auditability.
Impact
Ablation studies show reconsolidation contributes +5.85 points to average accuracy, more than any other component. Without it, memory systems degrade as conversations accumulate contradictions.
[Figure: Component ablation study. Memory reconsolidation contributes the most (+5.85 points), followed by hierarchical memory (+4.30).]
Benchmark Results
HiMem was evaluated on two benchmarks designed to test long-term conversational memory.
LoCoMo benchmark
LoCoMo (Long-Context Memory) is a benchmark that evaluates conversational agents on tasks like single-hop, multi-hop, and temporal reasoning across hundreds of dialogue turns. It measures how well a system remembers and reasons over extended interactions, averaging 600 turns (~16K tokens) per conversation.
LoCoMo tests different reasoning types over multi-turn conversations:
[Figure: LoCoMo benchmark performance by reasoning category (single-hop, multi-hop, temporal, open-domain) for HiMem and baselines.]
HiMem shows strong improvement across most reasoning categories:
- Single-hop (+2.20 pp): Direct fact retrieval benefits from Note organization
- Multi-hop (+11.82 pp): Best-effort retrieval enables complex reasoning chains across distant turns
- Temporal (+6.23 pp): Episode timestamps and segmentation support time-based queries
- Open-domain (-5.21 pp): HiMem underperforms SeCom here, possibly due to over-abstraction of external knowledge
The largest gains appear in Multi-Hop reasoning, where HiMem's two-tier structure shines. By first checking Notes and then following semantic links to Episodes, the system can efficiently trace complex reasoning chains that span many conversation turns.
Efficiency comparison
[Figure: Efficiency comparison of latency and token usage. HiMem retrieves 3x faster than Mem0 and uses 53% fewer tokens than A-MEM.]
HiMem is not just more accurate, it is more efficient:
| Metric | HiMem | Mem0 | A-MEM |
|---|---|---|---|
| Latency | 1.53s | 4.53s | 0.93s |
| Tokens/query | 1,272 | 1,583 | 2,700 |
A-MEM is faster but uses 2x the tokens. Mem0 uses similar tokens but is 3x slower. HiMem achieves the best accuracy while balancing both concerns.
Breaking down the numbers:
- Token savings vs A-MEM: 2,700 → 1,272 tokens = 1,428 tokens saved per query (53% reduction). At scale, this translates to significant cost savings.
- Speed vs Mem0: 4.53s → 1.53s = 3 seconds saved per query (3× faster). For real-time applications, this is the difference between "instant" and "noticeable delay."
- The tradeoff: A-MEM achieves 0.93s latency but at 2,700 tokens—over twice HiMem's cost. HiMem finds the sweet spot: fast enough for interactive use, efficient enough for production budgets.
Practical Applications
HiMem's architecture suits several real-world scenarios.
Personal AI assistants
Virtual assistants that remember user preferences across sessions. "Schedule my dentist appointment" should recall that the user prefers morning appointments and works from home on Fridays, without asking every time.
Customer service agents
Support agents handling returning customers. "I am having the same problem again" should retrieve the full history of previous issues, including what was tried and what worked, not just "customer had technical issues."
Multi-turn dialogue systems
Any conversation spanning many turns where context matters. Therapy bots, tutoring systems, project management assistants. Each benefits from memory that preserves both the summary ("student struggles with calculus") and the evidence ("here is where they got confused last Tuesday").
Enterprise knowledge agents
Agents that operate within organizations, accumulating institutional knowledge over time. Meeting summaries, project updates, team preferences. Notes provide fast organizational memory; Episodes preserve the decisions and discussions behind them.
Implementation Blueprint
The HiMem architecture can be implemented with standard components. Here is a practical approach.
Recommended stack
| Layer | Options | Notes |
|---|---|---|
| Note Storage | PostgreSQL, SQLite | Structured storage for searchable facts |
| Episode Storage | Vector DB (Qdrant, Pinecone) | Semantic search over raw segments |
| Embedding Model | text-embedding-3-small | Balance of quality and cost |
| LLM | GPT-4o, Claude 3.5 | For sufficiency checking and reconsolidation |
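As one concrete wiring of the Note layer, the structured fields map directly onto a relational schema. The sketch below uses SQLite from the Python standard library, with Note-to-Episode links kept in a join table; table and column names are illustrative, not prescribed by the paper.

```python
import sqlite3

conn = sqlite3.connect("himem.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS notes (
    id         TEXT PRIMARY KEY,
    content    TEXT NOT NULL,
    type       TEXT CHECK (type IN ('fact', 'preference', 'profile')),
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS note_episode_links (
    note_id    TEXT REFERENCES notes(id),
    episode_id TEXT NOT NULL   -- ID of a segment stored in the Episode vector DB
);
""")
conn.commit()
```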
Core data structures
The system operates on two primary types:
from datetime import datetime

class Turn:                     # Assumed shape of a single dialogue turn
    speaker: str
    content: str
    timestamp: datetime

class Note:
    id: str
    content: str                # Distilled knowledge
    type: str                   # fact, preference, profile
    source_episodes: list[str]
    created_at: datetime
    updated_at: datetime

class Episode:
    id: str
    turns: list[Turn]           # Raw dialogue
    note_refs: list[str]
    start_time: datetime
    end_time: datetime
Segmentation implementation
Episode boundaries use a scoring function combining topic and surprise signals:
def should_segment(prev_turn, curr_turn):
    # Topic channel: semantic distance between consecutive turn embeddings
    topic_shift = compute_topic_distance(
        prev_turn.embedding,
        curr_turn.embedding
    )
    # Event-surprise channel: how unexpected the new turn is
    surprise = compute_surprise_score(
        prev_turn.content,
        curr_turn.content
    )
    # OR fusion: segment if either exceeds threshold
    return topic_shift > 0.7 or surprise > 0.8
Best-effort retrieval flow
def retrieve(query):
    # Step 1: Check Note Memory
    notes = search_notes(query, top_k=5)

    # Step 2: Evaluate sufficiency
    if is_sufficient(query, notes):
        return format_response(notes)

    # Step 3: Fallback to Episodes
    episode_ids = get_linked_episodes(notes)
    episodes = fetch_episodes(episode_ids)
    return format_response(notes, episodes)
Reconsolidation trigger
def process_new_episode(episode):
    # Extract potential Notes from Episode
    candidate_notes = extract_notes(episode)

    for candidate in candidate_notes:
        existing = find_related_notes(candidate)

        if not existing:
            # Independent: ADD
            create_note(candidate, episode.id)
        elif is_extension(candidate, existing):
            # Extends: UPDATE
            update_note(existing, candidate)
        elif is_contradiction(candidate, existing):
            # Contradicts: REVISE
            revise_note(existing, candidate)
Key parameters
These values produced the benchmark results:
| Parameter | Value | Purpose |
|---|---|---|
| Topic threshold | 0.7 | Cosine distance for segmentation |
| Surprise threshold | 0.8 | Surprise score for segmentation |
| Note retrieval top_k | 5 | Notes per query |
| Sufficiency threshold | 0.85 | Confidence for Note-only response |
| Episode link limit | 3 | Max Episodes per Note reference |
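A convenient way to keep these values together is a single configuration object; the class and field names below are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HiMemConfig:
    topic_threshold: float = 0.7         # cosine distance that opens an Episode boundary
    surprise_threshold: float = 0.8      # surprise score that opens an Episode boundary
    note_top_k: int = 5                  # Notes retrieved per query
    sufficiency_threshold: float = 0.85  # confidence required for a Note-only answer
    episode_link_limit: int = 3          # max Episodes followed per Note reference
```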
Pitfalls to avoid
1. Over-aggressive segmentation. Too many short Episodes fragment context. Start with higher thresholds and lower gradually.
2. Stale Note detection. Without reconsolidation, Notes become unreliable. Monitor contradiction rates as a health metric.
3. Circular references. Episode-Note links can create cycles. Use timestamps to determine authoritative source.
4. Token budget management. Episode retrieval can blow up context windows. Set hard limits on retrieved Episode length.
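For the fourth pitfall, the hard limit can be enforced before retrieved Episodes reach the prompt. A minimal sketch, where `count_tokens` is whatever tokenizer the deployment already uses and the 2,000-token budget is an arbitrary illustrative default:

```python
def cap_episode_context(episodes, count_tokens, max_tokens=2000):
    """Keep whole Episodes, newest first, until the token budget is spent."""
    kept, used = [], 0
    for episode in sorted(episodes, key=lambda e: e.end_time, reverse=True):
        cost = sum(count_tokens(turn.content) for turn in episode.turns)
        if used + cost > max_tokens:
            break
        kept.append(episode)
        used += cost
    return kept
```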
Limitations
Computational overhead
The two-tier system requires more processing than single-tier alternatives. Segmentation analysis, sufficiency checking, and reconsolidation all add compute. For high-volume applications, this overhead matters.
Cold start problem
The difficulty a system faces when it has little or no prior data about a user or topic, limiting its ability to generate useful summaries or predictions until enough interactions are collected. Common in recommender systems and now relevant for memory-augmented agents.
HiMem shines with accumulated history. New users or new topics lack the Note layer benefits. Early conversations behave like standard memory systems until Notes accumulate.
Domain sensitivity
The segmentation thresholds work well for general conversation but may need tuning for specialized domains. Technical discussions with frequent topic shifts might over-segment. Emotional conversations might under-segment.
Reconsolidation conflicts
When new evidence contradicts old Notes, the system must decide which is authoritative. The default is "newer wins," but some domains need more nuanced conflict resolution.
Evaluation limitations
The LoCoMo benchmark focuses on English conversational data. Performance on other languages, code-heavy contexts, or highly structured dialogues remains untested.
Authors: Yiming Zhang, Yifan Zeng, Zijian Lu (Zhejiang University); Shaoguang Mao, Furu Wei (Microsoft Research); Yan Xia, Wenshan Wu (Microsoft)
Cite this paper
Yiming Zhang, Yifan Zeng, Zijian Lu, Shaoguang Mao, Yan Xia, Wenshan Wu, Furu Wei (2026). HiMem: Hierarchical Memory That Actually Remembers What Matters. arXiv 2026.