- The Problem. LLM agents accumulate conversation history that explodes token costs. A 200-turn conversation consumes ~17,000 tokens per query, and most of that content is redundant chit-chat that actually hurts reasoning accuracy.
- The Solution. SimpleMem compresses memories through three stages: filter low-value content, consolidate related facts into abstractions, and adapt retrieval depth to query complexity. It's a drop-in runtime layer (no fine-tuning or retraining required) that wraps your existing LLM API calls.
- The Results. 30x token reduction (550 vs 17,000 tokens), 26% accuracy improvement over Mem0, and 4x faster processing. Works with any LLM from GPT-4 down to 1.5B-parameter models.
Research Overview
Every LLM agent faces the same problem: conversations accumulate, context windows fill up, and costs spiral.
Consider a customer support agent. After 50 interactions with a user, the agent has accumulated hours of conversation history. The naive approach is to stuff everything into the context window. This works until it doesn't. Token costs become prohibitive. Worse, the model's reasoning degrades as it drowns in irrelevant details from conversations that happened weeks ago.
An LLM agent is an AI system that maintains state across multiple interactions and can take actions. Unlike a simple chatbot that treats each message independently, an agent remembers past conversations, tracks user preferences, and builds on previous context. Examples include personal AI assistants, autonomous coding agents, and customer support systems.
SimpleMem addresses this with a memory framework based on semantic lossless compression. The key insight is that most conversation content is noise. Greetings, confirmations, small talk, and repetitive exchanges consume tokens without adding value. SimpleMem filters this noise at ingestion, consolidates related facts into compact representations, and retrieves only what each query actually needs.
Key results
The paper evaluates SimpleMem on LoCoMo, a benchmark designed for long-term conversational memory with 200-400 turn conversations.
LoCoMo is a benchmark of synthetic multi-session dialogues (200-400 turns) designed to test how well an LLM agent retains and retrieves information over long conversations. It stresses both token budget and reasoning accuracy, making it a standard yardstick for memory-compression techniques.
| Metric | SimpleMem | Mem0 | Full Context |
|---|---|---|---|
| Average F1 | 43.24 | 34.20 | 18.70 |
| Token Cost | 531 | 973 | 16,910 |
| Total Time | 480s | 1,934s | - |
F1 combines precision (how many answers were correct) and recall (how many correct answers were found). Think of it as measuring both accuracy and completeness. A score of 43.24 means the system gets roughly 43% of questions fully correct, which is strong performance for open-ended conversational memory tasks.
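Formally, F1 is the harmonic mean of the two: F1 = 2 · precision · recall / (precision + recall).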
SimpleMem achieves higher accuracy with fewer tokens. The full-context baseline (stuffing everything into the prompt) performs worst despite using 30x more tokens, confirming that more context is not better context.
What this means for your budget: At GPT-4o pricing ($2.50 per million input tokens), SimpleMem reduces per-query costs from ~$0.042 (17K tokens) to ~$0.0014 (550 tokens). That's a 97% cost reduction. For an agent handling 10,000 queries per day, that's $400 saved daily.
The Memory Problem
Current memory systems for LLM agents fall into two camps. Neither works well.
Full-context extension keeps everything. Every message, every response, every "sounds good" and "got it" stays in memory. This approach suffers from three problems:
- Token costs explode. At $10-15 per million input tokens for frontier models, a 17,000-token context for every query adds up fast.
- Accuracy degrades. Research shows that LLMs struggle with "lost in the middle" effects: important information buried in long contexts gets ignored.
- Latency increases. More tokens mean slower inference, and users notice.
Iterative filtering uses the LLM itself to decide what to keep. Systems like MemGPT make multiple inference calls to summarize and prune memories. This reduces storage but creates a new problem: the filtering process itself costs tokens. You're paying to decide what not to pay for.
Most human conversation is not information-dense. We say "hi," ask "how are you," confirm understanding with "got it," and fill silences with small talk. In text, this might be 30-50% of all messages. An agent that stores everything is storing mostly noise.
SimpleMem takes a different approach: filter at ingestion time using cheap heuristics, not expensive LLM calls. The three-stage pipeline processes content once when it arrives, not repeatedly when it's retrieved.
The temporal problem
Long-running agents face a unique challenge: relative time references become ambiguous.
When a user says "last Friday" during a conversation on January 3rd, they mean December 27th. But if the agent retrieves this memory on January 15th, "last Friday" could be interpreted as January 10th. This ambiguity breaks temporal reasoning.
SimpleMem solves this by converting all relative temporal expressions to absolute ISO-8601 timestamps at ingestion time:
// Before (ambiguous)
{"text": "He said he'd finish it by next Friday"}
// After (unambiguous)
{
  "text": "Bob committed to finishing report",
  "timestamp": "2026-01-10T17:00:00Z",
  "entities": ["Bob", "report"]
}
"Last Friday" becomes a fixed ISO date at ingestion and stays correct regardless of when the memory is retrieved.
Architecture
SimpleMem operates through three stages that progressively refine raw conversation into compact, queryable memory.
[Figure: SimpleMem Architecture. Three stages compress, consolidate, and retrieve memory efficiently.]
The design is inspired by Complementary Learning Systems (CLS) theory from cognitive science. The human brain doesn't store every sensory experience verbatim. Instead, it filters important events, consolidates related memories during sleep, and retrieves adaptively based on current needs. SimpleMem applies these principles to LLM memory.
CLS posits that the brain uses two interacting systems: a fast-learning hippocampus for episodic details and a slower neocortical system that extracts general patterns. SimpleMem mirrors this by quickly storing raw facts (Stage 1) then consolidating them into abstractions (Stage 2) during background processing.
How it flows
- Input: Raw conversation arrives as message pairs (user input, agent response)
- Stage 1 (Compression): Entropy-aware filtering removes low-value content. Remaining messages get decomposed into atomic, context-independent facts with absolute timestamps.
- Stage 2 (Consolidation): Related memory units cluster together. Recurring patterns merge into abstract representations ("user drinks coffee every morning" instead of 47 separate coffee orders).
- Stage 3 (Retrieval): When a query arrives, the system estimates its complexity and retrieves just enough context. Simple lookups get 3 memories; complex reasoning gets 20.
- Output: Compact context (~550 tokens) fed to the LLM for response generation.
Stage 1: Semantic Compression
The first stage tackles context inflation at the source. Most conversations contain substantial "noise": phatic expressions, redundant confirmations, and off-topic tangents that contribute nothing to downstream reasoning.
Think of it like a skilled executive assistant sitting in a meeting. They don't transcribe every word. They write down decisions, action items, and key facts while ignoring the "how was your weekend" chatter and the "sounds good" confirmations. SimpleMem does the same for your agent's conversations.
Entropy measures information density. A message like "The project deadline is March 15th" has high entropy: it contains a specific, non-obvious fact. A message like "Okay, sounds good!" has low entropy: it confirms something already established without adding new information. SimpleMem calculates an entropy score for each conversation window and only stores high-entropy content.
Imagine a gold prospector shaking a pan of river sediment. The pan swirls, and only the heavy, glittering nuggets settle at the bottom while the light silt washes away. Entropy-aware filtering is that prospector's pan—it lets the low-value chatter (the silt) flow out and captures the dense, novel facts (the nuggets) for the memory bank.
The filtering mechanism
SimpleMem processes conversation in sliding windows of 10 messages. For each window, it calculates an information score based on two signals:
- Entity novelty: Does this window mention new named entities (people, places, dates, products) not seen in recent history?
- Semantic divergence: Does the embedding of this window differ substantially from the recent conversation?
Windows scoring below a threshold (0.35 in the paper's configuration) get discarded entirely. They never enter memory.
Example: Entropy scores in action
For a 10-message window from a support chat, the filter produces these scores:
| Message | Entropy Score | Decision |
|---|---|---|
| "Your order #12345 will arrive on April 3rd." | 0.78 | ✓ Kept |
| "Got it, thanks!" | 0.12 | ✗ Discarded |
| "Can you also send me the invoice?" | 0.65 | ✓ Kept |
| "Sure, one sec." | 0.18 | ✗ Discarded |
| "Here's the link: invoice.pdf" | 0.71 | ✓ Kept |
With the 0.35 threshold, only the three high-entropy turns are stored, while the polite confirmations are dropped entirely—cutting storage by 40% in this example.
Atomic decomposition
Windows that pass the filter undergo transformation. Raw dialogue gets decomposed into atomic, context-independent memory units:
Before (raw dialogue):
User: "He said he'd finish it by next Friday" Agent: "Got it, I'll remind you then"
After (atomic units):
- "Bob committed to finishing the report by 2026-01-10"
- "User requested reminder for Bob's report deadline"
The transformation involves three operations:
- Coreference resolution: "He" becomes "Bob" based on conversation context
- Temporal anchoring: "next Friday" becomes "2026-01-10"
- Statement extraction: Full sentences replace conversational fragments
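The paper describes these operations but leaves the prompt engineering to the implementer. Below is a sketch of a single-call, prompt-based decomposer; the prompt wording and the decompose() helper name are illustrative, not taken from the paper (gpt-4.1-mini is the model used in the paper's experiments).

```python
from openai import OpenAI

client = OpenAI()

DECOMPOSE_PROMPT = """Rewrite the dialogue below as a list of atomic, self-contained facts.
- Replace pronouns with the named entities they refer to.
- Convert relative time expressions to absolute ISO-8601 dates (message sent: {sent_at}).
- One fact per line; omit greetings and confirmations.

Dialogue:
{dialogue}"""

def decompose(dialogue: str, sent_at: str) -> list[str]:
    """Turn a raw dialogue window into context-independent memory units."""
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user",
                   "content": DECOMPOSE_PROMPT.format(sent_at=sent_at, dialogue=dialogue)}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("- ").strip() for line in lines if line.strip()]
```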
Triple-layer indexing
Each memory unit gets indexed three ways for flexible retrieval:
| Layer | Purpose | Example Query Match |
|---|---|---|
| Semantic | Dense embedding for fuzzy matching | "coffee preference" matches "morning latte order" |
| Lexical | BM25 for exact keyword matching | "Bob" matches only memories mentioning Bob |
| Symbolic | Structured metadata for filtering | "date > 2026-01-01" filters to recent memories |
BM25 is a bag-of-words ranking function that scores documents based on term frequency, document length, and inverse document frequency. It excels at exact keyword matches, complementing dense-vector similarity in hybrid retrieval pipelines. When you search for "Bob," BM25 ensures only memories literally containing "Bob" are matched.
This multi-view indexing enables queries that combine conceptual similarity with hard constraints. "What did Bob say last week about the project?" uses all three layers.
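Concretely, one stored unit has to carry all three views. The field names below are illustrative assumptions, not the paper's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MemoryUnit:
    """One atomic fact, indexed three ways."""
    text: str                                   # lexical layer: raw text for BM25
    embedding: list[float]                      # semantic layer: dense vector
    timestamp: datetime                         # symbolic layer: structured metadata
    entities: list[str] = field(default_factory=list)

def passes_symbolic_filters(unit: MemoryUnit, after: datetime, entity: str | None) -> bool:
    """Symbolic layer: hard constraints applied before any similarity ranking."""
    if unit.timestamp < after:
        return False
    if entity is not None and entity not in unit.entities:
        return False
    return True
```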
Stage 2: Recursive Consolidation
Even after filtering, long-running agents accumulate thousands of memory units. Stage 2 addresses this through consolidation: merging related memories into higher-level abstractions.
During sleep, the brain replays recent experiences and integrates them with existing knowledge. Similar experiences merge into generalized patterns. You might remember "I usually have coffee in the morning" without recalling each individual coffee. SimpleMem mimics this with an asynchronous background process that runs periodically.
Clustering by affinity
SimpleMem calculates affinity scores between memory pairs based on:
- Semantic similarity: Cosine similarity between embeddings
- Temporal proximity: Memories closer in time are more likely related
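A minimal sketch of such a pairwise affinity score follows; the exponential time decay, the 72-hour half-life, and the 0.7 weighting are assumptions, since the paper only names the two signals.

```python
import math
import numpy as np

def affinity(emb_a, emb_b, t_a, t_b, half_life_hours=72.0, beta=0.7):
    """Pairwise affinity: semantic similarity blended with temporal proximity."""
    cos = float(np.dot(emb_a, emb_b) /
                (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    hours_apart = abs((t_a - t_b).total_seconds()) / 3600.0
    temporal = math.exp(-hours_apart / half_life_hours)   # 1.0 when simultaneous, decays with distance
    return beta * cos + (1 - beta) * temporal
```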
When a cluster of memories exceeds an affinity threshold (0.85), the system triggers consolidation. The cluster merges into a single abstract representation:
Before (47 individual memories):
- "User ordered latte at 8:00 AM on Monday"
- "User ordered latte at 8:15 AM on Tuesday"
- "User ordered cappuccino at 8:05 AM on Wednesday"
- ... (44 more coffee orders)
After (1 abstract representation):
- "User regularly orders coffee (usually latte) in the morning around 8 AM"
[Figure: Memory Consolidation in Action. 47 similar memories merge into 1 abstract pattern.]
Think of a librarian who, after months of cataloguing individual newspaper clippings about daily coffee orders, creates a single reference volume titled "Morning Coffee Habits." The librarian copies the essential patterns from each clipping, discards the redundant dates, and binds them into one concise chapter. Recursive consolidation does the same—it extracts the recurring theme, archives the raw clippings, and stores the distilled chapter for quick lookup.
The original fine-grained memories get archived. They're still accessible if needed, but the active memory index stays compact. This allows retrieval complexity to scale gracefully with interaction history.
Why consolidation matters
The ablation study shows consolidation's impact most clearly on multi-hop reasoning:
| Configuration | Multi-hop F1 | Change |
|---|---|---|
| Full SimpleMem | 43.46 | - |
| Without Consolidation | 29.85 | -31.3% |
Multi-hop questions require synthesizing information from multiple disconnected facts. Without consolidation, the retriever must find many fragmented memories and hope the LLM can connect them. With consolidation, related facts are already pre-synthesized into coherent abstractions.
Stage 3: Adaptive Retrieval
Standard retrieval systems fetch a fixed number of results regardless of query complexity. A simple factual lookup ("What's Bob's email?") gets the same 10 results as a complex reasoning question ("Based on Bob's schedule conflicts last month, when should we propose the next meeting?").
SimpleMem introduces query-aware retrieval that adjusts scope dynamically.
Query complexity estimation
A lightweight classifier estimates each query's complexity on a 0-1 scale based on:
- Query length and syntactic structure
- Number of entities mentioned
- Abstraction level (specific fact vs. pattern recognition)
Adaptive depth
The retrieval depth adjusts based on complexity:
| Complexity Score | Retrieval Depth | Use Case |
|---|---|---|
| 0.0 - 0.3 | 3 memories | Direct fact lookup |
| 0.3 - 0.7 | 5-10 memories | Single-step reasoning |
| 0.7 - 1.0 | 15-20 memories | Multi-hop reasoning |
Simple queries get minimal context, saving tokens. Complex queries get expanded context, ensuring accuracy. The system achieves near-optimal performance at k=3 for simple queries while scaling up for harder ones.
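A sketch of the complexity-to-depth mapping: the band edges come from the table above, while the linear interpolation inside each band is an assumption.

```python
def retrieval_depth(complexity: float) -> int:
    """Map a 0-1 complexity estimate to the number of memories to retrieve."""
    if complexity < 0.3:
        return 3                                         # direct fact lookup
    if complexity < 0.7:
        return round(5 + (complexity - 0.3) / 0.4 * 5)   # 5-10 memories
    return round(15 + (complexity - 0.7) / 0.3 * 5)      # 15-20 memories
```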
Example: Two queries, different depths
Consider two user questions in the same session:
- Simple query: "What is Bob's email?"
  - The classifier sees a short four-word question with a single entity
  - Complexity score: 0.22 → retrieves top 3 memories (~45 tokens)
  - LLM answers: "bob@example.com" ✓
- Complex query: "Based on Bob's meetings last week and his travel schedule, when is the earliest slot he can attend the project kickoff?"
  - The classifier detects three entities, a long multi-clause sentence, and temporal reasoning
  - Complexity score: 0.84 → retrieves top 18 memories (~280 tokens)
  - With richer context, the LLM correctly replies: "Tuesday 10 AM EST" ✓
The simple query uses 84% fewer tokens than if a fixed k=20 were always applied, while the complex query gets the context it needs.
Hybrid scoring
Retrieval combines all three index layers:
- Semantic score: Embedding similarity between query and memory
- Lexical score: BM25 keyword matching
- Symbolic constraint: Hard filters on metadata (date ranges, entity types)
The final score weights these components, with an indicator function enforcing symbolic constraints as hard requirements. A memory must match entity filters to be retrieved, regardless of semantic similarity.
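As a sketch, the combination might look like the function below; the 0.6/0.4 weights are illustrative, and only the indicator-style handling of symbolic constraints is taken from the description above.

```python
def hybrid_score(semantic: float, lexical: float, satisfies_constraints: bool,
                 w_sem: float = 0.6, w_lex: float = 0.4) -> float:
    """Blend the three index layers into one retrieval score."""
    if not satisfies_constraints:
        return 0.0                  # hard symbolic filter: entity/date mismatch disqualifies
    return w_sem * semantic + w_lex * lexical
```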
[Figure: Adaptive Retrieval Paths. Simple queries get minimal context; complex queries get expanded context.]
Benchmark Results
SimpleMem was evaluated on LoCoMo, a benchmark specifically designed for long-term conversational memory with 200-400 turn conversations across multiple sessions.
[Figure: Performance vs Token Efficiency. SimpleMem achieves the highest accuracy with the lowest token cost on the LoCoMo benchmark.]
Performance across model sizes
SimpleMem works across the capability spectrum, from GPT-4 to 1.5B parameter open-source models.
| Model | SimpleMem F1 | Mem0 F1 | Tokens |
|---|---|---|---|
| GPT-4.1-mini | 43.24 | 34.20 | 531 |
| GPT-4o | 39.06 | 36.09 | 550 |
| Qwen3-8B | 33.45 | 25.80 | 621 |
| Qwen2.5-1.5B | 25.23 | 23.77 | 678 |
The performance gap is largest on smaller models. A 1.5B model with SimpleMem approaches the accuracy of much larger models using inferior memory systems. This makes SimpleMem particularly valuable for edge deployment where model size is constrained.
Performance by task type
[Figure: Component Impact (Ablation Study). Removing each stage shows its contribution to different task types.]
SimpleMem shows balanced improvements across all task categories, with particular strength in temporal reasoning:
| Task Type | SimpleMem | Mem0 | Gap |
|---|---|---|---|
| Temporal | 58.62 | 48.91 | +9.71 |
| SingleHop | 51.12 | 41.30 | +9.82 |
| MultiHop | 43.46 | 30.14 | +13.32 |
| OpenDomain | 19.76 | 16.43 | +3.33 |
The temporal reasoning advantage comes directly from Stage 1's timestamp normalization. Converting "last Friday" to absolute dates at ingestion time makes temporal queries unambiguous regardless of when they're asked.
Efficiency comparison
Beyond accuracy, SimpleMem delivers substantial efficiency gains:
| System | Construction Time | Retrieval Time | Total Time |
|---|---|---|---|
| SimpleMem | 92.6s | 388.3s | 480.9s |
| Mem0 | 1,350.9s | 583.4s | 1,934.3s |
| A-Mem | 5,140.5s | 796.7s | 5,937.2s |
SimpleMem is 4x faster than Mem0 and 12x faster than A-Mem. The speed comes from single-pass processing at ingestion rather than iterative LLM calls for filtering and summarization.
Practical Applications
SimpleMem addresses real bottlenecks in production agent systems.
Customer support agents
Support agents need to remember customer history across sessions. Prior purchases, past issues, stated preferences. Traditional approaches either forget context between sessions or bloat prompts with irrelevant history.
With SimpleMem:
- Customer preferences consolidate into compact profiles
- Issue history stays accessible without token overhead
- Temporal context ("the order I placed last month") resolves correctly
Personal AI assistants
Personal assistants that span months of interaction accumulate substantial context. Calendar events, preferences, relationships, ongoing projects.
SimpleMem enables:
- Long-term preference learning without context window limits
- Project continuity across sessions separated by weeks
- Accurate temporal reasoning ("remind me about what we discussed before my vacation")
Autonomous coding agents
Coding agents maintain context about project structure, past decisions, and ongoing tasks. This context is critical for coherent multi-file changes but expensive to maintain.
SimpleMem allows:
- Project knowledge persists across sessions
- Decision rationale stays accessible without repeating explanations
- Context-aware suggestions based on established patterns
Multi-agent systems
Complex systems with multiple specialized agents need shared memory. SimpleMem's compact representation reduces the cost of memory synchronization between agents.
Implementation Blueprint
Here's how to integrate SimpleMem into an existing agent system.
Tech stack
The paper's codebase reveals the exact tools that produced the benchmark results.
| Component | Recommended |
|---|---|
| Embeddings | text-embedding-3-small |
| Database | LanceDB |
| Lexical | BM25 (built-in) |
| Metadata | SQLite |
| LLM | GPT-4.1-mini |
Alternatives: Cohere embed-v3 for embeddings, Qdrant/Milvus for vectors, Elasticsearch for lexical, PostgreSQL for metadata, Claude/Qwen for LLM.
SimpleMem's multi-view indexing requires hybrid search across dense vectors, sparse lexical features, and structured metadata. LanceDB supports all three in a single system with SQL-like filtering. This eliminates the need to manage separate vector and relational databases, reducing infrastructure complexity and Total Cost of Ownership. One database instead of three.
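A sketch of that single-store setup; embed() stands in for a call to the embedding model, and exact method names may differ across LanceDB versions.

```python
import lancedb

db = lancedb.connect("./simplemem_db")

# Each row carries the dense vector plus the metadata used for symbolic filtering
table = db.create_table("memories", data=[{
    "vector": embed("User regularly orders a latte around 8 AM"),
    "text": "User regularly orders a latte around 8 AM",
    "entities": ["latte"],
    "timestamp": "2026-01-05T08:00:00Z",
}])

# Semantic search with a hard symbolic constraint pushed down as a SQL-like filter
hits = (
    table.search(embed("coffee preference"))
         .where("timestamp > '2026-01-01'")
         .limit(5)
         .to_list()
)
```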
Key parameters
These values produced the benchmark results. Start here and tune for your use case.
| Parameter | Value | Purpose |
|---|---|---|
| Window size | 10 messages | Sliding window for filtering |
| Entropy threshold | 0.35 | Minimum score to store |
| Cluster threshold | 0.85 | Affinity to trigger consolidation |
| k_min | 3 | Retrieval depth for simple queries |
| k_max | 20 | Retrieval depth for complex queries |
| Embedding dimensions | 1,536 | text-embedding-3-small output |
Core workflow
Step 1: Ingestion pipeline
Message Pair → Sliding Window → Entropy Filter
↓
[Pass / Discard]
↓
Atomic Decomposition
↓
Triple-Layer Index
Process each conversation turn through the pipeline. Messages that pass entropy filtering get decomposed into atomic units with resolved coreferences and absolute timestamps.
Code example: Entropy-based filtering
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_info_score(window, history, alpha=0.5):
    """Calculate the information score for a conversation window."""
    # Entity novelty: new named entities per token in the window
    new_entities = extract_new_entities(window, history)  # helper: NER against recent history
    entity_score = len(new_entities) / len(window.split())

    # Semantic divergence from recent history
    w_emb = model.encode(window)
    h_emb = model.encode(history[-500:])  # last 500 characters
    cos_sim = np.dot(w_emb, h_emb) / (
        np.linalg.norm(w_emb) * np.linalg.norm(h_emb)
    )
    divergence = 1 - cos_sim

    return alpha * entity_score + (1 - alpha) * divergence

# Filter: only store if score >= threshold
if calculate_info_score(window, history) >= 0.35:
    store_memory(decompose(window))

Step 2: Background consolidation
Run consolidation periodically (hourly or daily) as a background job:
- Compute pairwise affinity scores for recent memories
- Identify clusters exceeding the threshold
- Generate abstract representations via LLM summarization
- Archive originals, index abstractions
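In code, the background pass might look like the sketch below; cluster_by_affinity, summarize_cluster, archive, and index_abstract are assumed helpers, and the minimum cluster size of 3 is an arbitrary guard, not a value from the paper.

```python
def consolidation_job(memories, affinity_threshold=0.85):
    """Periodic background pass: merge tight clusters into LLM-written abstractions."""
    clusters = cluster_by_affinity(memories, threshold=affinity_threshold)
    for cluster in clusters:
        if len(cluster) < 3:                    # too small to be a meaningful pattern
            continue
        abstract = summarize_cluster(cluster)   # one LLM summarization call per cluster
        archive(cluster)                        # originals stay retrievable, just not indexed
        index_abstract(abstract, sources=[m.id for m in cluster])
```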
Step 3: Query-time retrieval
Query → Complexity Estimation → Dynamic k
↓
Hybrid Scoring (semantic + lexical + symbolic)
↓
Top-k Retrieval → Context Assembly
↓
LLM Response Generation
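Pulled together, the query path is a few lines of orchestration; every helper name here is an assumption standing in for the components sketched earlier.

```python
def answer(query: str) -> str:
    """End-to-end query path: estimate complexity, retrieve adaptively, generate."""
    complexity = estimate_complexity(query)           # lightweight classifier
    k = retrieval_depth(complexity)                   # dynamic k from the mapping above
    memories = hybrid_search(query, k=k)              # semantic + lexical + symbolic
    context = "\n".join(m.text for m in memories)     # compact context, ~550 tokens on average
    return llm_generate(f"Context:\n{context}\n\nQuestion: {query}")
```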
Common pitfalls
Entropy threshold too high: If you're discarding too much, important information gets lost. Start conservative (0.25) and increase only if storage becomes problematic.
No temporal normalization: Skipping timestamp conversion seems like an optimization but breaks temporal reasoning. Always convert relative expressions at ingestion.
Fixed retrieval depth: The easy path is picking a single k value. This wastes tokens on simple queries and starves complex ones. Implement adaptive depth even if the classifier is simple.
Consolidation too aggressive: Over-consolidating loses detail. The paper uses 0.85 affinity threshold, meaning memories must be highly related before merging. Don't lower this without careful evaluation.
Production hardening
The paper's benchmark results come from controlled experiments. Production systems need additional safeguards.
Hot/cold storage pattern
Consolidation runs asynchronously as a background job. If a user states a fact and immediately asks about it, the system might miss it if it's waiting to be consolidated.
Solution: Implement a Redis "lookaside buffer" for the 10-20 most recent turns. The retriever should query both the long-term LanceDB store and the immediate Redis buffer. This ensures fresh facts are never missed.
Query → [Redis Buffer] + [LanceDB Long-term]
↓ ↓
Merge Results → Deduplicate → Top-k
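A sketch of the merged lookup; redis_recent_turns and lancedb_search are assumed wrappers around the two stores.

```python
def retrieve_with_buffer(query_embedding, k: int):
    """Merge the Redis lookaside buffer with the LanceDB long-term store."""
    fresh = redis_recent_turns(limit=20)                 # last 10-20 raw turns, not yet compressed
    long_term = lancedb_search(query_embedding, k=k)     # consolidated long-term memories
    merged = {m["text"]: m for m in long_term + fresh}   # deduplicate on text
    return list(merged.values())[:k]
```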
PII redaction before ingestion
Once data is embedded into vectors, removing specific PII (like a credit card number accidentally mentioned) becomes technically difficult. Vectors encode semantic meaning, not raw text, making surgical deletion nearly impossible.
Solution: Insert a PII scrubbing step before Stage 1 using tools like Microsoft Presidio or regex patterns. Store "User shared [PHONE_NUMBER]" rather than the raw data.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
def scrub_pii(text):
    results = analyzer.analyze(text=text, language='en')
    return anonymizer.anonymize(text=text, analyzer_results=results).text

# Run before ingestion
clean_message = scrub_pii(raw_message)

Code block bypass
Code blocks have high entropy (many unique tokens) but compress poorly. You cannot summarize a function without breaking it. The consolidation stage may mangle syntax if it tries to abstract code.
Solution: Add a heuristic to detect code blocks (triple backticks, consistent indentation, syntax patterns). Store code snippets raw in a separate collection or blob store, linked by reference.
import re
def is_code_block(text):
    patterns = [
        r'```[\s\S]*?```',        # Markdown code blocks
        r'^\s{4,}[\w\(\)]+',      # Indented code (needs MULTILINE to match mid-text)
        r'def\s+\w+\s*\(',        # Python functions
        r'function\s+\w+\s*\(',   # JS functions
    ]
    return any(re.search(p, text, re.MULTILINE) for p in patterns)

if is_code_block(window):
    store_raw_code(window)  # Skip compression, store verbatim
else:
    store_memory(decompose(window))

Discard rate monitoring
The entropy filter is powerful but dangerous. If it breaks, your agent becomes amnesiac. Model drift or data distribution changes can cause the filter to over-discard or under-filter.
Solution: Track the percentage of messages discarded as a metric. Alert if this spikes above baseline (greater than 60%) or drops to near zero (below 5%).
from prometheus_client import Gauge
discard_rate = Gauge('simplemem_discard_rate', 'Percentage of messages filtered')
def monitor_filter(passed, total):
    rate = (total - passed) / total * 100
    discard_rate.set(rate)
    if rate > 60:
        alert("High discard rate: filter may be too aggressive")
    elif rate < 5:
        alert("Low discard rate: filter may be broken")

Entropy threshold tuning
The paper uses 0.35. This magic number works for general conversation but may fail for domain-specific applications. A legal bot where every word matters needs a lower threshold. A casual chat bot might tolerate higher.
Solution: Run an A/B test on thresholds. Test 0.25, 0.35, and 0.45 on a sample dataset. Manually inspect 100 discarded messages from each threshold to verify no valuable information was lost.
| Domain | Suggested Starting Threshold |
|---|---|
| Legal/Medical | 0.20 - 0.25 |
| Customer Support | 0.30 - 0.35 |
| Casual Chat | 0.40 - 0.50 |
Complexity classifier feedback loop
The classifier decides if a query needs 3 or 20 memories. If it's wrong, the user gets a bad answer but you have no signal to improve.
Solution: Implement a "Retry with More Context" button or detect regeneration requests. Log these as potential classifier failures and use those samples to fine-tune.
def handle_regenerate(query_id):
    original = get_query_context(query_id)
    # Log as classifier failure
    log_classifier_miss(
        query=original.query,
        predicted_complexity=original.complexity,
        k_used=original.k
    )
    # Retry with maximum context
    return retrieve_and_respond(original.query, k=20)

Golden query testing
How do you know if your memory system is working after a deployment? Silent failures are the worst kind.
Solution: Create a "Golden Dataset" of 50 known facts (e.g., "My favorite color is blue"). Run a daily CI/CD job that ingests these facts into a test user and queries them. If F1 score drops below threshold, the deployment broke something.
# golden_queries.yaml
- fact: "User's favorite color is blue"
query: "What is my favorite color?"
expected: "blue"
- fact: "User has a dog named Max"
query: "What is my pet's name?"
expected: "Max"Cost inversion warning
SimpleMem shifts cost from inference (reading tokens) to ingestion (processing tokens). You pay to process every user message through the compression pipeline upfront to save money on retrieval. The ROI is positive only for long-lived agents with conversations exceeding 50 turns. For short interactions, the ingestion overhead may outweigh savings.
Model the break-even point for your use case:
| Avg Conversation Length | SimpleMem ROI |
|---|---|
| < 20 turns | Negative (overhead exceeds savings) |
| 20-50 turns | Break-even |
| 50-100 turns | 2-3x positive |
| > 100 turns | 5-10x positive |
Migration strategy for existing data
Companies have terabytes of existing chat logs. Processing them all at once through the compression pipeline is expensive and slow.
Solution: Use "lazy migration." Only run the expensive compression pipeline on a user's history when that user logs in, rather than batch-processing the entire inactive database.
async def get_user_memory(user_id):
    if not is_migrated(user_id):
        # First login since SimpleMem deployment:
        # run the compression pipeline on this user's legacy history
        raw_history = load_legacy_history(user_id)
        await migrate_to_simplemem(raw_history)
        mark_migrated(user_id)
    return query_simplemem(user_id)

GDPR "right to forget" compliance
Users have a right to be forgotten. In a consolidated memory system, one user fact might be merged into a summary with others. Deleting the original message doesn't remove it from the abstraction.
Solution: Maintain a "source ID" link between abstract summaries and original raw messages. If a user deletes a message, flag derived summaries for re-computation.
from dataclasses import dataclass

@dataclass
class ConsolidatedMemory:
    id: str
    abstract_text: str
    source_message_ids: list[str]  # Track provenance

def handle_deletion_request(message_id):
    # Find all summaries that used this message
    affected = find_summaries_by_source(message_id)
    for summary in affected:
        # Remove the source, then regenerate or delete the summary
        summary.source_message_ids.remove(message_id)
        if len(summary.source_message_ids) == 0:
            delete_summary(summary.id)
        else:
            regenerate_summary(summary.id)

Limitations
SimpleMem has constraints worth understanding before adoption.
Lossy by design
Entropy filtering discards content permanently. If the filter incorrectly classifies something as low-value, that information is gone. The paper's 0.35 threshold was tuned for the LoCoMo benchmark. Other domains (legal, medical) may need lower thresholds to preserve more context.
Consolidation latency
Abstract representations generate via LLM calls during background consolidation. Until consolidation runs, recent memories remain fragmented. Real-time applications may need to trigger consolidation more frequently or maintain separate fast-access and consolidated stores.
Query complexity estimation
The lightweight classifier for query complexity adds a prediction step that can fail. Misclassifying a complex query as simple leads to under-retrieval and poor answers. The paper doesn't detail the classifier architecture or failure modes.
Benchmark scope
LoCoMo tests conversational memory specifically. Performance on other agent tasks (tool use, code generation, web browsing) remains unvalidated. The memory patterns in those domains may differ substantially.
Single-user assumption
The architecture assumes memories belong to a single user context. Multi-user or multi-tenant scenarios need additional isolation mechanisms not addressed in the paper.
Cold start period
The consolidation stage needs a critical mass of memories before it produces useful abstractions. In the first few days of deployment, the system operates without consolidated patterns. Expect the full accuracy benefits to emerge after a week or more of active usage. Plan your pilot accordingly: test with synthetic history or accept that Day 1 performance won't match the benchmarks.
Paper: arXiv:2601.02553
Authors: Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao (AIMING Lab)
Code: github.com/aiming-lab/SimpleMem
Cite this paper
Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao (2026). SimpleMem: 30x More Efficient Memory for LLM Agents. arXiv preprint arXiv:2601.02553.