arXiv 2026 · January 5, 2026

SimpleMem: 30x More Efficient Memory for LLM Agents

Jiaqi Liu et al.

SimpleMem addresses the token cost explosion in LLM agents by introducing semantic lossless compression. The three-stage pipeline filters redundant conversation content, consolidates related memories into abstract representations, and dynamically adjusts retrieval depth based on query complexity. On the LoCoMo benchmark, SimpleMem achieves 43.24 F1 with only 531 tokens per query, compared to 34.20 F1 and 973 tokens for Mem0, and 18.70 F1 and 16,910 tokens for full-context approaches.

Categories: AI Agents, Natural Language Processing

Key Findings

  1. Slashes inference costs by 97%: reduces context from ~17,000 tokens to just ~550 tokens per query, enabling 'infinite' conversations over months of interaction

  2. Boosts retrieval accuracy by 26%: outperforms leading memory systems like Mem0 on multi-session conversation benchmarks

  3. Makes small models smart: enables a 1.5B parameter model (running on a laptop) to match GPT-4o in memory tasks, with zero fine-tuning required

  4. Active noise cancellation for text: automatically discards chit-chat and confirmations (up to 50% of tokens) while consolidating facts like human sleep consolidates memories

  5. Solves the 'Last Friday' problem: correctly identifies dates across sessions (knowing 'last Friday' meant 'Dec 27th') even weeks later

  6. 4x faster response times: retrieval latency drops significantly compared to graph-based memory systems, with privacy-ready local storage via LanceDB

TL;DR
  1. The Problem. LLM agents accumulate conversation history that explodes token costs. A 200-turn conversation consumes ~17,000 tokens per query. Most of that content is redundant chit-chat that actually hurts reasoning accuracy.

  2. The Solution. SimpleMem compresses memories through three stages: filter low-value content, consolidate related facts into abstractions, and adapt retrieval depth to query complexity. It's a drop-in runtime layer (no fine-tuning or retraining required) that wraps your existing LLM API calls.

  3. The Results. 30x token reduction (550 vs 17,000 tokens), 26% accuracy improvement over Mem0, and 4x faster processing. Works with any LLM from GPT-4 down to 1.5B parameter models.

Research Overview

Every LLM agent faces the same problem: conversations accumulate, context windows fill up, and costs spiral.

Consider a customer support agent. After 50 interactions with a user, the agent has accumulated hours of conversation history. The naive approach is to stuff everything into the context window. This works until it doesn't. Token costs become prohibitive. Worse, the model's reasoning degrades as it drowns in irrelevant details from conversations that happened weeks ago.

What is an LLM agent?

An LLM agent is an AI system that maintains state across multiple interactions and can take actions. Unlike a simple chatbot that treats each message independently, an agent remembers past conversations, tracks user preferences, and builds on previous context. Examples include personal AI assistants, autonomous coding agents, and customer support systems.

SimpleMem addresses this with a memory framework based on semantic lossless compression. The key insight is that most conversation content is noise. Greetings, confirmations, small talk, and repetitive exchanges consume tokens without adding value. SimpleMem filters this noise at ingestion, consolidates related facts into compact representations, and retrieves only what each query actually needs.

Key results

The paper evaluates SimpleMem on LoCoMo, a benchmark designed for long-term conversational memory with 200-400 turn conversations.

What is LoCoMo?

LoCoMo (Long-Context Memory) is a synthetic benchmark that simulates multi-turn dialogues (200-400 turns) to test how well an LLM agent retains and retrieves information over long conversations. It stresses both token budget and reasoning accuracy, making it a standard yardstick for memory-compression techniques.

| Metric | SimpleMem | Mem0 | Full Context |
| --- | --- | --- | --- |
| Average F1 | 43.24 | 34.20 | 18.70 |
| Token Cost | 531 | 973 | 16,910 |
| Total Time | 480s | 1,934s | - |

What is F1 Score?

F1 combines precision (how many answers were correct) and recall (how many correct answers were found). Think of it as measuring both accuracy and completeness. A score of 43.24 means the system gets roughly 43% of questions fully correct, which is strong performance for open-ended conversational memory tasks.

SimpleMem achieves higher accuracy with fewer tokens. The full-context baseline (stuffing everything into the prompt) performs worst despite using 30x more tokens, confirming that more context is not better context.

What this means for your budget: At GPT-4o pricing ($2.50 per million input tokens), SimpleMem reduces per-query costs from ~$0.042 (17K tokens) to ~$0.0014 (550 tokens). That's a 97% cost reduction. For an agent handling 10,000 queries per day, that's $400 saved daily.
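
The arithmetic is easy to rerun with your own numbers. A quick sketch (the pricing and query volume are the assumptions quoted above; swap in your own model and traffic):

# Back-of-the-envelope cost model; plug in your own pricing and traffic
PRICE_PER_M_INPUT = 2.50        # USD per 1M input tokens (GPT-4o list price assumed above)
QUERIES_PER_DAY = 10_000

def cost_per_query(tokens, price_per_m=PRICE_PER_M_INPUT):
    return tokens / 1_000_000 * price_per_m

full_context = cost_per_query(17_000)    # ~$0.0425 per query
simplemem = cost_per_query(550)          # ~$0.0014 per query

savings = (full_context - simplemem) * QUERIES_PER_DAY
print(f"Daily savings: ${savings:,.0f}")  # ~$411/day at these assumptions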

The Memory Problem

Current memory systems for LLM agents fall into two camps. Neither works well.

Full-context extension keeps everything. Every message, every response, every "sounds good" and "got it" stays in memory. This approach suffers from three problems:

  1. Token costs explode. At $10-15 per million input tokens for frontier models, a 17,000 token context for every query adds up fast.

  2. Accuracy degrades. Research shows that LLMs struggle with "lost in the middle" effects. Important information buried in long contexts gets ignored.

  3. Latency increases. More tokens mean slower inference. Users notice.

Iterative filtering uses the LLM itself to decide what to keep. Systems like MemGPT make multiple inference calls to summarize and prune memories. This reduces storage but creates a new problem: the filtering process itself costs tokens. You're paying to decide what not to pay for.

Why does conversation history balloon?

Most human conversation is not information-dense. We say "hi," ask "how are you," confirm understanding with "got it," and fill silences with small talk. In text, this might be 30-50% of all messages. An agent that stores everything is storing mostly noise.

SimpleMem takes a different approach: filter at ingestion time using cheap heuristics, not expensive LLM calls. The three-stage pipeline processes content once when it arrives, not repeatedly when it's retrieved.

The temporal problem

Long-running agents face a unique challenge: relative time references become ambiguous.

When a user says "last Friday" during a conversation on January 3rd, they mean December 27th. But if the agent retrieves this memory on January 15th, "last Friday" could be interpreted as January 10th. This ambiguity breaks temporal reasoning.

SimpleMem solves this by converting all relative temporal expressions to absolute ISO-8601 timestamps at ingestion time:

// Before (ambiguous)
{"text": "He said he'd finish it by next Friday"}
 
// After (unambiguous)
{
  "text": "Bob committed to finishing report",
  "timestamp": "2026-01-10T17:00:00Z",
  "entities": ["Bob", "report"]
}

"Last Friday" becomes "2025-12-27" and stays correct regardless of when it's retrieved.

Architecture

SimpleMem operates through three stages that progressively refine raw conversation into compact, queryable memory.

SimpleMem Architecture

Three stages compress, consolidate, and retrieve memory efficiently

The design is inspired by Complementary Learning Systems (CLS) theory from cognitive science. The human brain doesn't store every sensory experience verbatim. Instead, it filters important events, consolidates related memories during sleep, and retrieves adaptively based on current needs. SimpleMem applies these principles to LLM memory.

Complementary Learning Systems (CLS) theory

CLS posits that the brain uses two interacting systems: a fast-learning hippocampus for episodic details and a slower neocortical system that extracts general patterns. SimpleMem mirrors this by quickly storing raw facts (Stage 1) then consolidating them into abstractions (Stage 2) during background processing.

How it flows

  1. Input: Raw conversation arrives as message pairs (user input, agent response)

  2. Stage 1 (Compression): Entropy-aware filtering removes low-value content. Remaining messages get decomposed into atomic, context-independent facts with absolute timestamps.

  3. Stage 2 (Consolidation): Related memory units cluster together. Recurring patterns merge into abstract representations ("user drinks coffee every morning" instead of 47 separate coffee orders).

  4. Stage 3 (Retrieval): When a query arrives, the system estimates its complexity and retrieves just enough context. Simple lookups get 3 memories; complex reasoning gets 20.

  5. Output: Compact context (~550 tokens) fed to the LLM for response generation.

Stage 1: Semantic Compression

The first stage tackles context inflation at the source. Most conversations contain substantial "noise": phatic expressions, redundant confirmations, and off-topic tangents that contribute nothing to downstream reasoning.

Think of it like a skilled executive assistant sitting in a meeting. They don't transcribe every word. They write down decisions, action items, and key facts while ignoring the "how was your weekend" chatter and the "sounds good" confirmations. SimpleMem does the same for your agent's conversations.

What is entropy-aware filtering?

Entropy measures information density. A message like "The project deadline is March 15th" has high entropy: it contains a specific, non-obvious fact. A message like "Okay, sounds good!" has low entropy: it confirms something already established without adding new information. SimpleMem calculates an entropy score for each conversation window and only stores high-entropy content.

Imagine a gold prospector shaking a pan of river sediment. The pan swirls, and only the heavy, glittering nuggets settle at the bottom while the light silt washes away. Entropy-aware filtering is that prospector's pan—it lets the low-value chatter (the silt) flow out and captures the dense, novel facts (the nuggets) for the memory bank.

The filtering mechanism

SimpleMem processes conversation in sliding windows of 10 messages. For each window, it calculates an information score based on two signals:

  1. Entity novelty: Does this window mention new named entities (people, places, dates, products) not seen in recent history?

  2. Semantic divergence: Does the embedding of this window differ substantially from recent conversation?

Windows scoring below a threshold (0.35 in the paper's configuration) get discarded entirely. They never enter memory.

Example: Entropy scores in action

For a 10-message window from a support chat, the filter produces these scores:

| Message | Entropy Score | Decision |
| --- | --- | --- |
| "Your order #12345 will arrive on April 3rd." | 0.78 | ✓ Kept |
| "Got it, thanks!" | 0.12 | ✗ Discarded |
| "Can you also send me the invoice?" | 0.65 | ✓ Kept |
| "Sure, one sec." | 0.18 | ✗ Discarded |
| "Here's the link: invoice.pdf" | 0.71 | ✓ Kept |

With the 0.35 threshold, only the three high-entropy turns are stored, while the polite confirmations are dropped entirely—cutting storage by 40% in this example.

Atomic decomposition

Windows that pass the filter undergo transformation. Raw dialogue gets decomposed into atomic, context-independent memory units:

Before (raw dialogue):

User: "He said he'd finish it by next Friday" Agent: "Got it, I'll remind you then"

After (atomic units):

  • "Bob committed to finishing the report by 2026-01-10"
  • "User requested reminder for Bob's report deadline"

The transformation involves three operations (a prompt-based sketch follows this list):

  1. Coreference resolution: "He" becomes "Bob" based on conversation context
  2. Temporal anchoring: "next Friday" becomes "2026-01-10"
  3. Statement extraction: Full sentences replace conversational fragments
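
In practice, all three operations can be folded into a single LLM call at ingestion time. A sketch of that call; the prompt wording, model name, and helper are ours, not the paper's:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat model works

DECOMPOSE_PROMPT = """Rewrite the dialogue below as a list of atomic, self-contained facts.
Resolve pronouns to names, convert relative dates to ISO dates using the message date,
and drop conversational filler. Message date: {date}

Dialogue:
{dialogue}

Return one fact per line."""

def decompose(dialogue, message_date, model="gpt-4.1-mini"):
    """One-shot decomposition of a filtered window into atomic memory units."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": DECOMPOSE_PROMPT.format(
            date=message_date, dialogue=dialogue)}],
    )
    return [line.strip("-• ").strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]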

Triple-layer indexing

Each memory unit gets indexed three ways for flexible retrieval:

| Layer | Purpose | Example Query Match |
| --- | --- | --- |
| Semantic | Dense embedding for fuzzy matching | "coffee preference" matches "morning latte order" |
| Lexical | BM25 for exact keyword matching | "Bob" matches only memories mentioning Bob |
| Symbolic | Structured metadata for filtering | "date > 2026-01-01" filters to recent memories |

What is BM25?

BM25 is a bag-of-words ranking function that scores documents based on term frequency, document length, and inverse document frequency. It excels at exact keyword matches, complementing dense-vector similarity in hybrid retrieval pipelines. When you search for "Bob," BM25 ensures only memories literally containing "Bob" are matched.

This multi-view indexing enables queries that combine conceptual similarity with hard constraints. "What did Bob say last week about the project?" uses all three layers.
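
A toy version of the three layers, using sentence-transformers for the semantic layer, rank_bm25 for the lexical layer, and plain metadata fields for the symbolic layer. The blueprint later in this article uses LanceDB for all of this; the data here is illustrative:

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')

# Two toy memory units with structured metadata
units = [
    {"text": "User regularly orders a latte around 8 AM",
     "entities": ["latte"], "date": "2025-12-20"},
    {"text": "Bob committed to finishing the report by 2026-01-10",
     "entities": ["Bob", "report"], "date": "2026-01-03"},
]

# Semantic layer: dense embeddings for fuzzy matching
embeddings = np.stack([encoder.encode(u["text"]) for u in units])

# Lexical layer: BM25 over tokenized text for exact keyword hits
bm25 = BM25Okapi([u["text"].lower().split() for u in units])

# Symbolic layer: metadata kept as plain fields so hard filters stay cheap
def passes_filters(unit, entity=None, after=None):
    return (entity is None or entity in unit["entities"]) and \
           (after is None or unit["date"] > after)

Answering a query then means embedding the query text, scoring it against the semantic and lexical layers, and keeping only units that pass the symbolic filter (see the hybrid scoring sketch in Stage 3).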

Stage 2: Recursive Consolidation

Even after filtering, long-running agents accumulate thousands of memory units. Stage 2 addresses this through consolidation: merging related memories into higher-level abstractions.

How does biological memory consolidation work?

During sleep, the brain replays recent experiences and integrates them with existing knowledge. Similar experiences merge into generalized patterns. You might remember "I usually have coffee in the morning" without recalling each individual coffee. SimpleMem mimics this with an asynchronous background process that runs periodically.

Clustering by affinity

SimpleMem calculates affinity scores between memory pairs based on:

  1. Semantic similarity: Cosine similarity between embeddings
  2. Temporal proximity: Memories closer in time are more likely related

When a cluster of memories exceeds an affinity threshold (0.85), the system triggers consolidation. The cluster merges into a single abstract representation:

Before (47 individual memories):

  • "User ordered latte at 8:00 AM on Monday"
  • "User ordered latte at 8:15 AM on Tuesday"
  • "User ordered cappuccino at 8:05 AM on Wednesday"
  • ... (44 more coffee orders)

After (1 abstract representation):

  • "User regularly orders coffee (usually latte) in the morning around 8 AM"

Memory Consolidation in Action

47 similar memories merge into 1 abstract pattern

Think of a librarian who, after months of cataloguing individual newspaper clippings about daily coffee orders, creates a single reference volume titled "Morning Coffee Habits." The librarian copies the essential patterns from each clipping, discards the redundant dates, and binds them into one concise chapter. Recursive consolidation does the same—it extracts the recurring theme, archives the raw clippings, and stores the distilled chapter for quick lookup.

The original fine-grained memories get archived. They're still accessible if needed, but the active memory index stays compact. This allows retrieval complexity to scale gracefully with interaction history.
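
The paper reports the 0.85 trigger threshold but not the exact affinity formula. A plausible sketch that blends embedding similarity with temporal proximity; the blend weight and time constant below are assumptions:

import numpy as np

def affinity(a, b, lam=0.7, tau_hours=72.0):
    """Pairwise affinity between two memory units, each a dict with an
    'embedding' (np.ndarray) and a 'timestamp' (epoch seconds)."""
    cos = float(np.dot(a["embedding"], b["embedding"]) /
                (np.linalg.norm(a["embedding"]) * np.linalg.norm(b["embedding"])))
    hours_apart = abs(a["timestamp"] - b["timestamp"]) / 3600.0
    temporal = np.exp(-hours_apart / tau_hours)      # decays as the time gap grows
    return lam * cos + (1 - lam) * temporal

def should_consolidate(cluster, threshold=0.85):
    """Trigger consolidation when every pair in the cluster clears the threshold."""
    return all(affinity(a, b) >= threshold
               for i, a in enumerate(cluster) for b in cluster[i + 1:])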

Why consolidation matters

The ablation study shows consolidation's impact most clearly on multi-hop reasoning:

| Configuration | Multi-hop F1 | Change |
| --- | --- | --- |
| Full SimpleMem | 43.46 | - |
| Without Consolidation | 29.85 | -31.3% |

Multi-hop questions require synthesizing information from multiple disconnected facts. Without consolidation, the retriever must find many fragmented memories and hope the LLM can connect them. With consolidation, related facts are already pre-synthesized into coherent abstractions.

Stage 3: Adaptive Retrieval

Standard retrieval systems fetch a fixed number of results regardless of query complexity. A simple factual lookup ("What's Bob's email?") gets the same 10 results as a complex reasoning question ("Based on Bob's schedule conflicts last month, when should we propose the next meeting?").

SimpleMem introduces query-aware retrieval that adjusts scope dynamically.

Query complexity estimation

A lightweight classifier estimates each query's complexity on a 0-1 scale based on:

  • Query length and syntactic structure
  • Number of entities mentioned
  • Abstraction level (specific fact vs. pattern recognition)

Adaptive depth

The retrieval depth adjusts based on complexity:

| Complexity Score | Retrieval Depth | Use Case |
| --- | --- | --- |
| 0.0 - 0.3 | 3 memories | Direct fact lookup |
| 0.3 - 0.7 | 5-10 memories | Single-step reasoning |
| 0.7 - 1.0 | 15-20 memories | Multi-hop reasoning |

Simple queries get minimal context, saving tokens. Complex queries get expanded context, ensuring accuracy. The system achieves near-optimal performance at k=3 for simple queries while scaling up for harder ones.
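
The band boundaries come from the table above; how k varies inside each band isn't specified in the paper, so the interpolation in this sketch is an assumption:

def adaptive_k(complexity, k_min=3, k_max=20):
    """Map a 0-1 complexity estimate to a retrieval depth."""
    if complexity < 0.3:
        return k_min                                      # direct fact lookup
    if complexity < 0.7:
        return 5 + round((complexity - 0.3) / 0.4 * 5)    # 5-10 memories
    return min(k_max, 15 + round((complexity - 0.7) / 0.3 * 5))  # 15-20 memories

print(adaptive_k(0.2), adaptive_k(0.5), adaptive_k(0.9))  # 3 7 18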

Example: Two queries, different depths

Consider two user questions in the same session:

  1. Simple query: "What is Bob's email?"

    • The classifier sees a short four-word question with a single entity
    • Complexity score: 0.22 → retrieves top 3 memories (~45 tokens)
    • LLM answers: "bob@example.com" ✓
  2. Complex query: "Based on Bob's meetings last week and his travel schedule, when is the earliest slot he can attend the project kickoff?"

    • The classifier detects 3 entities, a 21-word sentence, and temporal reasoning
    • Complexity score: 0.84 → retrieves top 18 memories (~280 tokens)
    • With richer context, LLM correctly replies: "Tuesday 10 AM EST" ✓

The simple query uses 84% fewer tokens than if a fixed k=20 were always applied, while the complex query gets the context it needs.

Hybrid scoring

Retrieval combines all three index layers:

  1. Semantic score: Embedding similarity between query and memory
  2. Lexical score: BM25 keyword matching
  3. Symbolic constraint: Hard filters on metadata (date ranges, entity types)

The final score weights these components, with an indicator function enforcing symbolic constraints as hard requirements. A memory must match entity filters to be retrieved, regardless of semantic similarity.
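
As a formula-in-code, the final score reduces to a weighted blend gated by the symbolic indicator. The weights here are illustrative; the paper doesn't publish its exact values:

import numpy as np

def hybrid_score(q_emb, m_emb, bm25_score, satisfies_constraints,
                 w_sem=0.6, w_lex=0.4):
    """Weighted semantic + lexical score, gated by the symbolic indicator."""
    if not satisfies_constraints:   # hard filter: a failed constraint zeroes the score
        return 0.0
    cosine = float(np.dot(q_emb, m_emb) /
                   (np.linalg.norm(q_emb) * np.linalg.norm(m_emb)))
    return w_sem * cosine + w_lex * bm25_score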

Adaptive Retrieval Paths

Simple queries get minimal context; complex queries get expanded context

Benchmark Results

SimpleMem was evaluated on LoCoMo, a benchmark specifically designed for long-term conversational memory with 200-400 turn conversations across multiple sessions.

Performance vs Token Efficiency

SimpleMem achieves highest accuracy with lowest token cost on LoCoMo benchmark

Performance across model sizes

SimpleMem works across the capability spectrum, from GPT-4 to 1.5B parameter open-source models.

| Model | SimpleMem F1 | Mem0 F1 | Tokens |
| --- | --- | --- | --- |
| GPT-4.1-mini | 43.24 | 34.20 | 531 |
| GPT-4o | 39.06 | 36.09 | 550 |
| Qwen3-8B | 33.45 | 25.80 | 621 |
| Qwen2.5-1.5B | 25.23 | 23.77 | 678 |

SimpleMem outperforms Mem0 at every model size. Notably, a 1.5B model with SimpleMem roughly matches an 8B model running Mem0 (25.23 vs 25.80 F1), so smaller models recover much of the accuracy that a weaker memory system gives away. This makes SimpleMem particularly valuable for edge deployment where model size is constrained.

Performance by task type

Component Impact (Ablation Study)

Removing each stage shows its contribution to different task types

SimpleMem improves every task category, with its highest absolute scores on temporal reasoning and its largest gain over Mem0 on multi-hop reasoning:

| Task Type | SimpleMem | Mem0 | Gap |
| --- | --- | --- | --- |
| Temporal | 58.62 | 48.91 | +9.71 |
| SingleHop | 51.12 | 41.30 | +9.82 |
| MultiHop | 43.46 | 30.14 | +13.32 |
| OpenDomain | 19.76 | 16.43 | +3.33 |

The temporal reasoning advantage comes directly from Stage 1's timestamp normalization. Converting "last Friday" to absolute dates at ingestion time makes temporal queries unambiguous regardless of when they're asked.

Efficiency comparison

Beyond accuracy, SimpleMem delivers substantial efficiency gains:

| System | Construction Time | Retrieval Time | Total Time |
| --- | --- | --- | --- |
| SimpleMem | 92.6s | 388.3s | 480.9s |
| Mem0 | 1,350.9s | 583.4s | 1,934.3s |
| A-Mem | 5,140.5s | 796.7s | 5,937.2s |

SimpleMem is 4x faster than Mem0 and 12x faster than A-Mem. The speed comes from single-pass processing at ingestion rather than iterative LLM calls for filtering and summarization.

Practical Applications

SimpleMem addresses real bottlenecks in production agent systems.

Customer support agents

Support agents need to remember customer history across sessions. Prior purchases, past issues, stated preferences. Traditional approaches either forget context between sessions or bloat prompts with irrelevant history.

With SimpleMem:

  • Customer preferences consolidate into compact profiles
  • Issue history stays accessible without token overhead
  • Temporal context ("the order I placed last month") resolves correctly

Personal AI assistants

Personal assistants that span months of interaction accumulate substantial context. Calendar events, preferences, relationships, ongoing projects.

SimpleMem enables:

  • Long-term preference learning without context window limits
  • Project continuity across sessions separated by weeks
  • Accurate temporal reasoning ("remind me about what we discussed before my vacation")

Autonomous coding agents

Coding agents maintain context about project structure, past decisions, and ongoing tasks. This context is critical for coherent multi-file changes but expensive to maintain.

SimpleMem allows:

  • Project knowledge persists across sessions
  • Decision rationale stays accessible without repeating explanations
  • Context-aware suggestions based on established patterns

Multi-agent systems

Complex systems with multiple specialized agents need shared memory. SimpleMem's compact representation reduces the cost of memory synchronization between agents.

Implementation Blueprint

Here's how to integrate SimpleMem into an existing agent system.

Tech stack

The paper's codebase reveals the exact tools that produced the benchmark results.

| Component | Recommended |
| --- | --- |
| Embeddings | text-embedding-3-small |
| Database | LanceDB |
| Lexical | BM25 (built-in) |
| Metadata | SQLite |
| LLM | GPT-4.1-mini |

Alternatives: Cohere embed-v3 for embeddings, Qdrant/Milvus for vectors, Elasticsearch for lexical, PostgreSQL for metadata, Claude/Qwen for LLM.

Why LanceDB?

SimpleMem's multi-view indexing requires hybrid search across dense vectors, sparse lexical features, and structured metadata. LanceDB supports all three in a single system with SQL-like filtering. This eliminates the need to manage separate vector and relational databases, reducing infrastructure complexity and Total Cost of Ownership. One database instead of three.

Key parameters

These values produced the benchmark results. Start here and tune for your use case; a config sketch follows the table.

| Parameter | Value | Purpose |
| --- | --- | --- |
| Window size | 10 messages | Sliding window for filtering |
| Entropy threshold | 0.35 | Minimum score to store |
| Cluster threshold | 0.85 | Affinity to trigger consolidation |
| k_min | 3 | Retrieval depth for simple queries |
| k_max | 20 | Retrieval depth for complex queries |
| Embedding dimensions | 1,536 | text-embedding-3-small output |
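
Collecting these in one configuration object keeps tuning explicit. A minimal sketch; the field names are ours, not the paper's:

from dataclasses import dataclass

@dataclass(frozen=True)
class SimpleMemConfig:
    window_size: int = 10            # messages per sliding window
    entropy_threshold: float = 0.35  # minimum info score to store
    cluster_threshold: float = 0.85  # affinity needed to trigger consolidation
    k_min: int = 3                   # retrieval depth for simple queries
    k_max: int = 20                  # retrieval depth for complex queries
    embedding_dims: int = 1536       # text-embedding-3-small output size

DEFAULTS = SimpleMemConfig()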

Core workflow

Step 1: Ingestion pipeline

Message Pair → Sliding Window → Entropy Filter
                                     ↓
                              [Pass / Discard]
                                     ↓
                            Atomic Decomposition
                                     ↓
                            Triple-Layer Index

Process each conversation turn through the pipeline. Messages that pass entropy filtering get decomposed into atomic units with resolved coreferences and absolute timestamps.

Code example: Entropy-based filtering

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def extract_new_entities(window, history):
    """Cheap stand-in for NER: capitalized tokens not already seen in history."""
    seen = set(history.split())
    return {tok for tok in window.split() if tok[:1].isupper() and tok not in seen}

def calculate_info_score(window, history, alpha=0.5):
    """Information score for a conversation window:
    a blend of entity novelty and semantic divergence."""
    # Entity novelty: new entities / total tokens
    new_entities = extract_new_entities(window, history)
    entity_score = len(new_entities) / max(len(window.split()), 1)

    # Semantic divergence from recent history
    w_emb = model.encode(window)
    h_emb = model.encode(history[-500:])  # Last 500 chars
    cos_sim = np.dot(w_emb, h_emb) / (
        np.linalg.norm(w_emb) * np.linalg.norm(h_emb) + 1e-8
    )
    divergence = 1 - cos_sim

    return alpha * entity_score + (1 - alpha) * divergence

# Filter: only store if score >= threshold
# (decompose() and store_memory() come from the rest of the ingestion pipeline)
if calculate_info_score(window, history) >= 0.35:
    store_memory(decompose(window))

Step 2: Background consolidation

Run consolidation periodically (hourly or daily) as a background job; a minimal sketch follows this list:

  1. Compute pairwise affinity scores for recent memories
  2. Identify clusters exceeding the threshold
  3. Generate abstract representations via LLM summarization
  4. Archive originals, index abstractions
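
A scheduling sketch of those four steps; `store` and `llm` stand for your own memory store and chat client, and their method names here are hypothetical:

import time
import threading

def start_consolidation_job(store, llm, interval_hours=1.0, threshold=0.85):
    """Run the consolidation steps on a timer in a daemon thread."""
    def run_once():
        for cluster in store.find_clusters(min_affinity=threshold):   # hypothetical API
            summary = llm.summarize([m.text for m in cluster])        # hypothetical API
            store.archive(cluster)                                    # originals stay retrievable
            store.index_abstraction(summary, sources=[m.id for m in cluster])

    def loop():
        while True:
            run_once()
            time.sleep(interval_hours * 3600)

    threading.Thread(target=loop, daemon=True).start()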

Step 3: Query-time retrieval

Query → Complexity Estimation → Dynamic k
              ↓
       Hybrid Scoring (semantic + lexical + symbolic)
              ↓
       Top-k Retrieval → Context Assembly
              ↓
       LLM Response Generation

Common pitfalls

Entropy threshold too high: If you're discarding too much, important information gets lost. Start conservative (0.25) and increase only if storage becomes problematic.

No temporal normalization: Skipping timestamp conversion seems like an optimization but breaks temporal reasoning. Always convert relative expressions at ingestion.

Fixed retrieval depth: The easy path is picking a single k value. This wastes tokens on simple queries and starves complex ones. Implement adaptive depth even if the classifier is simple.

Consolidation too aggressive: Over-consolidating loses detail. The paper uses 0.85 affinity threshold, meaning memories must be highly related before merging. Don't lower this without careful evaluation.

Production hardening

The paper's benchmark results come from controlled experiments. Production systems need additional safeguards.

Hot/cold storage pattern

Consolidation runs asynchronously as a background job. If a user states a fact and immediately asks about it, the system might miss it if it's waiting to be consolidated.

Solution: Implement a Redis "lookaside buffer" for the 10-20 most recent turns. The retriever should query both the long-term LanceDB store and the immediate Redis buffer. This ensures fresh facts are never missed.

Query → [Redis Buffer] + [LanceDB Long-term]
              ↓                    ↓
         Merge Results → Deduplicate → Top-k
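
A sketch of the lookaside buffer with redis-py; the key scheme, `query_lancedb`, and `deduplicate` are placeholders for your own retrieval path:

import json
import redis

r = redis.Redis(decode_responses=True)
BUFFER_KEY = "simplemem:recent:{user_id}"
BUFFER_SIZE = 20

def push_recent_turn(user_id, turn):
    """Keep the last N raw turns per user in a capped Redis list."""
    key = BUFFER_KEY.format(user_id=user_id)
    r.lpush(key, json.dumps(turn))
    r.ltrim(key, 0, BUFFER_SIZE - 1)

def retrieve_with_buffer(user_id, query, k):
    """Merge the long-term store with the buffer so fresh facts are never missed."""
    recent = [json.loads(t) for t in r.lrange(BUFFER_KEY.format(user_id=user_id), 0, -1)]
    long_term = query_lancedb(user_id, query, k=k)   # your existing retrieval path
    return deduplicate(recent + long_term)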

PII redaction before ingestion

Once data is embedded into vectors, removing specific PII (like a credit card number accidentally mentioned) becomes technically difficult. Vectors encode semantic meaning, not raw text, making surgical deletion nearly impossible.

Solution: Insert a PII scrubbing step before Stage 1 using tools like Microsoft Presidio or regex patterns. Store "User shared [PHONE_NUMBER]" rather than the raw data.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def scrub_pii(text):
    """Replace detected PII spans (phones, emails, cards, ...) with placeholders."""
    results = analyzer.analyze(text=text, language='en')
    # anonymize() returns an EngineResult; .text holds the scrubbed string
    return anonymizer.anonymize(text=text, analyzer_results=results).text

# Run before ingestion
clean_message = scrub_pii(raw_message)

Code block bypass

Code blocks have high entropy (many unique tokens) but compress poorly. You cannot summarize a function without breaking it. The consolidation stage may mangle syntax if it tries to abstract code.

Solution: Add a heuristic to detect code blocks (triple backticks, consistent indentation, syntax patterns). Store code snippets raw in a separate collection or blob store, linked by reference.

import re

def is_code_block(text):
    patterns = [
        r'```[\s\S]*?```',           # Markdown code blocks
        r'^\s{4,}[\w\(\)]+',         # Indented code (needs MULTILINE to match any line)
        r'def\s+\w+\s*\(',           # Python functions
        r'function\s+\w+\s*\(',      # JS functions
    ]
    return any(re.search(p, text, re.MULTILINE) for p in patterns)

if is_code_block(window):
    store_raw_code(window)  # Skip compression, store verbatim
else:
    store_memory(decompose(window))

Discard rate monitoring

The entropy filter is powerful but dangerous. If it breaks, your agent becomes amnesiac. Model drift or data distribution changes can cause the filter to over-discard or under-filter.

Solution: Track the percentage of messages discarded as a metric. Alert if this spikes above baseline (greater than 60%) or drops to near zero (below 5%).

from prometheus_client import Gauge

discard_rate = Gauge('simplemem_discard_rate', 'Percentage of messages filtered')

def monitor_filter(passed, total):
    if total == 0:
        return  # nothing ingested yet
    rate = (total - passed) / total * 100
    discard_rate.set(rate)

    # alert() stands in for your paging / alerting hook
    if rate > 60:
        alert("High discard rate: filter may be too aggressive")
    elif rate < 5:
        alert("Low discard rate: filter may be broken")

Entropy threshold tuning

The paper uses 0.35. This magic number works for general conversation but may fail for domain-specific applications. A legal bot where every word matters needs a lower threshold. A casual chat bot might tolerate higher.

Solution: Run an A/B test on thresholds. Test 0.25, 0.35, and 0.45 on a sample dataset. Manually inspect 100 discarded messages from each threshold to verify no valuable information was lost.

| Domain | Suggested Starting Threshold |
| --- | --- |
| Legal/Medical | 0.20 - 0.25 |
| Customer Support | 0.30 - 0.35 |
| Casual Chat | 0.40 - 0.50 |

Complexity classifier feedback loop

The classifier decides if a query needs 3 or 20 memories. If it's wrong, the user gets a bad answer but you have no signal to improve.

Solution: Implement a "Retry with More Context" button or detect regeneration requests. Log these as potential classifier failures and use those samples to fine-tune.

def handle_regenerate(query_id):
    original = get_query_context(query_id)
 
    # Log as classifier failure
    log_classifier_miss(
        query=original.query,
        predicted_complexity=original.complexity,
        k_used=original.k
    )
 
    # Retry with maximum context
    return retrieve_and_respond(original.query, k=20)

Golden query testing

How do you know if your memory system is working after a deployment? Silent failures are the worst kind.

Solution: Create a "Golden Dataset" of 50 known facts (e.g., "My favorite color is blue"). Run a daily CI/CD job that ingests these facts into a test user and queries them. If F1 score drops below threshold, the deployment broke something.

# golden_queries.yaml
- fact: "User's favorite color is blue"
  query: "What is my favorite color?"
  expected: "blue"
 
- fact: "User has a dog named Max"
  query: "What is my pet's name?"
  expected: "Max"

Cost inversion warning

For Product Managers

SimpleMem shifts cost from inference (reading tokens) to ingestion (processing tokens). You pay to process every user message through the compression pipeline upfront to save money on retrieval. The ROI is positive only for long-lived agents with conversations exceeding 50 turns. For short interactions, the ingestion overhead may outweigh savings.

Model the break-even point for your use case:

| Avg Conversation Length | SimpleMem ROI |
| --- | --- |
| < 20 turns | Negative (overhead exceeds savings) |
| 20-50 turns | Break-even |
| 50-100 turns | 2-3x positive |
| > 100 turns | 5-10x positive |

Migration strategy for existing data

Companies have terabytes of existing chat logs. Processing them all at once through the compression pipeline is expensive and slow.

Solution: Use "lazy migration." Only run the expensive compression pipeline on a user's history when that user logs in, rather than batch-processing the entire inactive database.

async def get_user_memory(user_id):
    if not is_migrated(user_id):
        # First login since SimpleMem deployment:
        # run the compression pipeline over this user's legacy history once
        raw_history = load_legacy_history(user_id)
        await migrate_to_simplemem(raw_history)
        mark_migrated(user_id)

    return query_simplemem(user_id)

GDPR "right to forget" compliance

Users have a right to be forgotten. In a consolidated memory system, one user fact might be merged into a summary with others. Deleting the original message doesn't remove it from the abstraction.

Solution: Maintain a "source ID" link between abstract summaries and original raw messages. If a user deletes a message, flag derived summaries for re-computation.

from dataclasses import dataclass

@dataclass
class ConsolidatedMemory:
    id: str
    abstract_text: str
    source_message_ids: list[str]  # Track provenance

def handle_deletion_request(message_id):
    # Find all summaries that used this message
    affected = find_summaries_by_source(message_id)

    for summary in affected:
        # Remove the source, then regenerate or delete the summary
        summary.source_message_ids.remove(message_id)
        if len(summary.source_message_ids) == 0:
            delete_summary(summary.id)
        else:
            regenerate_summary(summary.id)

Limitations

SimpleMem has constraints worth understanding before adoption.

Lossy by design

Entropy filtering discards content permanently. If the filter incorrectly classifies something as low-value, that information is gone. The paper's 0.35 threshold was tuned for the LoCoMo benchmark. Other domains (legal, medical) may need lower thresholds to preserve more context.

Consolidation latency

Abstract representations generate via LLM calls during background consolidation. Until consolidation runs, recent memories remain fragmented. Real-time applications may need to trigger consolidation more frequently or maintain separate fast-access and consolidated stores.

Query complexity estimation

The lightweight classifier for query complexity adds a prediction step that can fail. Misclassifying a complex query as simple leads to under-retrieval and poor answers. The paper doesn't detail the classifier architecture or failure modes.

Benchmark scope

LoCoMo tests conversational memory specifically. Performance on other agent tasks (tool use, code generation, web browsing) remains unvalidated. The memory patterns in those domains may differ substantially.

Single-user assumption

The architecture assumes memories belong to a single user context. Multi-user or multi-tenant scenarios need additional isolation mechanisms not addressed in the paper.

Cold start period

The consolidation stage needs a critical mass of memories before it produces useful abstractions. In the first few days of deployment, the system operates without consolidated patterns. Expect the full accuracy benefits to emerge after a week or more of active usage. Plan your pilot accordingly: test with synthetic history or accept that Day 1 performance won't match the benchmarks.


Paper: arXiv:2601.02553
Authors: Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao (AIMING Lab)
Code: github.com/aiming-lab/SimpleMem
Original paper: arXiv | PDF | HTML

Authors

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao (AIMING Lab)

Cite this paper

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao (2026). SimpleMem: 30x More Efficient Memory for LLM Agents. arXiv 2026.
