
SimpleMem: Efficient Lifelong Memory for LLM Agents

Jiaqi Liu    Yaofeng Su    Peng Xia    Siwei Han    Zeyu Zheng    Cihang Xie    Mingyu Ding    Huaxiu Yao
Abstract

To support reliable long-term interaction in complex environments, LLM agents require memory systems that efficiently manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which applies entropy-aware filtering to distill unstructured interactions into compact, multi-view indexed memory units; (2) Recursive Memory Consolidation, an asynchronous process that integrates related units into higher-level abstract representations to reduce redundancy; and (3) Adaptive Query-Aware Retrieval, which dynamically adjusts retrieval scope based on query complexity to construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% while reducing inference-time token consumption by up to 30×, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming-lab/SimpleMem.


1 Introduction

Large Language Model (LLM) agents have recently demonstrated remarkable capabilities across a wide range of tasks (Xia et al., 2025; Team et al., 2025; Qiu et al., 2025). However, constrained by fixed context windows, existing agents exhibit significant limitations when engaging in long-context and multi-turn interaction scenarios (Liu et al., 2023; Wang et al., 2024a; Liu et al., 2025; Hu et al., 2025; Tu et al., 2025). To facilitate reliable long-term interaction, LLM agents require robust memory systems to efficiently manage and utilize historical experience (Dev & Taranjeet, 2024; Fang et al., 2025; Wang & Chen, 2025; Tang et al., 2025; Yang et al., 2025; Ouyang et al., 2025).

While recent research has extensively explored the design of memory modules for LLM agents, current systems still suffer from suboptimal retrieval efficiency and low token utilization (Fang et al., 2025; Hu et al., 2025). On one hand, many existing systems maintain complete interaction histories through full-context extension (Li et al., 2025; Zhong et al., 2024). However, this approach introduces substantial redundant information (Hu et al., 2025). Specifically, during long-horizon interactions, user inputs and model responses accumulate substantial low-entropy noise (e.g., repetitive logs, non-task-oriented dialogue), which degrades the effective information density of the memory buffer. This redundancy adversely affects memory retrieval and downstream reasoning, often leading to middle-context degradation phenomena (Liu et al., 2023), while also incurring significant computational overhead during retrieval and secondary inference. On the other hand, some agentic frameworks mitigate noise through online filtering based on iterative reasoning procedures (Yan et al., 2025; Packer et al., 2023). Although such approaches improve retrieval relevance, they rely on repeated inference cycles, resulting in substantial computational cost, including increased latency and token usage. As a result, neither paradigm achieves efficient allocation of memory and computation resources.

Figure 1: Performance vs. Efficiency Trade-off. Comparison of Average F1 against Average Token Cost on the LoCoMo benchmark. SimpleMem occupies the ideal top-left position, achieving high accuracy with minimal token consumption (∼550 tokens).

To address these limitations, we introduce SimpleMem, an efficient memory framework inspired by the Complementary Learning Systems (CLS) theory (Kumaran et al., 2016) and designed around structured semantic compression. The core objective of SimpleMem is to improve information efficiency under fixed context and token budgets. To this end, we develop a three-stage pipeline that supports dynamic memory compression, organization, and adaptive retrieval: (1) Semantic Structured Compression: we apply an entropy-aware filtering mechanism that preserves information with high semantic utility while discarding redundant or low-value content. The retained information is reformulated into compact memory units and jointly indexed using dense semantic embeddings, sparse lexical features, and symbolic metadata, enabling multi-granular retrieval. (2) Recursive Memory Consolidation: Inspired by biological consolidation, we introduce an asynchronous process that incrementally reorganizes stored memory. Rather than accumulating episodic records verbatim, related memory units are recursively integrated into higher-level abstract representations, allowing repetitive or structurally similar experiences to be summarized while reducing semantic redundancy. (3) Adaptive Query-Aware Retrieval: we employ a query-aware retrieval strategy that dynamically adjusts retrieval scope based on estimated query complexity. Irrelevant candidates are pruned through lightweight symbolic and semantic constraints, enabling precise context construction tailored to task requirements. This adaptive mechanism achieves a favorable trade-off between reasoning performance and token efficiency.

Our primary contribution is SimpleMem, an efficient memory framework grounded in structured semantic compression, which improves information efficiency through principled memory organization, consolidation, and adaptive retrieval. As shown in Figure 1, our empirical experiments demonstrate that SimpleMem establishes a new state of the art, outperforming strong baselines such as Mem0 by 26.4% in average F1, while reducing inference token consumption by 30× compared to full-context models.

2 The SimpleMem Architecture

In this section, we present SimpleMem, an efficient memory framework for LLM agents designed to improve information utilization under constrained context and token budgets. As shown in Figure 2, the system operates through a three-stage pipeline. First, we describe the Semantic Structured Compression process, which filters redundant interaction content and reformulates raw dialogue streams into compact memory units. Next, we describe Recursive Consolidation, an asynchronous process that incrementally integrates related memory units into higher-level abstract representations while maintaining a compact memory topology. Finally, we present Adaptive Query-Aware Retrieval, which dynamically adjusts retrieval scope based on estimated query complexity to construct precise and token-efficient contexts for downstream reasoning.

Figure 2: The SimpleMem Architecture. SimpleMem mitigates context inflation through three stages. (1) Semantic Structured Compression filters redundant interaction content and reformulates raw dialogue into compact, context-independent memory units. (2) Recursive Consolidation incrementally organizes related memory units into higher-level abstract representations, reducing redundancy in long-term memory. (3) Adaptive Query-Aware Retrieval dynamically adjusts retrieval scope based on query complexity, enabling efficient context construction under constrained token budgets.

2.1 Semantic Structured Compression

A primary bottleneck in long-term interaction is context inflation, the accumulation of raw, low-entropy dialogue. For example, a large portion of real-world interaction segments consists of phatic chit-chat or redundant confirmations, which contribute little to downstream reasoning but consume substantial context capacity. To address this, we introduce a mechanism to actively filter and restructure information at the source.

First, incoming dialogue is segmented into overlapping sliding windows $W_t$ of fixed length, where each window represents a short contiguous span of recent interaction. These windows serve as the basic units for evaluating whether new information should be stored. We then employ a non-linear gating mechanism, $\Phi_{\text{gate}}$, to evaluate the information density of each dialogue window and determine which windows should be indexed. For each window $W_t$, we compute an information score $H(W_t)$ that jointly captures the introduction of new entities and semantic novelty relative to the immediate interaction history $H_{\text{prev}}$.

Formally, let $\mathcal{E}_{\text{new}}$ denote the set of named entities that appear in $W_t$ but not in $H_{\text{prev}}$. The information score is defined as:

H(W_t) = \alpha \cdot \frac{|\mathcal{E}_{\text{new}}|}{|W_t|} + (1-\alpha) \cdot \bigl(1 - \cos(E(W_t), E(H_{\text{prev}}))\bigr) \quad (1)

where $E(\cdot)$ denotes a semantic embedding function and $\alpha$ controls the relative importance of entity-level novelty and semantic divergence.

Windows whose information score falls below the threshold $\tau_{\text{redundant}}$ are treated as redundant and excluded from memory construction, meaning that the window is neither stored nor further processed, preventing low-utility interaction content from entering the memory buffer. For informative windows, the system proceeds to a segmentation step:

\text{Action}(W_t) = \begin{cases} \text{Segment}(W_t), & H(W_t) \geq \tau_{\text{redundant}}, \\ \varnothing, & \text{otherwise}. \end{cases} \quad (2)
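To make the gating step concrete, the following is a minimal Python sketch of Eqs. (1)-(2). The embed and extract_entities callables are placeholders for whatever embedding model and named-entity recognizer an implementation uses, and the default weight alpha = 0.5 is an illustrative assumption; only the threshold value 0.35 follows the configuration reported in Section 3.1.

# Minimal sketch of the entropy-aware gate (Eqs. 1-2). The `embed` and
# `extract_entities` callables are placeholders; alpha=0.5 is an assumed weight.
import numpy as np

def information_score(window_text, history_text, embed, extract_entities, alpha=0.5):
    # Eq. (1): combine entity-level novelty with semantic divergence from history.
    new_entities = extract_entities(window_text) - extract_entities(history_text)
    tokens = window_text.split()
    entity_novelty = len(new_entities) / max(len(tokens), 1)

    v_win, v_hist = np.asarray(embed(window_text)), np.asarray(embed(history_text))
    cos_sim = float(np.dot(v_win, v_hist) /
                    (np.linalg.norm(v_win) * np.linalg.norm(v_hist) + 1e-8))
    return alpha * entity_novelty + (1 - alpha) * (1.0 - cos_sim)

def gate(window_text, history_text, embed, extract_entities, tau_redundant=0.35):
    # Eq. (2): keep the window for segmentation only if its score clears the threshold.
    score = information_score(window_text, history_text, embed, extract_entities)
    return "segment" if score >= tau_redundant else None  # None = discard the window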

For windows that pass the filter, we apply a segmentation function $\mathcal{F}_{\theta}$ to decompose each informative window into a set of context-independent memory units $\{m_k\}$. This transformation resolves dependencies implicit in conversational flow by converting entangled dialogue into self-contained factual or event-level statements. Formally, $\mathcal{F}_{\theta}$ is composed of an extraction module ($\Phi_{\text{extract}}$), a coreference resolution module ($\Phi_{\text{coref}}$), and a temporal anchoring module ($\Phi_{\text{time}}$):

m_k = \mathcal{F}_{\theta}(W_t) = \Phi_{\text{time}} \circ \Phi_{\text{coref}} \circ \Phi_{\text{extract}}(W_t) \quad (3)

Here, $\Phi_{\text{extract}}$ identifies candidate factual statements, $\Phi_{\text{coref}}$ replaces ambiguous pronouns with specific entity names (e.g., changing "He agreed" to "Bob agreed"), and $\Phi_{\text{time}}$ converts relative temporal expressions (e.g., transforming "next Friday" to "2025-10-24") into absolute ISO-8601 timestamps. This normalization ensures that each memory unit remains interpretable and valid independent of its original conversational context.
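A minimal sketch of how the composition in Eq. (3) can be wired together is shown below. In the paper these stages are realized with prompts (Appendix A); here each stage is replaced by a hypothetical rule-based stand-in purely to illustrate the data flow and the normalization of the "next Friday" example.

# Illustrative composition F_theta = Phi_time ∘ Phi_coref ∘ Phi_extract (Eq. 3).
# Each phi_* below is a hypothetical stand-in; the paper drives these stages with prompts.
from datetime import date, timedelta

def phi_extract(window_text):
    # Treat each sentence as a candidate factual statement.
    return [s.strip() for s in window_text.split(".") if s.strip()]

def phi_coref(statements, entity_map):
    # Replace ambiguous pronouns with concrete entity names, e.g. "He" -> "Bob".
    resolved = []
    for s in statements:
        for pronoun, entity in entity_map.items():
            s = s.replace(pronoun, entity)
        resolved.append(s)
    return resolved

def phi_time(statements, reference_date):
    # Anchor one relative expression ("next Friday") to an absolute ISO-8601 date.
    days_ahead = (4 - reference_date.weekday()) % 7 or 7   # 4 = Friday
    next_friday = reference_date + timedelta(days=days_ahead)
    return [s.replace("next Friday", next_friday.isoformat()) for s in statements]

def f_theta(window_text, entity_map, reference_date):
    # Decompose an informative window into context-independent memory units m_k.
    return phi_time(phi_coref(phi_extract(window_text), entity_map), reference_date)

# f_theta("He agreed to meet next Friday", {"He": "Bob"}, date(2025, 10, 17))
# -> ["Bob agreed to meet 2025-10-24"]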

2.2 Structured Indexing and Recursive Consolidation

The system then needs to organize the resulting memory units to support efficient long-term storage and scalable retrieval. This stage consists of two components: (i) structured multi-view indexing for immediate access, and (ii) recursive consolidation for reducing redundancy and maintaining a compact memory topology over time.

To support flexible and precise retrieval, each memory unit is indexed through three complementary representations. First, at the Semantic Layer, we map the entry to a dense vector $\mathbf{v}_k$ using embedding models, which captures abstract meaning and enables fuzzy matching (e.g., retrieving "latte" when querying "hot drink"). Second, the Lexical Layer generates a sparse representation focusing on exact keyword matches and proper nouns, ensuring that specific entities are not diluted in vector space. Third, the Symbolic Layer extracts structured metadata, such as timestamps and entity types, to enable deterministic filtering logic. Formally, these projections form the comprehensive memory bank $\mathbb{M}$:

\mathbb{M}(m_k) = \begin{cases} \mathbf{v}_k = E_{\text{dense}}(S_k) \in \mathbb{R}^{d} & \text{(Semantic Layer)} \\ \mathbf{h}_k = \text{Sparse}(S_k) \in \mathbb{R}^{|V|} & \text{(Lexical Layer)} \\ \mathcal{R}_k = \{(\text{key}, \text{val})\} & \text{(Symbolic Layer)} \end{cases} \quad (4)

This multi-view design allows the system to flexibly query information based on conceptual similarity, exact keyword matches, or structured metadata constraints.
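The sketch below illustrates one way the tri-layer projection in Eq. (4) could be materialized. The generic embed callable and the bag-of-words counter are stand-ins for the dense embedding model and BM25 index named in Section 3.1, and the field names are ours.

# Sketch of the tri-layer index in Eq. (4): semantic, lexical, and symbolic views.
# `embed` stands in for a dense embedding model; Counter stands in for a BM25 index.
from collections import Counter

def build_index_entry(memory_unit, embed, timestamp=None, entity_type=None):
    # Project one memory unit m_k into the three complementary representations.
    return {
        "semantic": embed(memory_unit),                    # dense vector v_k
        "lexical": Counter(memory_unit.lower().split()),   # sparse term weights h_k
        "symbolic": {                                      # metadata tuples R_k
            "timestamp": timestamp,
            "entity_type": entity_type,
            "text": memory_unit,
        },
    }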

While multi-view indexing supports efficient access, naively accumulating memory units over long interaction horizons leads to redundancy and fragmentation. To address this issue, we introduce an asynchronous background consolidation process that incrementally reorganizes the memory topology. The consolidation mechanism identifies related memory units based on both semantic similarity and temporal proximity. For two memory units $m_i$ and $m_j$, we define an affinity score $\omega_{ij}$ as:

\omega_{ij} = \beta \cdot \cos(\mathbf{v}_i, \mathbf{v}_j) + (1-\beta) \cdot e^{-\lambda |t_i - t_j|}, \quad (5)

where the first term captures semantic relatedness and the second term biases the model toward grouping events with strong temporal proximity.

When a group of memory units forms a dense cluster $\mathcal{C}$, determined by pairwise affinities exceeding a threshold $\tau_{\text{cluster}}$, the system performs a consolidation step:

M_{\text{abs}} = \mathcal{G}_{\text{syn}}(\{m_i \mid m_i \in \mathcal{C}\}). \quad (6)

This operation synthesizes repetitive or closely related memory units into a higher-level abstract representation $M_{\text{abs}}$, which captures their shared semantic structure. For example, instead of maintaining numerous individual records such as "the user ordered a latte at 8:00 AM," the system consolidates them into a single abstract pattern, e.g., "the user regularly drinks coffee in the morning." The original fine-grained entries are archived, reducing the active memory size while preserving the ability to recover detailed information if needed. As a result, the active memory index remains compact, and retrieval complexity scales gracefully with long-term interaction history.
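As a rough illustration, the sketch below computes the affinity of Eq. (5) and triggers the synthesis step of Eq. (6) only when every pairwise affinity in a candidate cluster exceeds the threshold. The synthesis operator g_syn (an LLM summarization call in practice) and the weights beta and lam are assumptions; tau_cluster = 0.85 follows Section 3.1.

# Sketch of Eqs. (5)-(6). `g_syn` is a placeholder for the synthesis operator G_syn;
# beta and lam are illustrative weights, tau_cluster follows Sec. 3.1.
import numpy as np

def affinity(v_i, v_j, t_i, t_j, beta=0.5, lam=0.1):
    # Eq. (5): semantic similarity blended with an exponential temporal-proximity bias.
    cos_sim = float(np.dot(v_i, v_j) /
                    (np.linalg.norm(v_i) * np.linalg.norm(v_j) + 1e-8))
    return beta * cos_sim + (1 - beta) * float(np.exp(-lam * abs(t_i - t_j)))

def maybe_consolidate(units, g_syn, tau_cluster=0.85, beta=0.5, lam=0.1):
    # Eq. (6): if all pairwise affinities exceed tau_cluster, replace the fine-grained
    # units with a single higher-level abstract representation M_abs.
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            a = affinity(units[i]["vec"], units[j]["vec"],
                         units[i]["time"], units[j]["time"], beta, lam)
            if a < tau_cluster:
                return None  # the cluster is not dense enough; keep units as-is
    return g_syn(units)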

2.3 Adaptive Query-Aware Retrieval

After memory entries are organized, another challenge is retrieving relevant information efficiently under constrained context budgets. Standard retrieval approaches typically fetch a fixed number of context entries, which often results in either insufficient information or token wastage. To address this, we introduce an adaptive query-aware retrieval mechanism that dynamically adjusts retrieval scope based on estimated query complexity, thereby improving retrieval efficiency without sacrificing reasoning accuracy.

First, we propose a hybrid scoring function for information retrieval, $\mathcal{S}(q, m_k)$, which aggregates signals from the tri-layer index established in the second stage. For a given query $q$, the relevance score is computed as:

\mathcal{S}(q, m_k) = \lambda_1 \cos(\mathbf{e}_q, \mathbf{v}_k) + \lambda_2\,\text{BM25}(q_{\text{lex}}, S_k) + \gamma\,\mathbb{I}(\mathcal{R}_k \models \mathcal{C}_{\text{meta}}), \quad (7)

where the first term measures semantic similarity in the dense embedding space, the second term captures exact lexical relevance, and the indicator function $\mathbb{I}(\cdot)$ enforces hard symbolic constraints such as entity-based filters.
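A minimal sketch of the scoring rule in Eq. (7) follows; sparse_score stands in for a BM25 implementation, and the weights lam1, lam2, and gamma are illustrative values rather than the paper's tuned settings.

# Sketch of the hybrid relevance score in Eq. (7). `sparse_score` stands in for BM25;
# the weights lam1, lam2, gamma are illustrative, not the paper's tuned values.
import numpy as np

def hybrid_score(query_vec, entry, sparse_score, meta_constraints,
                 lam1=0.6, lam2=0.3, gamma=0.5):
    # Dense semantic similarity between the query and the entry's vector v_k.
    v_k = np.asarray(entry["semantic"])
    q = np.asarray(query_vec)
    dense = float(np.dot(q, v_k) / (np.linalg.norm(q) * np.linalg.norm(v_k) + 1e-8))
    # Exact lexical relevance (e.g. a BM25 score over the entry text S_k).
    lexical = sparse_score(entry)
    # Hard symbolic constraints: 1 if the entry's metadata satisfies every filter.
    satisfies = all(entry["symbolic"].get(k) == v for k, v in meta_constraints.items())
    return lam1 * dense + lam2 * lexical + gamma * float(satisfies)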

Then, based on the hybrid scoring, we can rank the candidate memories by relevance. However, retrieving a fixed number of top-ranked entries remains inefficient when query demands vary. To address this, we estimate the query complexity $C_q \in [0, 1]$, which reflects whether a query can be resolved via direct fact lookup or requires multi-step reasoning over multiple memory entries. A lightweight classifier predicts $C_q$ based on query features such as length, syntactic structure, and abstraction level. The retrieval depth is then scaled accordingly:

k_{\text{dyn}} = \lfloor k_{\text{base}} \cdot (1 + \delta \cdot C_q) \rfloor \quad (8)

Based on this dynamic depth, the system modulates the retrieval scope. For low-complexity queries ($C_q \to 0$), the system retrieves only the top-$k_{\min}$ high-level abstract memory entries or metadata summaries, minimizing token usage. Conversely, for high-complexity queries ($C_q \to 1$), it expands the scope to top-$k_{\max}$, including a larger set of relevant entries, along with associated fine-grained details. The final context $\mathcal{C}_{\text{final}}$ is synthesized by concatenating these pruned results, ensuring high accuracy with minimal computational waste:

\mathcal{C}_{\text{final}} = \bigoplus_{m \in \text{Top-}k_{\text{dyn}}(\mathcal{S})} [t_m : \text{Content}(m)] \quad (9)
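Putting Eqs. (8)-(9) together, the sketch below scales the retrieval depth with the estimated complexity and concatenates the timestamped contents of the top-ranked entries. The value k_base = 3 and the 3-20 depth range follow Section 3.1; delta = 5.7 (chosen so the depth spans that range) and the entry field names are assumptions.

# Sketch of Eqs. (8)-(9): complexity-scaled retrieval depth plus context assembly.
# k_base=3 and the [3, 20] range follow Sec. 3.1; delta=5.7 is an assumed setting
# chosen only so that k_dyn spans 3..20 as C_q goes from 0 to 1.
import math

def dynamic_depth(complexity, k_base=3, delta=5.7, k_min=3, k_max=20):
    # Eq. (8): k_dyn = floor(k_base * (1 + delta * C_q)), clipped to [k_min, k_max].
    k_dyn = math.floor(k_base * (1 + delta * complexity))
    return max(k_min, min(k_max, k_dyn))

def build_context(scored_entries, complexity):
    # Eq. (9): concatenate timestamped contents of the top-k_dyn ranked entries.
    k = dynamic_depth(complexity)
    top = sorted(scored_entries, key=lambda e: e["score"], reverse=True)[:k]
    return "\n".join(f"[{e['timestamp']}] {e['content']}" for e in top)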

3 Experiments

In this section, we evaluate SimpleMem on benchmark datasets to answer the following research questions: (1) Does SimpleMem outperform other memory systems in complex long-term reasoning and temporal grounding tasks? (2) Can SimpleMem achieve a superior trade-off between retrieval accuracy and token consumption? (3) How effective are the proposed components? (4) What factors account for the observed performance and efficiency gains?

Table 1: Performance on the LoCoMo benchmark with High-Capability Models (GPT-4.1 series and Qwen3-Plus). SimpleMem achieves superior efficiency-performance balance.
Table 2: Performance on the LoCoMo benchmark with Efficient Models (small parameter counts). SimpleMem demonstrates robust performance even on 1.5B/3B models, often surpassing larger models using baseline memory systems.

3.1 Experimental Setup

Benchmark Dataset. We utilize the LoCoMo benchmark (Maharana et al., 2024), which is specifically designed to test the limits of LLMs in processing long-term conversational dependencies. The dataset comprises conversation samples ranging from 200 to 400 turns, containing complex temporal shifts and interleaved topics. The evaluation set consists of 1,986 questions categorized into four distinct reasoning types: (1) Multi-Hop Reasoning: Questions requiring the synthesis of information from multiple disjoint turns (e.g., ‘‘Based on what X said last week and Y said today...’’); (2) Temporal Reasoning: Questions testing the model’s ability to understand event sequencing and absolute timelines (e.g., ‘‘Did X happen before Y?’’); (3) Open Domain: General knowledge questions grounded in the conversation context; (4) Single Hop: Direct retrieval tasks requiring exact matching of specific facts.

Baselines. We compare SimpleMem with representative memory-augmented systems: LoCoMo (Maharana et al., 2024), ReadAgent (Lee et al., 2024), MemoryBank (Zhong et al., 2024), MemGPT (Packer et al., 2023), A-Mem (Xu et al., 2025), LightMem (Fang et al., 2025), and Mem0 (Dev & Taranjeet, 2024).

Backbone Models. To test robustness across capability scales, we instantiate each baseline and SimpleMem on multiple LLM backends: GPT-4o, GPT-4.1-mini, Qwen-Plus, Qwen2.5 (1.5B/3B), and Qwen3 (1.7B/8B).

Implementation Details. For semantic structured compression, we use a sliding window of size $W = 10$ and set the entropy-based significance threshold to $\tau = 0.35$ to filter low-information interaction content. Memory indexing is implemented using LanceDB with a multi-view design: text-embedding-3-small (1536 dimensions) for dense semantic embeddings, BM25 for sparse lexical indexing, and SQL-based metadata storage for symbolic attributes. Recursive consolidation is triggered when the average pairwise semantic similarity within a memory cluster exceeds $\tau_{\text{cluster}} = 0.85$. During retrieval, we employ adaptive query-aware retrieval, where the retrieval depth is dynamically adjusted based on estimated query complexity, ranging from $k_{\min} = 3$ for simple lookups to $k_{\max} = 20$ for complex reasoning queries.
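For reference, the sketch below collects these reported settings into a single configuration object; the field names are ours, and only the values follow the text above (see also Table 6).

# Reported SimpleMem hyperparameters gathered into one illustrative config object.
# Field names are ours; the values follow Section 3.1 and Table 6.
from dataclasses import dataclass

@dataclass
class SimpleMemConfig:
    window_size: int = 10                              # sliding-window length W
    tau_redundant: float = 0.35                        # entropy-based significance threshold
    tau_cluster: float = 0.85                          # consolidation trigger (avg. similarity)
    k_min: int = 3                                     # retrieval depth for simple lookups
    k_max: int = 20                                    # retrieval depth for complex queries
    embedding_model: str = "text-embedding-3-small"    # 1536-dim dense embeddings
    sparse_index: str = "BM25"                         # lexical layer
    vector_store: str = "LanceDB"                      # multi-view index backend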

Evaluation Metrics. We report: F1 and BLEU-1 (accuracy), Adversarial Success Rate (robustness to distractors), and Token Cost (retrieval/latency efficiency). LongMemEval-S uses its standard accuracy-style metric.

3.2 Main Results and Analysis

We evaluate SimpleMem across a diverse set of LLMs, ranging from high-capability proprietary models (GPT-4o series) to efficient open-source models (Qwen series). Tables 1 and 2 present the detailed performance comparison on the LoCoMo benchmark.

Performance on High-Capability Models. As shown in Table 1, SimpleMem consistently outperforms existing memory systems across all evaluated models. On GPT-4.1-mini, SimpleMem achieves an Average F1 of 43.24, establishing a significant margin over the strongest baseline, Mem0 (34.20), and surpassing the full-context baseline (LoCoMo, 18.70) by over 24 points. Notable gains are observed in Temporal Reasoning, where SimpleMem scores 58.62 F1 compared to Mem0's 48.91, demonstrating the effectiveness of our Semantic Structured Compression in resolving complex timelines. Similarly, on the flagship GPT-4o, SimpleMem maintains its lead with an Average F1 of 39.06, outperforming Mem0 (36.09) and A-Mem (33.45). These results confirm that the Recursive Consolidation mechanism effectively distills high-density knowledge, enabling even smaller models equipped with SimpleMem to outperform larger models using traditional memory systems.

Token Efficiency. A key strength of SimpleMem lies in its inference-time efficiency. As reported in the rightmost columns of Tables 1 and 2, full-context approaches such as LoCoMo and MemGPT consume approximately 16,900 tokens per query. In contrast, SimpleMem reduces token usage by roughly 30×, averaging 530–580 tokens per query. Furthermore, compared to optimized retrieval baselines like Mem0 (∼980 tokens) and A-Mem (∼1,200+ tokens), SimpleMem reduces token usage by 40–50% while delivering superior accuracy. For instance, on GPT-4.1-mini, SimpleMem uses only 531 tokens to achieve state-of-the-art performance, whereas ReadAgent consumes more (643 tokens) but achieves far lower accuracy (7.16 F1). This validates the efficacy of our Entropy-based Filtering and Adaptive Pruning, which strictly control context bandwidth without sacrificing information density.

Performance on Smaller Models. Table 2 highlights the ability of SimpleMem to empower smaller parameter models. On Qwen3-8b, SimpleMem achieves an impressive Average F1 of 33.45, significantly surpassing Mem0 (25.80) and LightMem (22.23). Crucially, a 3B-parameter model (Qwen2.5-3b) paired with SimpleMem achieves 17.98 F1, outperforming the same model with Mem0 (13.03) by nearly 5 points. Even on the extremely lightweight Qwen2.5-1.5b, SimpleMem maintains robust performance (25.23 F1), beating larger models using inferior memory strategies (e.g., Qwen3-1.7b with Mem0 scores 21.19).

Robustness Across Task Types. Breaking down performance by task, SimpleMem demonstrates balanced capabilities. In SingleHop QA, it consistently leads (e.g., 51.12 F1 on GPT-4.1-mini), proving precision in factual retrieval. In complex MultiHop scenarios, SimpleMem significantly outperforms Mem0 and LightMem on GPT-4.1-mini, indicating that our Molecular Representations successfully bridge disconnected facts, enabling deep reasoning without the need for expensive iterative retrieval loops.

3.3 Efficiency Analysis

We conduct a comprehensive evaluation of computational efficiency, examining both end-to-end system latency and the scalability of memory indexing and retrieval. To assess practical deployment viability, we measured the full lifecycle costs on the LoCoMo-10 dataset using GPT-4.1-mini.

As illustrated in Table 3, SimpleMem exhibits superior efficiency across all operational phases. In terms of memory construction, our system achieves the fastest processing speed at 92.6 seconds per sample. This represents a dramatic improvement over existing baselines, outperforming Mem0 by approximately 14× (1350.9s) and A-Mem by over 50× (5140.5s). This massive speedup is directly attributable to our Semantic Structured Compression pipeline, which processes data in a streamlined single pass, thereby avoiding the complex graph updates required by Mem0 or the iterative summarization overheads inherent to A-Mem.

Beyond construction, SimpleMem also maintains the lowest retrieval latency at 388.3 seconds per sample, which is approximately 33% faster than LightMem and Mem0. This gain arises from the adaptive retrieval mechanism, which dynamically limits retrieval scope and prioritizes high-level abstract representations before accessing fine-grained details. By restricting retrieval to only the most relevant memory entries, the system avoids the expensive neighbor traversal and expansion operations that commonly dominate the latency of graph-based memory systems.

When considering the total time-to-insight, SimpleMem achieves a 4× speedup over Mem0 and a 12× speedup over A-Mem. Crucially, this efficiency does not come at the expense of performance. On the contrary, SimpleMem achieves the highest Average F1 among all compared methods. These results support our central claim that structured semantic compression and adaptive retrieval produce a more compact and effective reasoning substrate than raw context retention or graph-centric memory designs, enabling a superior balance between accuracy and computational efficiency.

Table 3: Comparison of construction time, retrieval time, total experiment time, and average F1 score across different memory systems (tested on LoCoMo-10 with GPT-4.1-mini).

3.4 Ablation Study

To verify that the proposed mechanisms account for the observed computational gains, we conducted a component-wise ablation study using the GPT-4.1-mini backend. We investigate the contribution of three key components: (1) Semantic Structured Compression, (2) Recursive Consolidation, and (3) Adaptive Query-Aware Retrieval. The results are summarized in Table 4.

Table 4: Full Ablation Analysis with GPT-4.1-mini backend. The "Diff" columns indicate the percentage drop relative to the full SimpleMem model. The results confirm that each stage contributes significantly to specific reasoning capabilities.

Impact of Semantic Structured Compression. Replacing the proposed compression pipeline with standard chunk-based storage leads to a substantial degradation in temporal reasoning performance. Specifically, removing semantic structured compression reduces the Temporal F1 by 56.7%, from 58.62 to 25.40. This drop indicates that without context normalization steps such as resolving coreferences and converting relative temporal expressions into absolute timestamps, the retriever struggles to disambiguate events along the timeline. As a result, performance regresses to levels comparable to conventional retrieval-augmented generation systems that rely on raw or weakly structured context.

Impact of Recursive Consolidation. Disabling the background consolidation process results in a 31.3% decrease in multi-hop reasoning performance. Without consolidating related memory units into higher-level abstract representations, the system must retrieve a larger number of fragmented entries during reasoning. This fragmentation increases context redundancy and exhausts the available context window in complex queries, demonstrating that recursive consolidation is essential for synthesizing dispersed evidence into compact and informative representations.

Impact of Adaptive Query-Aware Retrieval. Removing the adaptive retrieval mechanism and reverting to fixed-depth retrieval primarily degrades performance on open-domain and single-hop tasks, with drops of 26.6% and 19.4%, respectively. In the absence of query-aware adjustment, the system either retrieves insufficient context for entity-specific queries or introduces excessive irrelevant information for simple queries. These results highlight the importance of dynamically modulating retrieval scope to balance relevance and efficiency during inference.

3.5 Case Study: Long-Term Temporal Grounding

To illustrate how SimpleMem handles long-horizon conversational history, Figure 3 presents a representative multi-session example spanning two weeks and approximately 24,000 raw tokens. SimpleMem filters low-information dialogue during ingestion and retains only high-utility memory entries, reducing the stored memory to about 800 tokens without losing task-relevant content.

Temporal Normalization. Relative temporal expressions such as "last week" and "yesterday" refer to different absolute times across sessions. SimpleMem resolves them into absolute timestamps at memory construction time, ensuring consistent temporal grounding over long interaction gaps.

Precise Retrieval. When queried about Sarah’s past artworks, the adaptive retrieval mechanism combines semantic relevance with symbolic constraints to exclude unrelated activities and retrieve only temporally valid entries. The system correctly identifies relevant paintings while ignoring semantically related but irrelevant topics. This example demonstrates how structured compression, temporal normalization, and adaptive retrieval jointly enable reliable long-term reasoning under extended interaction histories.

Figure 3: A Case of SimpleMem for Long-Term Multi-Session Dialogues. SimpleMem processes multi-session dialogues by filtering redundant content, normalizing temporal references, and organizing memories into compact representations. During retrieval, it adaptively combines semantic, lexical, and symbolic signals to select relevant entries.

4 Related Work

Memory Systems for LLM Agents. Recent approaches manage memory through virtual context or structured representations. Virtual context methods, including MemGPT (Packer et al., 2023), MemoryOS (Kang et al., 2025), and SCM (Wang et al., 2023), extend interaction length via paging or stream-based controllers (Wang et al., 2024b) but typically store raw conversation logs, leading to redundancy and increasing processing costs. In parallel, structured and graph-based systems, such as MemoryBank (Zhong et al., 2024), Mem0 (Dev & Taranjeet, 2024), Zep (Rasmussen et al., 2025), A-Mem (Xu et al., 2025), and O-Mem (Wang et al., 2025), impose structural priors to improve coherence but still rely on raw or minimally processed text, preserving referential and temporal ambiguities that degrade long-term retrieval. In contrast, SimpleMem adopts a semantic compression mechanism that converts dialogue into independent, self-contained facts, explicitly resolving referential and temporal ambiguities prior to storage.

Context Management and Retrieval Efficiency. Beyond memory storage, efficient access to historical information remains a core challenge. Existing approaches primarily rely on either long-context models or retrieval-augmented generation (RAG). Although recent LLMs support extended context windows (OpenAI, 2025; Deepmind, 2025; Anthropic, 2025), and prompt compression methods aim to reduce costs (Jiang et al., 2023a; Liskavetsky et al., 2025), empirical studies reveal the “Lost-in-the-Middle” effect (Liu et al., 2023; Kuratov et al., 2024), where reasoning performance degrades as context length increases, alongside prohibitive computational overhead for lifelong agents. RAG-based methods (Lewis et al., 2020; Asai et al., 2023; Jiang et al., 2023b), including structurally enhanced variants such as GraphRAG (Edge et al., 2024; Zhao et al., 2025) and LightRAG (Guo et al., 2024), decouple memory from inference but are largely optimized for static knowledge bases, limiting their effectiveness for dynamic, time-sensitive episodic memory. In contrast, SimpleMem improves retrieval efficiency through Adaptive Pruning and Retrieval, jointly leveraging semantic, lexical, and metadata signals to enable precise filtering by entities and timestamps, while dynamically adjusting retrieval depth based on query complexity to minimize token usage.

5 Conclusion

We introduce SimpleMem, an efficient memory architecture governed by the principle of Semantic Lossless Compression. By reimagining memory as a metabolic process, SimpleMem implements a dynamic continuum: Semantic Structured Compression to filter noise at the source, Recursive Consolidation to evolve fragmented facts into high-order molecular insights, and Adaptive Query-Aware Retrieval to dynamically modulate retrieval bandwidth. Empirical evaluation on the LoCoMo benchmark demonstrates the effectiveness and efficiency of SimpleMem.

Acknowledgement

This work is partially supported by Amazon Research Award, Cisco Faculty Research Award, and Coefficient Giving.

References

Appendix A Detailed System Prompts

To ensure full reproducibility of the SimpleMem pipeline, we provide the exact system prompts used in the key processing stages. All prompts are designed to be model-agnostic but were optimized for GPT-4o-mini in our experiments to ensure cognitive economy.

A.1 Stage 1: Semantic Structured Compression Prompt

This prompt performs entropy-aware filtering and context normalization. Its goal is to transform raw dialogue windows into compact, context-independent memory units while excluding low-information interaction content.

Listing 1: Prompt for Semantic Structured Compression and Normalization.

A.2 Stage 2: Adaptive Retrieval Planning Prompt

This prompt analyzes the user query prior to retrieval. Its purpose is to estimate query complexity and generate a structured retrieval plan that adapts retrieval scope accordingly.

Listing 2: Prompt for Query Analysis and Adaptive Retrieval Planning.

A.3 Stage 3: Reconstructive Synthesis Prompt

This prompt guides the final answer generation using retrieved memory. It combines high-level abstract representations with fine-grained factual details to produce a grounded response.

Listing 3: Prompt for Reconstructive Synthesis (Answer Generation).

Appendix B Extended Implementation Details and Experiments

B.1 Hyperparameter Configuration

Table 6 summarizes the hyperparameters used to obtain the results reported in Section 3. These values were selected to balance memory compactness and retrieval recall, with particular attention to the thresholds governing semantic structured compression and recursive consolidation.

B.2 Hyperparameter Sensitivity Analysis

To assess the effectiveness of semantic structured compression and to motivate the design of adaptive retrieval, we analyze system sensitivity to the number of retrieved memory entries ($k$). We vary $k$ from 1 to 20 and report the average F1 score on the LoCoMo benchmark using the GPT-4.1-mini backend.

Table 5: Performance sensitivity to retrieval count ($k$). SimpleMem demonstrates "Rapid Saturation," reaching near-optimal performance at $k = 3$ (42.85) compared to its peak at $k = 10$ (43.45). This validates the high information density of Atomic Entries, showing that very large context windows are often unnecessary for accuracy.

Table 5 provides two key observations. First, rapid performance saturation is observed at low retrieval depth. SimpleMem achieves strong performance with a single retrieved entry (35.20 F1) and reaches approximately 99% of its peak performance at $k = 3$. This behavior indicates that semantic structured compression produces memory units with high information content, often sufficient to answer a query without aggregating many fragments.

Second, robustness to increased retrieval depth distinguishes SimpleMem from baseline methods. While approaches such as MemGPT experience performance degradation at larger $k$, SimpleMem maintains stable accuracy even when retrieving up to 20 entries. This robustness enables adaptive retrieval to safely expand context for complex reasoning tasks without introducing excessive irrelevant information.

Table 6: Detailed hyperparameter configuration for SimpleMem. The system employs adaptive thresholds to balance memory compactness and retrieval effectiveness.