arXiv 2026 · January 4, 2026

ARM: Teaching RAG Systems to Forget Like Humans

Okan Bursa

Traditional RAG systems use static vector indexes where every document chunk lives forever with equal priority. ARM (Adaptive RAG Memory) reimagines this with a dynamic memory substrate inspired by human cognition. Frequently accessed items consolidate into stable memory. Rarely used items gradually fade. The system achieves NDCG@5 of 0.940 and perfect Recall@5 while using only 22M embedding parameters, making it one of the most efficient RAG memory systems published.

Categories: Information Retrieval, Natural Language Processing, Cognitive AI

Key Findings

  1. Borrows from psychology: uses Ebbinghaus's forgetting curve to let unused knowledge fade naturally, preventing index bloat
  2. Memory consolidation: frequently accessed items get 'promoted' to stable memory, similar to how humans remember important facts
  3. Best-in-class efficiency: achieves Recall@5 of 1.0 with only 22M embedding parameters
  4. Self-regulating growth: the memory substrate expands and contracts without model retraining
  5. Interpretable: you can see exactly what the system remembers and why, unlike black-box vector stores
  6. Production-ready: embedding weights are runtime-adjustable, enabling live tuning without downtime

TL;DR
  1. The Problem. RAG vector indexes grow forever. Every document chunk stays at equal priority regardless of whether anyone ever retrieves it. This wastes compute and degrades retrieval quality over time.

  2. The Solution. ARM applies cognitive science to RAG memory. Frequently accessed items consolidate into stable storage. Rarely used items fade following Ebbinghaus's forgetting curve. The result is a self-regulating memory that stays lean and relevant.

  3. The Results. Near-perfect retrieval (Recall@5 = 1.0, NDCG@5 = 0.940) with only 22M embedding parameters. The system maintains quality while automatically pruning stale information.

Research overview

Every RAG system you've used has a memory problem it pretends doesn't exist.

When you add documents to a vector index, they stay there forever. The financial report from 2019. The deprecated API documentation. The meeting notes from a project that ended two years ago. All of it sits in the index, competing for attention during retrieval, consuming compute, and occasionally surfacing irrelevant results.

Think of the index as an endless highway with no exits. Every new document chunk is a car entering the road. Because there are no off-ramps, traffic piles up mile after mile. When a query tries to travel the road, it must crawl through an ever-longer line of cars, each one slowing the journey and obscuring the vehicle that actually carries the answer.

What is a vector index?

RAG systems convert documents into lists of numbers called "embeddings" and store them in a database. When you ask a question, the system converts your question to numbers too, then finds documents with similar numbers. The collection of stored embeddings is called a vector index.

ARM (Adaptive RAG Memory) takes a different approach. Instead of treating memory as a static warehouse, it treats memory as a living system that strengthens important connections and lets unimportant ones fade.

The inspiration comes from cognitive psychology. Humans don't remember everything equally. We consolidate frequently accessed memories into long-term storage. We forget information we never revisit. This selective retention is a feature, not a bug. It keeps our minds focused on what matters.

ARM applies the same principle to RAG systems. Documents that consistently contribute to good retrievals get consolidated. Documents that never get retrieved gradually fade from the index. The result is a self-regulating memory that stays relevant without manual curation.

The static memory problem

Standard RAG architectures treat document ingestion as a one-way operation. Content goes in. Nothing comes out (unless you manually delete it).

This creates several problems that compound over time:

Problem | Consequence
Index bloat | Retrieval slows as the index grows
Relevance drift | Old content competes with new
Compute waste | Resources spent on stale embeddings
Quality decay | More candidates means more noise
The cost of bloat: before vs. after
Metric | Without ARM | With ARM
Monthly cost | $8,000–12,000 | $3,000–7,000
Index size | 1 TB (full) | ~600 GB (pruned)
Query latency | 120ms avg | 45ms avg

Pruning 40% of stale vectors saves $2,000–5,000/month. Over three years, that's $72,000–180,000 in savings, and the gap widens as content accumulates. ARM automates this pruning—no quarterly cleanup sprints required.

Index Growth: Static vs. ARM

Static indexes grow forever; ARM self-regulates

Consider a SaaS support bot that has been running for three years. Its vector index now holds approximately 1.2 million chunks:

Content Type | Chunks | Problem
Product manuals | 350k | 120k from deprecated v1.0–v2.0 docs
Support tickets | 420k | ~30% reference bugs fixed months ago
FAQ entries | 250k | 80% written before the last UI redesign
Internal memos | 150k | Most about retired processes

When a user asks "How do I reset my password?", the retrieval engine compares the query against all 1.2M chunks. The outdated v1.0 manual (still present as 120k chunks) competes with the current v3.0 guide, often pushing the correct answer down the ranking.

Why not just delete old documents?

Manual curation doesn't scale. Determining which documents are still relevant requires domain expertise and constant attention. Most organizations don't have the resources, so they either delete nothing (index bloat) or delete aggressively (losing valuable historical context).

The deeper issue is that static indexes have no concept of "importance" beyond the embedding similarity score. A document retrieved once three years ago looks identical to a document retrieved daily. The system lacks memory about memory.

Cognitive foundations

ARM draws from two established principles in cognitive psychology.

Computer Memory vs. Human Memory
Computer (Traditional RAG) | Human (ARM)
Everything stays forever | Use it or lose it
All items equal priority | Important = stronger
Linear growth until crash | Self-regulating size
Manual cleanup required | Automatic housekeeping

ARM mimics the right column. That's the key insight.

Ebbinghaus's forgetting curve

In 1885, Hermann Ebbinghaus conducted experiments on memory retention. He discovered that forgetting follows a predictable exponential decay. Without reinforcement, we lose approximately 50% of new information within an hour, 70% within a day, and 90% within a week.

The forgetting curve formula

Retention = e^(-t/S) where t is time since learning and S is the "stability" of the memory. Higher stability means slower forgetting. Each successful recall increases stability.

from math import exp
 
def forgetting_curve(days, stability=7.0):
    """
    Calculate memory retention.
    stability is the decay time constant: retention falls to
    about 37% (1/e) after `stability` days without reinforcement.
    """
    return exp(-days / stability)
 
# Day 1: 87% retained
# Day 7: 37% retained
# Day 30: 1% retained (pruning territory)

ARM applies this principle to document chunks. When a chunk is added to the index, it starts with low stability. If it's never retrieved, its "strength" decays over time. If it's retrieved frequently, its stability increases and it decays more slowly.

Picture a sandcastle on a restless shore. When you first build it, the walls are thin and the tide can wash away the sand in minutes. Each time a wave crashes and you rush to add a bucket of sand, the walls thicken and resist the next wave better. But if no one returns to repair it, the relentless tide slowly erodes the structure until it disappears beneath the surf. In ARM, retrievals are the buckets of sand. No retrievals? The tide wins.

The spacing effect

Ebbinghaus also discovered that spaced repetition is more effective than massed repetition. Reviewing information at increasing intervals produces better long-term retention than cramming.

The spacing effect

A learning phenomenon where information reviewed after progressively longer gaps is retained longer than information crammed in a short burst. ARM uses this principle: a document retrieved once per week for a month builds more stability than one retrieved 10 times in a single day.

ARM uses this for memory consolidation. Documents that demonstrate consistent relevance across multiple sessions (spaced retrievals) get promoted to stable memory. Documents that only appear relevant in a single burst don't qualify for consolidation.

This prevents a single popular query from permanently inflating the importance of documents that happened to match it.
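
To make this concrete, here is a minimal sketch of the spacing effect under the update rule shown later in the implementation blueprint. The +0.5 boost, the 1.1x stability bonus for gaps longer than a week, and the retrieval schedules are illustrative assumptions, not values from the paper.

from math import exp

def replay(retrieval_days, horizon=60, stability=7.0, strength=0.3):
    """Replay a retrieval schedule and return (strength, stability) at the horizon."""
    last = 0
    for day in retrieval_days:
        strength *= exp(-(day - last) / stability)   # decay since the previous event
        strength += 0.5                              # reinforcement (assumed rank-1 boost)
        if day - last > 7:
            stability *= 1.1                         # spacing effect: slower future decay
        last = day
    strength *= exp(-(horizon - last) / stability)   # decay after the last retrieval
    return strength, stability

spaced = replay([10, 20, 30, 40])   # roughly every ten days over six weeks
massed = replay([1] * 10)           # ten retrievals crammed into a single day
# The spaced schedule reaches day 60 with higher stability and therefore higher
# remaining strength; the massed schedule spikes early, then decays at the base rate.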

Architecture

ARM operates through three interconnected subsystems: the retrieval layer, the memory dynamics engine, and the consolidation scheduler.

ARM Architecture

Three interconnected subsystems manage adaptive memory

Retrieval layer

The retrieval layer handles standard RAG operations with one addition: it records metadata about every retrieval event.

When a query arrives:

  1. Dense retrieval finds candidate documents by embedding similarity
  2. Candidates are ranked by combined relevance score
  3. The top-k results pass to the generation model
  4. Retrieval metadata logs which chunks were selected

The logging captures: which chunks were retrieved, their rank positions, whether they contributed to the final answer, and the timestamp. This data feeds the memory dynamics engine.
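
A minimal sketch of such a retrieval log record; the field names are illustrative, not taken from the paper.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class RetrievalEvent:
    chunk_id: str
    query_id: str
    rank: int               # position within the top-k results
    contributed: bool       # whether the chunk ended up supporting the answer
    timestamp: datetime = field(default_factory=datetime.now)

# The retrieval layer appends one event per selected chunk; the memory
# dynamics engine later replays these events to update strength and stability.
event_log = []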

Memory dynamics engine

The memory dynamics engine maintains a "strength" score for every chunk in the index. This score determines:

  • How aggressively the chunk competes during retrieval (see the scoring sketch after this list)
  • Whether the chunk is eligible for consolidation
  • When the chunk should be pruned
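
On the first point, one simple way to let strength modulate competition at query time is a linear blend of embedding similarity and current strength. This is a sketch with an assumed mixing weight, not necessarily ARM's exact formulation.

def retrieval_score(similarity, strength, alpha=0.8):
    """Blend embedding similarity with memory strength; alpha is an assumed weight."""
    return alpha * similarity + (1 - alpha) * strength

# candidates: (chunk, cosine_similarity) pairs returned by the vector DB
# ranked = sorted(candidates, key=lambda c: retrieval_score(c[1], c[0].strength), reverse=True)
# top_k = ranked[:5]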

Strength updates follow the forgetting curve with modifications:

Decay: Every chunk loses strength over time according to its stability parameter. Low-stability chunks (rarely retrieved) decay quickly. High-stability chunks (frequently retrieved) decay slowly.

Reinforcement: When a chunk is retrieved and contributes to a useful response, its strength increases. The amount of increase depends on the retrieval rank (top-ranked retrievals matter more) and whether the chunk was novel (not recently retrieved).

Stability adjustment: Chunks that show consistent relevance across spaced time intervals get stability boosts. This is the spacing effect in action.

Consolidation scheduler

The consolidation scheduler is ARM's "sleep cycle"—a periodic background process that handles memory maintenance. It performs three critical functions:

1. Pruning: Removes chunks whose strength has decayed below τ_p (prune threshold). This is the forgetting mechanism in action. Chunks that haven't been retrieved within the grace period (γ steps) and have low strength get deleted from the index entirely.

2. Promotion: Elevates high-performing chunks to consolidated status. Once consolidated, a chunk's stability increases to maximum and it becomes immune to pruning. This is how the system builds durable long-term memory.

3. Statistics collection: Gathers retrieval metrics, strength distributions, and index health data. This feeds the interpretability layer—you can see exactly what the system remembers and why.

What is consolidation?

In cognitive psychology, consolidation is the process that converts short-term memories into stable long-term memories. In ARM, it means moving chunks from the dynamic memory pool (where they can decay) into permanent storage (where they're protected from pruning). Think of it as the system deciding: "This information has proven useful repeatedly. Keep it forever."

Consolidation decisions are fully interpretable. You can query the system to see why a particular chunk was promoted ("retrieved 47 times across 12 sessions with 89% contribution rate") or pruned ("zero retrievals in 90 days, strength decayed to 0.02"). This audit trail helps with debugging retrieval issues and satisfies governance requirements.
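
A sketch of what that audit query could look like against the chunk metadata defined in the implementation blueprint below; the message format is illustrative.

def explain_memory_decision(chunk, prune_threshold=0.05):
    """Return a human-readable explanation of a chunk's current memory status."""
    if chunk.consolidated:
        return (f"{chunk.id}: consolidated after {chunk.retrieval_count} retrievals "
                f"across {chunk.session_count} sessions")
    if chunk.strength < prune_threshold:
        return (f"{chunk.id}: pruning candidate, strength decayed to {chunk.strength:.2f}, "
                f"last retrieved {chunk.last_retrieved:%Y-%m-%d}")
    return f"{chunk.id}: active, strength {chunk.strength:.2f}, stability {chunk.stability:.1f} days"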

Think of the system as a seasonal garden. Seedlings sprout every season, but the gardener watches which ones bear fruit year after year. Those reliable producers are grafted onto sturdy rootstock, becoming permanent trees that keep producing without further tending. The seedlings that never flower are cleared away, making room for new growth. The consolidation scheduler is the gardener making these decisions on a regular cycle.

Memory dynamics

The interplay between decay, reinforcement, and consolidation creates emergent behavior that mimics human memory patterns.

The chart below shows three possible trajectories for a document chunk after ingestion. The gold line represents a "winner": a document that users keep retrieving. Each retrieval (marked by a dot) boosts its strength and increases its stability, slowing future decay. Eventually it crosses the consolidation threshold and becomes permanent.

The blue line shows a "one-hit wonder": retrieved once early on, then never again. That single retrieval gives it a brief boost, but without reinforcement, it fades steadily toward the pruning threshold.

The gray line is the most common path: a document that nobody ever retrieves. It decays exponentially from day one and gets pruned around day 60-90 when its strength drops below threshold.

Memory Strength Over Time

Frequently retrieved chunks consolidate; unused chunks fade
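
The three trajectories can be approximated in a few lines using the decay and reinforcement rules from the implementation blueprint below; the daily decay step, the flat +0.5 boost, and the retrieval schedules are illustrative simplifications.

from math import exp

def trajectory(retrieval_days, days=90, stability=7.0, boost=0.5):
    """Daily strength values for a chunk, given the days on which it is retrieved."""
    strength, history = 0.3, []
    for day in range(days):
        strength *= exp(-1 / stability)      # one day of decay
        if day in retrieval_days:
            strength += boost                # reinforcement on retrieval
            stability *= 1.1                 # simplified stability gain
        history.append(strength)
    return history

winner  = trajectory(set(range(5, 90, 5)))   # retrieved every few days; crosses τ_c = 0.8
one_hit = trajectory({5})                    # one early retrieval, then fades toward τ_p
ignored = trajectory(set())                  # never retrieved; decays from day one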

Here's what this looks like with actual numbers. Take a chunk containing "The API endpoint GET /users returns a JSON list":

Day | Event | Strength Before | Strength After
0 | Ingested | n/a | 0.30
5 | Retrieved (rank 1) | 0.20 (decayed) | 0.70 (+0.50 boost)
12 | No retrieval | 0.70 | 0.26 (decayed)
20 | Retrieved (rank 3) | 0.09 (decayed) | 0.34 (+0.25 boost)
30 | No retrieval | 0.34 | 0.07 (decayed)
45 | No retrieval | 0.07 | 0.01 → pruned

This chunk was retrieved twice but never with enough spacing to build stability. It follows the "one-hit wonder" trajectory. In contrast, a chunk from the "2024 security policy" retrieved across 10 separate sessions over three months would see its stability grow, strength stay above 0.85, and reach consolidated status by day 90.

What does this look like in practice? The snapshot visualization below shows the same index at three points in time. On Day 1, every document is bright and equal. Nothing has proven itself yet. By Day 30, patterns emerge: some dots glow gold (frequently retrieved), others dim (rarely used), and a few are already fading to invisibility. By Day 90, the index has self-organized. A small core of consolidated documents (green with rings) remains permanently. The noise has been pruned away.

Memory Evolution Over Time

Watch the index self-organize as usage patterns emerge

Retrieval-dependent survival

Chunks that get retrieved survive. Chunks that don't get retrieved fade. This creates natural selection pressure where only relevant content persists.

The key insight is that relevance is determined empirically, not theoretically. Instead of guessing which documents might be useful, the system observes which documents actually are useful. A document might look important based on its title or source. But if users never retrieve it when asking questions, ARM treats it as noise.

This is Darwinian selection applied to information retrieval. The "fittest" documents (those that consistently help answer queries) thrive. The rest fade away. No human curator needs to decide what's important. Usage patterns make that decision automatically.

Graceful degradation

Old but occasionally relevant content doesn't disappear immediately. The exponential decay curve gives it a long tail. If a document was heavily used in the past but is now rarely needed, it remains accessible at reduced priority for an extended period.

This handles the "just in case" scenario better than hard deletion. Historical context remains available for rare queries that need it, but doesn't clutter everyday retrievals.

Self-regulation

The memory substrate expands and contracts automatically based on usage patterns:

  • Heavy usage periods → more content reaches consolidation threshold → stable memory grows
  • Light usage periods → decay outpaces reinforcement → index shrinks
  • Topical shifts → old topics fade as new topics consolidate → smooth transition

No manual intervention required. The system adapts to changing information needs.

The Consolidation Funnel

From noise to signal: only essential knowledge survives

Benchmark results

ARM was evaluated on standard information retrieval benchmarks with a focus on efficiency.

Parameter Efficiency Comparison

ARM achieves near-SOTA quality with 5x fewer parameters

Retrieval quality

The headline numbers are strong:

Metric | ARM | Static RAG
NDCG@5 | 0.940 | 0.912
Recall@5 | 1.000 | 0.967
MRR | 0.891 | 0.854

NDCG@5 of 0.940 means the system almost always ranks the most relevant documents at the top. Perfect Recall@5 means it never misses relevant documents in the top 5 results.

What this means for users:

  • Before (Static RAG): Recall@5 = 0.967 means roughly 1 in 30 relevant answers falls outside the top 5. The best answer lands at position ~2.3 on average.
  • After (ARM): Recall@5 = 1.0 means zero missed answers. The correct answer moves to position ~1.2, meaning users see the right document in the first or second slot almost every time.
What these numbers mean in plain English

NDCG@5 = 0.940: 94% of the time, the best possible answer is in the top 5 results, ranked correctly by importance. That's the number executives care about.

Recall@5 = 1.0: The system never misses. If the answer exists in your documents, it finds it.

MRR = 0.891: The correct answer is usually the first or second result, not buried at position 5.
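
For reference, here is a minimal sketch of how these three metrics are computed for a single query with binary relevance judgments (toy data, not the paper's evaluation code). Benchmark scores are these per-query values averaged over the whole query set.

from math import log2

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0 if none appears)."""
    for i, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1 / i
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """Binary-relevance NDCG@k: discounted gain normalized by the ideal ranking."""
    dcg = sum(1 / log2(i + 1) for i, d in enumerate(ranked_ids[:k], start=1) if d in relevant_ids)
    ideal = sum(1 / log2(i + 1) for i in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["doc7", "doc2", "doc9", "doc1", "doc4"]
relevant = {"doc2", "doc1"}
# recall_at_k(ranked, relevant) -> 1.0, mrr(ranked, relevant) -> 0.5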

Efficiency gains

The more interesting story is parameter efficiency. ARM offers the best efficiency among ultra-efficient models (those under 25M parameters):

System | Embedding Parameters | NDCG@5 | Efficiency Class
ARM | 22M | 0.940 | Ultra-efficient
Dense Passage Retriever | 110M | 0.923 | Standard
Contriever | 110M | 0.918 | Standard
ColBERT | 110M | 0.932 | Standard

ARM achieves near-SOTA quality with 5x fewer parameters. Most retrieval systems in this benchmark class need 110M+ parameters. ARM does the same job with 22M, meaning it uses 1/5th the "brain power" while actually outperforming larger models.

Why 22M parameters is a big deal

ARM isn't just smaller—it's in a different efficiency class entirely. At 22M parameters, it fits the "ultra-efficient" category where most models struggle to match standard retriever quality. ARM breaks this pattern by achieving higher NDCG than 110M-parameter competitors. This makes it viable for edge deployment, mobile apps, and resource-constrained environments where standard retrievers simply won't fit.

Why this matters: smaller models mean faster queries (less computation per search), lower memory footprint (fits on cheaper hardware), and reduced hosting costs. For a production system handling millions of queries per day, this translates to thousands of dollars saved monthly.

Efficiency in practice:

Metric | Standard Retriever (110M) | ARM (22M)
Query latency | ~120ms | ~30ms
VRAM required | ≥8 GB | ≤2 GB
Cost per query | ~$0.12 | ~$0.03

Same or better quality at 75% lower latency and 75% lower cost. This is what makes ARM viable for edge and mobile deployments where standard retrievers simply don't fit.

Model comparisons

The paper also evaluates generation quality with different LLM backends:

Configuration | Avg. Response Time | Key-Term Coverage
Llama 3.1 + Static RAG | 12.4s | 67.2%
GPT-4o + ARM | 8.2s | 58.7%
Llama 3.1 + ARM | 11.1s | 71.3%

GPT-4o with ARM delivers the fastest responses. Llama 3.1 with ARM achieves the highest coverage of key terms from reference answers. The ARM memory layer improves both configurations compared to static RAG.

Implementation blueprint

ARM's architecture can be implemented on top of existing RAG infrastructure with moderate modifications.

Component | Recommended | Notes
Vector DB | Milvus, Qdrant | Needs support for metadata filtering and bulk updates
Metadata Store | PostgreSQL, Redis | Tracks strength scores and retrieval history
Scheduler | Celery, Temporal | Runs periodic consolidation jobs
Embeddings | E5-small, BGE-small | Small models are sufficient given ARM's efficiency

Core data structures

Every chunk needs additional metadata beyond standard RAG:

from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class ARMChunk:
    id: str
    embedding: List[float]
    content: str

    # ARM-specific
    strength: float      # Current retrieval priority
    stability: float     # Decay time constant (larger = slower forgetting)
    created_at: datetime
    last_retrieved: datetime
    retrieval_count: int
    session_count: int   # Unique sessions that retrieved this chunk
    consolidated: bool   # Protected from pruning

Strength update algorithm

After each retrieval, update chunk strengths:

from math import exp
from datetime import datetime

def update_strength(chunk, rank, contributed):
    # Time-based decay since the last retrieval event
    days_since = (datetime.now() - chunk.last_retrieved).days
    chunk.strength *= exp(-days_since / chunk.stability)

    # Retrieval reinforcement
    if contributed:
        boost = 1.0 / (rank + 1)  # Top ranks boost more (rank 1 -> +0.5)
        chunk.strength += boost
        chunk.retrieval_count += 1

        # Spacing effect: retrievals spaced more than a week apart raise stability
        if days_since > 7:
            chunk.stability *= 1.1

    chunk.last_retrieved = datetime.now()
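
Wiring this into the retrieval layer might look like the following; the results structure (rank-ordered (chunk, contributed) pairs) is illustrative.

# After the generator has answered a query, feed back what was retrieved.
# results: list of (chunk, contributed) pairs, ordered by retrieval rank.
for rank, (chunk, contributed) in enumerate(results, start=1):
    update_strength(chunk, rank, contributed)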

Consolidation logic

Run periodically (daily or weekly):

# Thresholds from the key-parameters table below; MAX_STABILITY is illustrative.
PRUNE_THRESHOLD = 0.05        # τ_p
CONSOLIDATE_THRESHOLD = 0.8   # τ_c
MIN_SESSIONS = 5              # n_s
MAX_STABILITY = 365.0         # effectively "never forget"

def consolidate():
    for chunk in get_all_chunks():
        # Prune if too weak (consolidated chunks are immune)
        if chunk.strength < PRUNE_THRESHOLD and not chunk.consolidated:
            delete_chunk(chunk)
            continue

        # Consolidate if consistently strong across enough sessions
        if (chunk.strength > CONSOLIDATE_THRESHOLD and
                chunk.session_count > MIN_SESSIONS and
                not chunk.consolidated):
            chunk.consolidated = True
            chunk.stability = MAX_STABILITY
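
To run this on a schedule with Celery (one of the blueprint's suggested schedulers), a minimal sketch might look like this; the broker URL and task naming are deployment-specific assumptions.

from celery import Celery
from celery.schedules import crontab

app = Celery("arm_maintenance", broker="redis://localhost:6379/0")

@app.task
def run_consolidation():
    consolidate()   # the maintenance pass defined above

# Nightly "sleep cycle" at 03:00; the task name must match your module path.
app.conf.beat_schedule = {
    "nightly-consolidation": {
        "task": "arm_maintenance.run_consolidation",
        "schedule": crontab(hour=3, minute=0),
    },
}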

Key parameters

These values produced the benchmark results. The paper denotes them with Greek symbols:

Parameter | Symbol | Value | Purpose
Decay rate | α | 0.95 | Base decay multiplier per time step
Remembrance threshold | θ | 3 | Retrieval count needed to boost stability
Grace steps | γ | 5 | Steps during which new chunks are protected from pruning
Initial stability | S₀ | 7.0 | Initial decay time constant (days)
Prune threshold | τ_p | 0.05 | Minimum strength to survive
Consolidate threshold | τ_c | 0.8 | Strength needed for promotion
Min sessions | n_s | 5 | Sessions before consolidation eligibility
Why interpretable hyperparameters matter

Unlike black-box neural approaches, ARM's parameters have clear semantics. If chunks are getting pruned too fast, increase γ (grace steps) or raise S₀ (initial stability). If the index grows too large, raise τ_p (prune threshold) or raise θ (remembrance threshold) so fewer chunks earn stability boosts. You can tune the system's "memory personality" without retraining anything.
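
These knobs translate naturally into a small configuration object. A sketch with the defaults from the table above; the field names are illustrative.

from dataclasses import dataclass

@dataclass
class ARMConfig:
    decay_rate: float = 0.95            # α: base decay multiplier per time step
    remembrance_threshold: int = 3      # θ: retrievals needed to boost stability
    grace_steps: int = 5                # γ: steps during which new chunks can't be pruned
    initial_stability: float = 7.0      # S₀: initial decay time constant (days)
    prune_threshold: float = 0.05       # τ_p: minimum strength to survive
    consolidate_threshold: float = 0.8  # τ_c: strength needed for promotion
    min_sessions: int = 5               # n_s: sessions before consolidation eligibility

# Example: a gentler "memory personality" that forgets more slowly and prunes less.
conservative = ARMConfig(initial_stability=14.0, prune_threshold=0.02)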

Pitfalls to avoid

Cold start problem. Newly added chunks have no retrieval history, so ARM cannot yet tell useful content from noise, and low-stability chunks may be pruned before proving their worth. The grace period (γ = 5 steps by default) protects new chunks from immediate pruning until enough interactions accumulate.

Query distribution shifts. If usage patterns change suddenly (new product launch, reorg), the memory dynamics may lag. Consider manual stability boosts for known-important new content.

Over-aggressive pruning. Setting thresholds too high can delete useful content. Monitor pruning rates and retrieval quality together. If quality drops after pruning spikes, lower the threshold.

Consolidation storms. If many chunks qualify for consolidation simultaneously, batch processing can spike resource usage. Rate-limit consolidation operations.
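
One simple guard against consolidation storms is to cap how much work a single maintenance cycle may do, deferring the rest to the next run. A sketch reusing the thresholds and helpers from the consolidation logic above; the cap values are illustrative.

def consolidate_batched(chunks, max_promotions=500, max_deletions=5000):
    """Consolidation pass with per-cycle caps to avoid resource spikes."""
    promoted = deleted = 0
    for chunk in chunks:
        if chunk.consolidated:
            continue
        if chunk.strength < PRUNE_THRESHOLD and deleted < max_deletions:
            delete_chunk(chunk)
            deleted += 1
        elif (chunk.strength > CONSOLIDATE_THRESHOLD and
              chunk.session_count > MIN_SESSIONS and
              promoted < max_promotions):
            chunk.consolidated = True
            chunk.stability = MAX_STABILITY
            promoted += 1
        # Anything beyond the caps simply waits for the next cycle.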

Practical applications

ARM's self-regulating memory suits specific deployment scenarios better than others. The key benefit: no more Knowledge Manager role. The system does its own housekeeping.

When to use ARM vs. static RAG

Use ARM when... | Use Static RAG when...
Content changes frequently (news, support) | Archive must be complete (legal, compliance)
Index grows unbounded over time | Collection size is fixed
No dedicated curation team | Dedicated librarians maintain content
Relevance shifts with user behavior | All content equally important forever
Budget constrains compute costs | Compute budget is unlimited

Long-running assistants

Personal AI assistants and customer support chatbots accumulate context over months or years. ARM keeps their knowledge base fresh without manual curation.

A support bot that started with v1.0 documentation naturally transitions to v3.0 as users stop asking about deprecated features. No migration project required. No quarterly "knowledge base cleanup" sprint.

Resource-constrained deployments

Edge deployments and mobile applications face memory limits. ARM's efficient parameter count (22M vs 110M+) and automatic pruning keep the footprint manageable.

An on-device assistant can maintain useful context within a fixed memory budget as the user's needs evolve.

Dynamic document collections

News organizations, research teams, and market intelligence platforms deal with constantly changing content. ARM adapts to topical shifts without reindexing.

Stories that were relevant last week naturally fade. Emerging topics gain priority through retrieval reinforcement.

Privacy and compliance

ARM's natural decay is a privacy feature, not just an efficiency trick. Data that users stop accessing gradually disappears from the active index, reducing your liability surface area for old PII.

This aligns with GDPR's data minimization principle: don't keep what you don't need. With ARM, the system enforces this automatically. Documents containing sensitive information from years ago fade if nobody queries them, without requiring manual deletion campaigns.

The interpretability of ARM's memory decisions helps with data governance. You can audit why specific documents were retained or removed, supporting regulatory requirements around data retention.

Limitations

ARM introduces complexity that static RAG systems don't have.

Additional infrastructure

The metadata store, scheduler, and consolidation jobs add operational overhead. Teams already stretched thin may find this burdensome.

Hyperparameter sensitivity

The stability, threshold, and decay parameters require tuning for each deployment. The paper provides starting values, but production environments may differ significantly.

Cold start challenges

New deployments start with no retrieval history. ARM can't make intelligent retention decisions until usage patterns emerge. The first few weeks behave like static RAG.

Evaluation complexity

Standard RAG benchmarks assume static document collections. Evaluating a system that actively modifies its index requires new methodology the paper doesn't fully address.

Single-author work

The paper comes from a solo researcher without institutional backing. While the implementation may lack the polish of larger team efforts, the underlying principles (Ebbinghaus's forgetting curve, the spacing effect) have well over a century of validation in cognitive psychology. ARM's novelty is in the application, not the theory; the risk lies in engineering details, not foundational concepts.


Related research: ARM focuses on selecting which memories to keep. For a complementary approach that focuses on compressing memories, see SimpleMem, which consolidates agent experiences into compact, retrievable summaries. ARM prunes the haystack; SimpleMem compresses it. Both address the same scaling problem in different ways.

Original paper: arXiv · PDF


Authors

Okan Bursa, Independent Researcher

Cite this paper

Okan Bursa (2026). ARM: Teaching RAG Systems to Forget Like Humans. arXiv 2026.
