- The Problem. RAG vector indexes grow forever. Every document chunk stays at equal priority regardless of whether anyone ever retrieves it. This wastes compute and degrades retrieval quality over time.
- The Solution. ARM applies cognitive science to RAG memory. Frequently accessed items consolidate into stable storage. Rarely used items fade following Ebbinghaus's forgetting curve. The result is a self-regulating memory that stays lean and relevant.
- The Results. Near-perfect retrieval (Recall@5 = 1.0, NDCG@5 = 0.940) with only 22M embedding parameters. The system maintains quality while automatically pruning stale information.
Research overview
Every RAG system you've used has a memory problem it pretends doesn't exist.
When you add documents to a vector index, they stay there forever. The financial report from 2019. The deprecated API documentation. The meeting notes from a project that ended two years ago. All of it sits in the index, competing for attention during retrieval, consuming compute, and occasionally surfacing irrelevant results.
Think of the index as an endless highway with no exits. Every new document chunk is a car entering the road. Because there are no off-ramps, traffic piles up mile after mile. When a query tries to travel the road, it must crawl through an ever-longer line of cars, each one slowing the journey and obscuring the vehicle that actually carries the answer.
RAG systems convert documents into lists of numbers called "embeddings" and store them in a database. When you ask a question, the system converts your question to numbers too, then finds documents with similar numbers. The collection of stored embeddings is called a vector index.
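A minimal sketch of that similarity search, with toy three-dimensional embeddings standing in for the hundreds of dimensions real models produce:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vector index: each document chunk is stored as an embedding.
index = {
    "password reset guide": np.array([0.9, 0.1, 0.0]),
    "2019 financial report": np.array([0.0, 0.2, 0.9]),
}

query = np.array([0.8, 0.2, 0.1])  # embedding of "How do I reset my password?"
ranked = sorted(index, key=lambda doc: cosine_similarity(query, index[doc]), reverse=True)
print(ranked[0])  # -> "password reset guide"
```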
ARM (Adaptive RAG Memory) takes a different approach. Instead of treating memory as a static warehouse, it treats memory as a living system that strengthens important connections and lets unimportant ones fade.
The inspiration comes from cognitive psychology. Humans don't remember everything equally. We consolidate frequently accessed memories into long-term storage. We forget information we never revisit. This selective retention is a feature, not a bug. It keeps our minds focused on what matters.
ARM applies the same principle to RAG systems. Documents that consistently contribute to good retrievals get consolidated. Documents that never get retrieved gradually fade from the index. The result is a self-regulating memory that stays relevant without manual curation.
The static memory problem
Standard RAG architectures treat document ingestion as a one-way operation. Content goes in. Nothing comes out (unless you manually delete it).
This creates several problems that compound over time:
| Problem | Consequence |
|---|---|
| Index bloat | Retrieval slows as the index grows |
| Relevance drift | Old content competes with new |
| Compute waste | Resources spent on stale embeddings |
| Quality decay | More candidates means more noise |
Here's what that compounding looks like in dollars for a representative 1 TB index:
| Metric | Without ARM | With ARM |
|---|---|---|
| Monthly cost | $8,000–12,000 | $3,000–7,000 |
| Index size | 1 TB (full) | ~600 GB (pruned) |
| Query latency | 120ms avg | 45ms avg |
Pruning 40% of stale vectors saves $2,000–5,000/month. Over three years, that's $72,000–180,000 in savings, and the gap widens as content accumulates. ARM automates this pruning—no quarterly cleanup sprints required.
[Figure: Index Growth, Static vs. ARM. Static indexes grow forever; ARM self-regulates.]
Consider a SaaS support bot that has been running for three years. Its vector index now holds approximately 1.2 million chunks:
| Content Type | Chunks | Problem |
|---|---|---|
| Product manuals | 350k | 120k from deprecated v1.0–v2.0 docs |
| Support tickets | 420k | ~30% reference bugs fixed months ago |
| FAQ entries | 250k | 80% written before the last UI redesign |
| Internal memos | 150k | Most about retired processes |
When a user asks "How do I reset my password?", the retrieval engine compares the query against all 1.2M chunks. The outdated v1.0–v2.0 manuals (still present as 120k chunks) compete with the current v3.0 guide, often pushing the correct answer down the ranking.
Manual curation doesn't scale. Determining which documents are still relevant requires domain expertise and constant attention. Most organizations don't have the resources, so they either delete nothing (index bloat) or delete aggressively (losing valuable historical context).
The deeper issue is that static indexes have no concept of "importance" beyond the embedding similarity score. A document retrieved once three years ago looks identical to a document retrieved daily. The system lacks memory about memory.
Cognitive foundations
ARM draws from two established principles in cognitive psychology.
| Computer (Traditional RAG) | Human (ARM) |
|---|---|
| Everything stays forever | Use it or lose it |
| All items equal priority | Important = stronger |
| Linear growth until crash | Self-regulating size |
| Manual cleanup required | Automatic housekeeping |
ARM mimics the right column. That's the key insight.
Ebbinghaus's forgetting curve
In 1885, Hermann Ebbinghaus conducted experiments on memory retention. He discovered that forgetting follows a predictable exponential decay. Without reinforcement, we lose approximately 50% of new information within an hour, 70% within a day, and 90% within a week.
Retention = e^(-t/S) where t is time since learning and S is the "stability" of the memory. Higher stability means slower forgetting. Each successful recall increases stability.
```python
from math import exp

def forgetting_curve(days, stability=7.0):
    """
    Calculate memory retention.
    With stability=7, retention falls to ~37% (1/e) after 7 days.
    """
    return exp(-days / stability)

# Day 1:  87% retained
# Day 7:  37% retained
# Day 30:  1% retained (pruning territory)
```

ARM applies this principle to document chunks. When a chunk is added to the index, it starts with low stability. If it's never retrieved, its "strength" decays over time. If it's retrieved frequently, its stability increases and it decays more slowly.
Picture a sandcastle on a restless shore. When you first build it, the walls are thin and the tide can wash away the sand in minutes. Each time you return with a fresh bucket of sand, the walls thicken and resist the next wave better. But if no one comes back to repair it, the relentless tide slowly erodes the structure until it disappears beneath the surf. In ARM, retrievals are the buckets of sand. No retrievals? The tide wins.
The spacing effect
Ebbinghaus also discovered that spaced repetition is more effective than massed repetition. Reviewing information at increasing intervals produces better long-term retention than cramming.
The spacing effect is the phenomenon where information reviewed after progressively longer gaps is retained longer than information crammed in a short burst. ARM uses this principle: a document retrieved once per week for a month builds more stability than one retrieved 10 times in a single day.
ARM uses this for memory consolidation. Documents that demonstrate consistent relevance across multiple sessions (spaced retrievals) get promoted to stable memory. Documents that only appear relevant in a single burst don't qualify for consolidation.
This prevents a single popular query from permanently inflating the importance of documents that happened to match it.
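Here is a minimal sketch of how spaced retrievals could translate into stability, reusing the stability-boost rule from the strength-update algorithm shown later in this post (the assumption: successive retrievals more than a week apart earn a 10% stability bump):

```python
def simulate_stability(retrieval_days, initial_stability=7.0):
    """Grow stability only when retrievals are spaced more than 7 days apart
    (a sketch of the spacing-effect rule; the exact rule is in the paper)."""
    stability = initial_stability
    last_day = None
    for day in retrieval_days:
        if last_day is not None and day - last_day > 7:
            stability *= 1.1  # spaced retrieval: the memory becomes more durable
        last_day = day
    return stability

print(simulate_stability([1, 9, 17, 26, 34]))  # spaced over a month -> ~10.2
print(simulate_stability([1, 1, 1, 1, 1]))     # 5 retrievals in one day -> 7.0
```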
Architecture
ARM operates through three interconnected subsystems: the retrieval layer, the memory dynamics engine, and the consolidation scheduler.
[Figure: ARM Architecture. Three interconnected subsystems manage adaptive memory.]
Retrieval layer
The retrieval layer handles standard RAG operations with one addition: it records metadata about every retrieval event.
When a query arrives:
- Dense retrieval finds candidate documents by embedding similarity
- Candidates are ranked by combined relevance score
- The top-k results pass to the generation model
- Retrieval metadata logs which chunks were selected
The logging captures: which chunks were retrieved, their rank positions, whether they contributed to the final answer, and the timestamp. This data feeds the memory dynamics engine.
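A minimal sketch of what one logged retrieval event might contain; the field names here are illustrative, not the paper's schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class RetrievalEvent:
    chunk_id: str       # which chunk was retrieved
    rank: int           # its position in the top-k results
    contributed: bool   # whether it ended up supporting the generated answer
    query_id: str       # ties events from the same session together
    timestamp: datetime

log: List[RetrievalEvent] = []

def record_retrieval(chunk_id: str, rank: int, contributed: bool, query_id: str) -> None:
    """Append one event; the memory dynamics engine consumes this log later."""
    log.append(RetrievalEvent(chunk_id, rank, contributed, query_id, datetime.now()))
```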
Memory dynamics engine
The memory dynamics engine maintains a "strength" score for every chunk in the index. This score determines:
- How aggressively the chunk competes during retrieval
- Whether the chunk is eligible for consolidation
- When the chunk should be pruned
Strength updates follow the forgetting curve with modifications:
Decay: Every chunk loses strength over time according to its stability parameter. Low-stability chunks (rarely retrieved) decay quickly. High-stability chunks (frequently retrieved) decay slowly.
Reinforcement: When a chunk is retrieved and contributes to a useful response, its strength increases. The amount of increase depends on the retrieval rank (top-ranked retrievals matter more) and whether the chunk was novel (not recently retrieved).
Stability adjustment: Chunks that show consistent relevance across spaced time intervals get stability boosts. This is the spacing effect in action.
Consolidation scheduler
The consolidation scheduler is ARM's "sleep cycle"—a periodic background process that handles memory maintenance. It performs three critical functions:
1. Pruning: Removes chunks whose strength has decayed below τ_p (prune threshold). This is the forgetting mechanism in action. Chunks that haven't been retrieved within the grace period (γ steps) and have low strength get deleted from the index entirely.
2. Promotion: Elevates high-performing chunks to consolidated status. Once consolidated, a chunk's stability increases to maximum and it becomes immune to pruning. This is how the system builds durable long-term memory.
3. Statistics collection: Gathers retrieval metrics, strength distributions, and index health data. This feeds the interpretability layer—you can see exactly what the system remembers and why.
In cognitive psychology, consolidation is the process that converts short-term memories into stable long-term memories. In ARM, it means moving chunks from the dynamic memory pool (where they can decay) into permanent storage (where they're protected from pruning). Think of it as the system deciding: "This information has proven useful repeatedly. Keep it forever."
Consolidation decisions are fully interpretable. You can query the system to see why a particular chunk was promoted ("retrieved 47 times across 12 sessions with 89% contribution rate") or pruned ("zero retrievals in 90 days, strength decayed to 0.02"). This audit trail helps with debugging retrieval issues and satisfies governance requirements.
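A sketch of what such an audit query could look like against the chunk metadata defined in the implementation blueprint below; the report strings are illustrative, not the system's actual output:

```python
from datetime import datetime

def explain_decision(chunk) -> str:
    """Summarize why a chunk is consolidated, active, or a prune candidate.
    Assumes the ARMChunk fields from the implementation blueprint."""
    if chunk.consolidated:
        return (f"consolidated: retrieved {chunk.retrieval_count} times across "
                f"{chunk.session_count} sessions, stability {chunk.stability:.1f}")
    if chunk.strength < 0.05:  # tau_p, the prune threshold from the parameter table
        idle_days = (datetime.now() - chunk.last_retrieved).days
        return f"prune candidate: {idle_days} days without retrieval, strength {chunk.strength:.2f}"
    return f"active: strength {chunk.strength:.2f}, stability {chunk.stability:.1f} days"
```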
Think of the system as a seasonal garden. Seedlings sprout every season, but the gardener watches which ones bear fruit year after year. Those reliable producers are grafted onto sturdy rootstock, becoming permanent trees that keep producing without further tending. The seedlings that never flower are cleared away, making room for new growth. The consolidation scheduler is the gardener making these decisions on a regular cycle.
Memory dynamics
The interplay between decay, reinforcement, and consolidation creates emergent behavior that mimics human memory patterns.
The chart below shows three possible trajectories for a document chunk after ingestion. The gold line represents a "winner": a document that users keep retrieving. Each retrieval (marked by a dot) boosts its strength and increases its stability, slowing future decay. Eventually it crosses the consolidation threshold and becomes permanent.
The blue line shows a "one-hit wonder": retrieved once early on, then never again. That single retrieval gives it a brief boost, but without reinforcement, it fades steadily toward the pruning threshold.
The gray line is the most common path: a document that nobody ever retrieves. It decays exponentially from day one and gets pruned around day 60-90 when its strength drops below threshold.
[Figure: Memory Strength Over Time. Frequently retrieved chunks consolidate; unused chunks fade.]
Here's what this looks like with actual numbers. Take a chunk containing "The API endpoint GET /users returns a JSON list":
| Day | Event | Strength Before | Strength After |
|---|---|---|---|
| 0 | Ingested | — | 0.30 |
| 5 | Retrieved (rank 1) | 0.20 (decayed) | 0.70 (+0.50 boost) |
| 12 | No retrieval | 0.70 | 0.26 (decayed) |
| 20 | Retrieved (rank 3) | 0.09 (decayed) | 0.34 (+0.25 boost) |
| 30 | No retrieval | 0.34 | 0.07 (decayed) |
| 45 | No retrieval | 0.07 | 0.01 → pruned |
This chunk was retrieved twice but never with enough spacing to build stability. It follows the "one-hit wonder" trajectory. In contrast, a chunk from the "2024 security policy" retrieved across 10 separate sessions over three months would see its stability grow, strength stay above 0.85, and reach consolidated status by day 90.
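A sketch that roughly reproduces these trajectories, assuming daily decay via the forgetting curve and a 1/(rank + 1) reinforcement on each retrieval (the same rules used in the strength-update algorithm later); the numbers won't match the illustrative table exactly:

```python
from math import exp

def simulate_trajectory(retrievals, days=60, strength=0.30, stability=7.0):
    """Track a chunk's strength day by day; `retrievals` maps day -> retrieval rank."""
    history = []
    for day in range(days + 1):
        if day > 0:
            strength *= exp(-1 / stability)          # daily forgetting-curve decay
        if day in retrievals:
            strength += 1.0 / (retrievals[day] + 1)  # rank-dependent reinforcement
        history.append((day, round(strength, 2)))
    return history

one_hit_wonder = simulate_trajectory({5: 1, 20: 3})  # two retrievals, then silence
never_retrieved = simulate_trajectory({})            # decays toward the prune threshold
```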
What does this look like in practice? The snapshot visualization below shows the same index at three points in time. On Day 1, every document is bright and equal. Nothing has proven itself yet. By Day 30, patterns emerge: some dots glow gold (frequently retrieved), others dim (rarely used), and a few are already fading to invisibility. By Day 90, the index has self-organized. A small core of consolidated documents (green with rings) remains permanently. The noise has been pruned away.
[Figure: Memory Evolution Over Time. Watch the index self-organize as usage patterns emerge.]
Retrieval-dependent survival
Chunks that get retrieved survive. Chunks that don't get retrieved fade. This creates natural selection pressure where only relevant content persists.
The key insight is that relevance is determined empirically, not theoretically. Instead of guessing which documents might be useful, the system observes which documents actually are useful. A document might look important based on its title or source. But if users never retrieve it when asking questions, ARM treats it as noise.
This is Darwinian selection applied to information retrieval. The "fittest" documents (those that consistently help answer queries) thrive. The rest fade away. No human curator needs to decide what's important. Usage patterns make that decision automatically.
Graceful degradation
Old but occasionally relevant content doesn't disappear immediately. The exponential decay curve gives it a long tail. If a document was heavily used in the past but is now rarely needed, it remains accessible at reduced priority for an extended period.
This handles the "just in case" scenario better than hard deletion. Historical context remains available for rare queries that need it, but doesn't clutter everyday retrievals.
Self-regulation
The memory substrate expands and contracts automatically based on usage patterns:
- Heavy usage periods → more content reaches consolidation threshold → stable memory grows
- Light usage periods → decay outpaces reinforcement → index shrinks
- Topical shifts → old topics fade as new topics consolidate → smooth transition
No manual intervention required. The system adapts to changing information needs.
[Figure: The Consolidation Funnel. From noise to signal: only essential knowledge survives.]
Benchmark results
ARM was evaluated on standard information retrieval benchmarks with a focus on efficiency.
[Figure: Parameter Efficiency Comparison. ARM achieves near-SOTA quality with 5x fewer parameters.]
Retrieval quality
The headline numbers are strong:
| Metric | ARM | Static RAG |
|---|---|---|
| NDCG@5 | 0.940 | 0.912 |
| Recall@5 | 1.000 | 0.967 |
| MRR | 0.891 | 0.854 |
NDCG@5 of 0.940 means the system almost always ranks the most relevant documents at the top. Perfect Recall@5 means it never misses relevant documents in the top 5 results.
What this means for users:
- Before (Static RAG): Recall@5 = 0.967 means roughly 1 in 30 relevant answers falls outside the top 5. The best answer lands at position ~2.3 on average.
- After (ARM): Recall@5 = 1.0 means zero missed answers. The correct answer moves to position ~1.2, meaning users see the right document in the first or second slot almost every time.
NDCG@5 = 0.940: the top-5 ranking captures 94% of the ideal ordering quality, so the best available answers are almost always presented in the right order. That's the number executives care about.
Recall@5 = 1.0: on this benchmark, every relevant document appeared in the top 5. If the answer existed in the indexed documents, it was found.
MRR = 0.891: the correct answer is usually the first or second result, not buried at position 5.
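For readers who want the definitions behind these metrics, here is a small sketch computing them for a single toy query with binary relevance (the benchmark averages these over many queries):

```python
import math

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result (0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k=5):
    """Rank-discounted gain of the top-k results, normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal

retrieved = ["d7", "d2", "d9", "d1", "d4"]  # one system's top-5 for a query
relevant = {"d2", "d4"}                      # ground-truth relevant documents
print(recall_at_k(retrieved, relevant))          # 1.0   -> both relevant docs in the top 5
print(reciprocal_rank(retrieved, relevant))      # 0.5   -> first relevant doc at rank 2
print(round(ndcg_at_k(retrieved, relevant), 3))  # 0.624 -> imperfect ordering
```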
Efficiency gains
The more interesting story is parameter efficiency. ARM offers the best efficiency among ultra-efficient models (those under 25M parameters):
| System | Embedding Parameters | NDCG@5 | Efficiency Class |
|---|---|---|---|
| ARM | 22M | 0.940 | Ultra-efficient |
| Dense Passage Retriever | 110M | 0.923 | Standard |
| Contriever | 110M | 0.918 | Standard |
| ColBERT | 110M | 0.932 | Standard |
ARM achieves near-SOTA quality with 5x fewer parameters. Most retrieval systems in this benchmark class need 110M+ parameters. ARM does the same job with 22M, meaning it uses 1/5th the "brain power" while actually outperforming larger models.
ARM isn't just smaller—it's in a different efficiency class entirely. At 22M parameters, it fits the "ultra-efficient" category where most models struggle to match standard retriever quality. ARM breaks this pattern by achieving higher NDCG than 110M-parameter competitors. This makes it viable for edge deployment, mobile apps, and resource-constrained environments where standard retrievers simply won't fit.
Why this matters: smaller models mean faster queries (less computation per search), lower memory footprint (fits on cheaper hardware), and reduced hosting costs. For a production system handling millions of queries per day, this translates to thousands of dollars saved monthly.
Efficiency in practice:
| Metric | Standard Retriever (110M) | ARM (22M) |
|---|---|---|
| Query latency | ~120ms | ~30ms |
| VRAM required | ≥8 GB | ≤2 GB |
| Cost per query | ~$0.12 | ~$0.03 |
Same or better quality at 75% lower latency and 75% lower cost. This is what makes ARM viable for edge and mobile deployments where standard retrievers simply don't fit.
Model comparisons
The paper also evaluates generation quality with different LLM backends:
| Configuration | Avg. Response Time | Key-Term Coverage |
|---|---|---|
| Llama 3.1 + Static RAG | 12.4s | 67.2% |
| GPT-4o + ARM | 8.2s | 58.7% |
| Llama 3.1 + ARM | 11.1s | 71.3% |
GPT-4o with ARM delivers the fastest responses. Llama 3.1 with ARM achieves the highest coverage of key terms from reference answers. The ARM memory layer improves both configurations compared to static RAG.
Implementation blueprint
ARM's architecture can be implemented on top of existing RAG infrastructure with moderate modifications.
Recommended stack
| Component | Recommended | Notes |
|---|---|---|
| Vector DB | Milvus, Qdrant | Need support for metadata filtering and bulk updates |
| Metadata Store | PostgreSQL, Redis | Tracks strength scores and retrieval history |
| Scheduler | Celery, Temporal | Runs periodic consolidation jobs |
| Embeddings | E5-small, BGE-small | Small models sufficient given ARM's efficiency |
Core data structures
Every chunk needs additional metadata beyond standard RAG:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class ARMChunk:
    id: str
    embedding: List[float]
    content: str
    # ARM-specific
    strength: float          # Current retrieval priority
    stability: float         # Decay rate modifier
    created_at: datetime
    last_retrieved: datetime
    retrieval_count: int
    session_count: int       # Unique sessions
    consolidated: bool       # Protected from pruning
```

Strength update algorithm
After each retrieval, update chunk strengths:
```python
from math import exp
from datetime import datetime

def update_strength(chunk, rank, contributed):
    now = datetime.now()
    # Time-based decay
    days_since = (now - chunk.last_retrieved).days
    decay = exp(-days_since / chunk.stability)
    chunk.strength *= decay
    # Retrieval reinforcement
    if contributed:
        boost = 1.0 / (rank + 1)  # Top ranks boost more
        chunk.strength += boost
        chunk.retrieval_count += 1
    # Spacing effect: boost stability for spaced retrievals
    if days_since > 7:
        chunk.stability *= 1.1
    chunk.last_retrieved = now
```

Consolidation logic
Run periodically (daily or weekly):
```python
# Threshold values come from the parameter table below; MAX_STABILITY is the
# stability ceiling given to consolidated chunks (the table gives no specific value).
PRUNE_THRESHOLD = 0.05       # τ_p
CONSOLIDATE_THRESHOLD = 0.8  # τ_c
MIN_SESSIONS = 5             # n_s

def consolidate():
    # get_all_chunks() and delete_chunk() are the vector store's own operations.
    for chunk in get_all_chunks():
        # Prune if too weak (consolidated chunks are immune)
        if chunk.strength < PRUNE_THRESHOLD:
            if not chunk.consolidated:
                delete_chunk(chunk)
            continue
        # Consolidate if consistently strong
        if (chunk.strength > CONSOLIDATE_THRESHOLD and
                chunk.session_count > MIN_SESSIONS and
                not chunk.consolidated):
            chunk.consolidated = True
            chunk.stability = MAX_STABILITY
```

Key parameters
These values produced the benchmark results. The paper uses Greek notation for precision:
| Parameter | Symbol | Value | Purpose |
|---|---|---|---|
| Decay rate | α | 0.95 | Base decay multiplier per time step |
| Remembrance threshold | θ | 3 | Retrieval count needed to boost stability |
| Grace steps | γ | 5 | New chunks protected from pruning |
| Initial stability | S₀ | 7.0 | Decay time constant (days) |
| Prune threshold | τ_p | 0.05 | Minimum strength to survive |
| Consolidate threshold | τ_c | 0.8 | Strength needed for promotion |
| Min sessions | n_s | 5 | Sessions before consolidation eligible |
Unlike black-box neural approaches, ARM's parameters have clear semantics. If chunks are getting pruned too fast, increase γ (grace steps). If the index grows too large, raise θ (remembrance threshold) so fewer chunks earn stability boosts, or raise τ_p (prune threshold). You can tune the system's "memory personality" without retraining anything.
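A small sketch collecting these defaults into one place; the attribute names are ours, the values are the benchmark defaults from the table above:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ARMConfig:
    """Benchmark defaults from the parameter table; attribute names are illustrative."""
    decay_rate: float = 0.95            # α: base decay multiplier per time step
    remembrance_threshold: int = 3      # θ: retrievals needed to boost stability
    grace_steps: int = 5                # γ: new chunks protected from pruning
    initial_stability: float = 7.0      # S₀: decay time constant in days
    prune_threshold: float = 0.05       # τ_p: minimum strength to survive
    consolidate_threshold: float = 0.8  # τ_c: strength needed for promotion
    min_sessions: int = 5               # n_s: sessions before consolidation eligible

default = ARMConfig()
# Example tuning: protect new chunks longer if useful content is pruned too early.
cautious = replace(default, grace_steps=10)
```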
Pitfalls to avoid
Cold start problem. Newly added documents have no retrieval history, so ARM cannot yet tell useful content from noise, and valuable chunks could be pruned before they prove their worth. The grace period parameter (γ = 5 by default) protects new chunks from immediate pruning.
Query distribution shifts. If usage patterns change suddenly (new product launch, reorg), the memory dynamics may lag. Consider manual stability boosts for known-important new content.
Over-aggressive pruning. Setting thresholds too high can delete useful content. Monitor pruning rates and retrieval quality together. If quality drops after pruning spikes, lower the threshold.
Consolidation storms. If many chunks qualify for consolidation simultaneously, batch processing can spike resource usage. Rate-limit consolidation operations.
Practical applications
ARM's self-regulating memory suits specific deployment scenarios better than others. The key benefit: no more Knowledge Manager role. The system does its own housekeeping.
When to use ARM vs. static RAG
| Use ARM when... | Use Static RAG when... |
|---|---|
| Content changes frequently (news, support) | Archive must be complete (legal, compliance) |
| Index grows unbounded over time | Collection size is fixed |
| No dedicated curation team | Dedicated librarians maintain content |
| Relevance shifts with user behavior | All content equally important forever |
| Budget constrains compute costs | Compute budget is unlimited |
Long-running assistants
Personal AI assistants and customer support chatbots accumulate context over months or years. ARM keeps their knowledge base fresh without manual curation.
A support bot that started with v1.0 documentation naturally transitions to v3.0 as users stop asking about deprecated features. No migration project required. No quarterly "knowledge base cleanup" sprint.
Resource-constrained deployments
Edge deployments and mobile applications face memory limits. ARM's efficient parameter count (22M vs 110M+) and automatic pruning keep the footprint manageable.
An on-device assistant can maintain useful context within a fixed memory budget as the user's needs evolve.
Dynamic document collections
News organizations, research teams, and market intelligence platforms deal with constantly changing content. ARM adapts to topical shifts without reindexing.
Stories that were relevant last week naturally fade. Emerging topics gain priority through retrieval reinforcement.
Privacy and compliance
ARM's natural decay is a privacy feature, not just an efficiency trick. Data that users stop accessing gradually disappears from the active index, reducing your liability surface area for old PII.
This aligns with GDPR's data minimization principle: don't keep what you don't need. With ARM, the system enforces this automatically. Documents containing sensitive information from years ago fade if nobody queries them, without requiring manual deletion campaigns.
The interpretability of ARM's memory decisions helps with data governance. You can audit why specific documents were retained or removed, supporting regulatory requirements around data retention.
Limitations
ARM introduces complexity that static RAG systems don't have.
Additional infrastructure
The metadata store, scheduler, and consolidation jobs add operational overhead. Teams already stretched thin may find this burdensome.
Hyperparameter sensitivity
The stability, threshold, and decay parameters require tuning for each deployment. The paper provides starting values, but production environments may differ significantly.
Cold start challenges
New deployments start with no retrieval history. ARM can't make intelligent retention decisions until usage patterns emerge. The first few weeks behave like static RAG.
Evaluation complexity
Standard RAG benchmarks assume static document collections. Evaluating a system that actively modifies its index requires new methodology the paper doesn't fully address.
Single-author work
The paper comes from a solo researcher without institutional backing. While the implementation may lack the polish of larger team efforts, the underlying principles (Ebbinghaus's forgetting curve, the spacing effect) are century-old proven science with decades of validation in cognitive psychology. ARM's novelty is in application, not theory. The risk is in engineering details, not foundational concepts.
Related research: ARM focuses on selecting which memories to keep. For a complementary approach that focuses on compressing memories, see SimpleMem, which consolidates agent experiences into compact, retrievable summaries. ARM burns the hay; SimpleMem compresses it. Both solve the scaling problem differently.
Author: Okan Bursa
Cite this paper
Okan Bursa (2026). ARM: Teaching RAG Systems to Forget Like Humans. arXiv 2026.