arXiv 2025 · December 31, 2025
Applies to: Document Analysis Systems, Code Repository Search, Multi-Document QA, Long-Form Content Processing, Retrieval-Augmented Generation

Recursive Language Models: Processing Unlimited Context Through Code

Alex L. Zhang (MIT CSAIL), Tim Kraska (MIT CSAIL), Omar Khattab (MIT CSAIL)

Recursive Language Models (RLMs) enable LLMs to process arbitrarily long prompts through inference-time scaling. Instead of feeding massive contexts directly into the model, RLMs treat the prompt as an external environment variable. The model then writes Python code to examine, decompose, and recursively call itself over manageable snippets. On four diverse benchmarks, RLMs handle inputs two orders of magnitude beyond context windows while outperforming base models by 12-58 percentage points, often at lower cost.

Categories: Natural Language Processing, Large Language Models, Inference Scaling
Topics: Long Context, Recursive Systems, Code Generation, Inference-Time Compute, Context Management

Key Findings

  1. Breaks the context window barrier: handles documents 100x longer than the model's native limit without any architectural changes
  2. Treats prompts as “programmable objects”: the model writes code to peek, search, and recursively query its own input
  3. Recursive architecture wins: GPT-5 with RLM (using GPT-5-mini for sub-calls) achieved 56.5% on OOLONG versus 44% for direct GPT-5
  4. Cost-efficient at scale: on BrowseComp-Plus with 6-11M tokens, RLM averaged $0.99 per query versus $1.50-$2.75 estimated for direct ingestion
  5. Discovers smart strategies on its own: models learn to grep, chunk, and verify answers without explicit training
  6. Works with off-the-shelf models: no fine-tuning required, just a Python REPL and the ability to call sub-models

TL;DR
  1. Problem. LLMs degrade on long inputs even within their context windows, and fail entirely beyond them

  2. Solution. RLMs store the prompt in a Python REPL and let the model write code to access it, calling sub-models recursively on smaller pieces

  3. Results. On four benchmarks, RLMs outperformed base models by 12-58 percentage points and handled inputs up to 10M+ tokens

Research overview

Every LLM has a context window limit. Even frontier models hit ceilings when you need to analyze a codebase, review a legal discovery dump, or process a year of support tickets. Real-world documents routinely exceed millions of tokens.

The standard solutions all involve trade-offs. Chunking loses cross-document relationships. Summarization discards details. RAG retrieves relevant pieces but misses context that seemed irrelevant at retrieval time.

The out-of-core inspiration

Database systems solved a similar problem decades ago. When datasets exceed RAM, systems use “out-of-core” algorithms: data lives on disk, and the program cleverly fetches only what it needs into fast memory. RLMs apply this same principle to LLMs. The massive context lives outside the model, accessed through code rather than crammed into the context window.

What is a context window?

The context window is the maximum amount of text an LLM can “see” at once. Think of it like working memory. GPT-5, for example, has a 272K token context window. Anything beyond that limit simply cannot be included in a single prompt.

Recursive Language Models take a different approach. Instead of cramming everything into the context window, RLMs treat the prompt as an external object that the model interacts with through code. The model can peek at sections, search with regex, partition into chunks, and recursively call itself (or smaller models) on manageable pieces.

The key insight: the model doesn’t need to see everything at once. It needs the ability to programmatically access anything on demand.

The context window problem

Context windows create a hard ceiling on what LLMs can process. But the problem runs deeper than raw token counts.

What is “context rot”?

Even within their context windows, LLMs struggle with very long inputs. Performance degrades as context length increases. The model “forgets” information from earlier in the prompt, especially in the middle sections. This phenomenon, sometimes called the “lost in the middle” problem, means that larger context windows don’t fully solve long-document processing.

Common approaches (background)

The paper situates RLMs against several existing strategies. These are general approaches in the field, not all directly evaluated in the paper:

Context compaction / summarization: The paper directly compares against a “summary agent” baseline that iteratively summarizes context as it fills. This approach loses detail when the summary omits needed facts.

Retrieval-augmented approaches: The paper evaluates a CodeAct agent with BM25 retrieval. Retrieval works for needle-in-haystack but struggles when relevance isn’t obvious at query time.

Extended context models: Training models with longer windows helps but doesn’t eliminate context rot, the degradation the paper documents in Figure 1.

RLMs take a different approach: the context doesn’t need to fit in the window because the model accesses it programmatically.

Why task complexity matters

Context rot doesn’t affect all tasks equally. The paper’s key insight: performance degradation depends on how much of the input the task requires the model to process.

Performance vs Context Length

RLM maintains accuracy while base models degrade (data approximated from Figure 1)

Chart: Performance vs context length. X-axis shows context length from 8K to 256K tokens (all within GPT-5’s 272K limit). Y-axis shows accuracy. GPT-5 (gray) degrades as context grows due to task complexity, while RLM (gold) stays stable. Data approximated from Figure 1.

The paper characterizes tasks by “information density.” As an intuitive simplification, think of it this way:

Information density

How much of the input must be processed to produce the correct answer. A needle-in-haystack task has low density (find one fact). An aggregation task has high density (process every entry). Higher density means context rot hits harder, even on shorter inputs.

| Task | Intuitive scaling | What it requires |
|---|---|---|
| S-NIAH | Constant | Find one needle; needle size stays fixed as haystack grows |
| OOLONG | Linear | Transform and aggregate every entry in the dataset |
| OOLONG-Pairs | Quadratic | Compare all pairs of entries |
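To make these scaling classes concrete, here is a toy calculation (the entry count is hypothetical, not a figure from the paper): with n entries, a linear task touches each entry once, while a pairwise task requires n(n-1)/2 comparisons.

n = 2_000                         # hypothetical number of entries
needle_work = 1                   # S-NIAH: one fact to find, regardless of n
linear_work = n                   # OOLONG: one pass per entry
pairwise_work = n * (n - 1) // 2  # OOLONG-Pairs: every pair of entries
print(needle_work, linear_work, pairwise_work)  # 1 2000 1999000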

This explains why harder tasks fail at shorter context lengths. On OOLONG-Pairs, GPT-5 drops below 1% accuracy at 32K tokens, well within its context window. The model can fit the input, but the task complexity overwhelms it.

RLMs decompose the problem. Instead of processing all pairs in one pass, the model partitions the data and processes chunks recursively. Think of a librarian cataloging a vast library: rather than surveying every book at once, they work room by room, shelf by shelf, combining results as they go. RLMs apply the same divide-and-conquer logic to text.

How RLMs work

The architecture has three components:

  1. A Python REPL environment where the prompt is stored as a string variable
  2. A root LLM that writes code to interact with this environment
  3. A sub-model (often smaller and cheaper) that the root LLM can invoke for recursive queries

RLM Execution Flow

How the model processes unlimited context through code

Chart: RLM execution flow. The context lives in the REPL environment, not in the model’s context window. The root LLM writes code to access it, calling sub-models for recursive queries.

The execution loop

When you give an RLM a query over a long document:

  1. The document loads into the REPL as a variable (e.g., context = "...")
  2. The root LLM receives a system prompt describing the REPL environment, metadata about the context (like its length), and the user’s query
  3. The LLM writes Python code to examine the context (peek at sections, search, partition)
  4. Code executes in the REPL; output returns to the LLM
  5. The LLM can call llm_query(subset, question) to recursively query sub-models on smaller pieces
  6. The process continues until the LLM produces a final answer

Why Python specifically?

Python provides string manipulation (slicing, regex), control flow (loops, conditionals), and the ability to define functions. The REPL maintains state across turns. In principle, any programming environment would work, but Python’s ubiquity means LLMs have extensive training data for it.
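As an illustration (a hypothetical snippet, assuming the context variable described below), a single turn can combine regex search, control flow, and a helper function, and the helper remains defined for later turns:

import re

# A helper defined in one turn stays available in subsequent turns
def count_matches(pattern):
    return len(re.findall(pattern, context, flags=re.IGNORECASE))

# Loop over a few candidate keywords (example terms, chosen arbitrarily)
for term in ["refund", "timeout", "error"]:
    print(term, count_matches(term))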

What the model never sees

The root LLM never receives the full document in its context window. It sees:

  • A system prompt describing the REPL environment
  • Metadata about the context (e.g., its length)
  • The user’s query
  • Code it has written
  • Output from code execution (search results, samples, sub-model responses)

The massive context exists only as a variable in the REPL, accessed through code.

Key strategies

RLMs weren’t explicitly trained to handle long contexts. Yet they discover effective strategies through the code-writing interface. The paper documents several recurring patterns in its trajectory analysis.

The code snippets below are illustrative examples of the patterns observed, not exact reproductions from the paper.

Filtering with regex

For needle-in-haystack tasks, models use regex to narrow the search space:

import re
matches = re.findall(r'festival.*La Union', context)
print(matches[:10])

The paper shows RLM(GPT-5) using regex to search for keywords from the query and phrases it has priors about.

Peeking at structure

Before processing, models sample the context to understand its structure:

print(context[:1000])
print(f"Total length: {len(context)}")

This lets the model plan its strategy without consuming context on the full document.

Chunking and recursive sub-calls

For aggregation tasks, models split the context and process pieces recursively:

# One sub-model call per newline-delimited entry
chunks = context.split('\n')
results = []
for chunk in chunks:
    result = llm_query(chunk, "Classify this entry")
    results.append(result)

The paper observed RLM(Qwen3-Coder) chunking by newline on OOLONG tasks.
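Splitting on every newline triggers one sub-call per line, which gets expensive on large inputs. A common refinement (practitioner guidance, not from the paper; the 20,000-character budget is an arbitrary choice) is to batch lines into larger chunks before calling the sub-model:

def batch_lines(text, max_chars=20_000):
    """Group newline-delimited entries into chunks of at most max_chars characters."""
    chunks, current, size = [], [], 0
    for line in text.split('\n'):
        if current and size + len(line) > max_chars:
            chunks.append('\n'.join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append('\n'.join(current))
    return chunks

results = [llm_query(chunk, "Classify each entry, one per line")
           for chunk in batch_lines(context)]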

Variable passing for long outputs

On tasks requiring long outputs, models store sub-call results in variables and stitch them together:

results = []
for chunk in chunks:
    result = llm_query(chunk, query)
    results.append(result)
final_output = "\n".join(results)

The paper explicitly observed this pattern in OOLONG-Pairs trajectories.

Answer verification

Models sometimes verify answers by querying specific sections:

verification = llm_query(relevant_section, f"Does this confirm {answer}?")

The paper notes this can help but also cause redundant verification loops that increase cost.

Benchmark results

The paper evaluates RLMs on four benchmarks spanning different complexity classes.

RLM Performance vs Baselines

Recursive approach outperforms selected baselines across benchmarks

Chart: RLM vs baseline accuracy across benchmarks. Each bar pair shows RLM (gold) versus a selected baseline (gray). RLM outperforms on all four tasks, with the largest gains on information-dense tasks like OOLONG-Pairs.

S-NIAH: Single needle-in-haystack

Find a specific phrase hidden in a large document (based on RULER benchmark). Constant complexity (the answer exists in one location).

RLMs solve this efficiently through grepping. Performance remains stable regardless of document length because the search strategy doesn’t change.

BrowseComp-Plus: Multi-hop QA

Answer questions requiring information from multiple documents. The benchmark provides 1,000 documents per query, with total context of 6-11M tokens. Based on DeepResearch evaluation from the paper.

Cost per Query on 10M Token Documents

RLM achieves better accuracy at lower cost (costs estimated from paper)

Chart: Cost per query on BrowseComp-Plus. The x-axis shows cost in USD, y-axis shows method. RLM achieves 91% accuracy at $0.99 average cost. Comparison costs are estimated from the paper.

RLMs perform well here because they can iteratively search, retrieve relevant documents, and synthesize information across them. The paper reports RLM at $0.99 average cost versus an estimated $1.50-$2.75 for direct context ingestion, while achieving 91.3% accuracy compared to 51% for CodeAct with BM25.
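The code below sketches that search-then-synthesize pattern (illustrative only; the document separator, keywords, and candidate cap are assumptions, not details from the paper):

# Assume the corpus uses a known separator between documents (an assumption)
docs = context.split("\n===DOC===\n")

# Cheap keyword filter to shrink the candidate set before any sub-calls
candidates = [d for d in docs if "grand prix" in d.lower()][:20]

# Sub-calls extract per-document findings; a final call synthesizes them
notes = [llm_query(d, "What does this document say about the 1998 race winner?")
         for d in candidates]
print(llm_query("\n\n".join(notes),
                "Based on these notes, who won the 1998 race?"))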

OOLONG: Semantic aggregation

Transform and aggregate entries from a dataset (from the OOLONG benchmark). Linear complexity (must process each entry once). The benchmark uses 131K tokens of input context.

| Method | Score |
|---|---|
| GPT-5 direct | 44% |
| GPT-5 RLM (with GPT-5-mini sub-calls) | 56.5% |

The recursive approach outperformed direct GPT-5 by 12.5 percentage points (56.5% vs 44%). The RLM uses GPT-5 as the root model and GPT-5-mini for cheaper sub-calls, striking a balance between capability and cost.

OOLONG-Pairs: Pairwise aggregation

Compare all pairs of entries. Quadratic complexity (n entries means n² comparisons). This is a custom benchmark the authors created from OOLONG.

Base models almost completely fail on this task (0.04% F1 for GPT-5). Even when the input fits in the window, reasoning over every pairwise comparison in a single pass is intractable.

RLMs achieve 58% F1 by decomposing the problem. The paper observed models chunking by newline, storing sub-call outputs in variables, and stitching results together to form the final answer.
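One way such a decomposition might look (an illustrative sketch of the general pattern, not the paper’s exact code; the chunk size and question wording are assumptions):

from itertools import combinations

entries = [line for line in context.split('\n') if line.strip()]

# Partition entries into chunks (chunk size is a hypothetical choice)
chunk_size = 200
chunks = ['\n'.join(entries[i:i + chunk_size])
          for i in range(0, len(entries), chunk_size)]

question = "List every pair of entries that describe the same incident."
partial_results = []

# Same-chunk pairs: one sub-call per chunk
for chunk in chunks:
    partial_results.append(llm_query(chunk, question))

# Cross-chunk pairs: one sub-call per pair of chunks, with labeled blocks
for a, b in combinations(chunks, 2):
    block = f"BLOCK A:\n{a}\n\nBLOCK B:\n{b}"
    partial_results.append(llm_query(
        block,
        "List pairs with one entry from BLOCK A and one from BLOCK B "
        "that describe the same incident."))

final_output = '\n'.join(partial_results)
print(final_output[:2000])  # preview; the full result stays in a variable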

LongBench-v2 CodeQA

The fourth benchmark tests code repository understanding (from LongBench-v2): multiple-choice questions requiring reasoning across files in a codebase. Input lengths range from 23K to 4.2M tokens.

| Method | Score |
|---|---|
| GPT-5 direct | 24%* |
| Summary agent | 58% |
| GPT-5 RLM | 62% |

*Context limit exceeded on some tasks

RLM achieves 62% accuracy compared to 58% for the summarization baseline. The model uses its code environment to grep for function definitions, trace imports, and understand file relationships.
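For example (an illustrative snippet; the symbol name is hypothetical), the root model might locate a definition and its call sites with regex, then hand just that region to a sub-model:

import re

symbol = "load_config"  # hypothetical; in practice drawn from the question

definition = re.search(rf"def {symbol}\([^)]*\):", context)
call_sites = [m.start() for m in re.finditer(rf"\b{symbol}\(", context)]
print(definition.start() if definition else None, call_sites[:10])

if definition:
    start = definition.start()
    # Give the sub-model a window around the definition for detailed reasoning
    print(llm_query(context[start:start + 4000],
                    f"What does {symbol} return, and which modules call it?"))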

What each component contributes

The paper includes an ablation: “RLM with REPL, no sub-calls.” In this variant, the model still has access to the REPL environment and can write code to examine the context, but cannot call sub-models. This isolates the benefit of the programming environment from recursive decomposition.

| Method (GPT-5) | OOLONG | OOLONG-Pairs | BrowseComp+ |
|---|---|---|---|
| Base model | 44% | 0.04% | Fails* |
| RLM (REPL only, no sub-calls) | 36% | 43.9% | 88% |
| RLM (full, with sub-calls) | 56.5% | 58% | 91.3% |

*Context limit exceeded (6-11M tokens)

The ablation shows both components matter, but in different ways:

  • OOLONG: Sub-calls are essential. Without them, performance drops from 56.5% to 36% because the task requires semantic transformation of each entry.
  • OOLONG-Pairs: The REPL alone achieves 43.9% (vs 0.04% base) by using code-based processing, but sub-calls push it to 58%.
  • BrowseComp+: Filtering is key. The REPL-only variant reaches 88%, close to the full 91.3%.

When do you need sub-calls?

For search and filtering tasks (needle-in-haystack, code navigation), the REPL alone may suffice. For tasks requiring semantic transformation of each element (aggregation, pairwise comparison), recursive sub-calls provide the additional gains.

Implementation blueprint

This section combines findings from the paper with practitioner guidance. Items marked with * are extrapolations not explicitly stated in the paper.

| Component | Options | Notes |
|---|---|---|
| REPL environment | Python subprocess, Jupyter kernel | Paper uses a Python REPL. Needs persistent state across turns. |
| Root model | GPT-5, Qwen3-Coder, or similar | Paper uses these. Must have strong code generation. |
| Sub-model | GPT-5-mini or smaller | Paper uses GPT-5-mini for GPT-5 experiments. Can be cheaper than the root model. |
| Orchestration | Custom implementation | Paper provides no framework recommendation. |

Core implementation (practitioner guidance*)

A minimal RLM needs three pieces:

1. REPL wrapper

import io
import contextlib

class PythonREPL:
    def __init__(self):
        # Persistent namespace so variables and functions survive across turns
        self.globals = {}

    def execute(self, code: str) -> str:
        # Capture anything the code prints so it can be returned to the root model
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.globals)
            return buffer.getvalue()
        except Exception as e:
            return f"{buffer.getvalue()}Error: {e}"

2. Sub-model function

def llm_query(context: str, question: str) -> str:
    """Available to the root model via the REPL."""
    # sub_model is a placeholder for whatever client wraps the cheaper model
    response = sub_model.complete(
        f"Context: {context}\n\nQuestion: {question}"
    )
    return response.text

3. Root model loop

def rlm_query(document: str, question: str) -> str:
    repl = PythonREPL()
    repl.globals['context'] = document
    repl.globals['llm_query'] = llm_query

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question}
    ]

    while True:
        # root_model and the response fields below are placeholders for your client
        response = root_model.complete(messages)

        if response.contains_final_answer:
            return response.answer

        # Execute any code blocks and feed the output back as the next user turn
        code_output = repl.execute(response.code)
        messages.append({"role": "assistant", "content": response.text})
        messages.append({"role": "user", "content": f"Output: {code_output}"})

The system prompt

The paper notes that the system prompt is fixed across experiments (see Appendix D). An illustrative structure:

You have access to a Python REPL with:
- `context`: A string variable containing the full document
- `llm_query(text, question)`: Call a sub-model on a subset of text

Write Python code to analyze the context and answer the user's question.
Print intermediate results to see them.
When ready, output your final answer.

The paper found that the prompt needed model-specific tuning. For Qwen3-Coder, they added a warning against excessive sub-calls.

Key parameters

| Parameter | Value | Source |
|---|---|---|
| Recursion depth | 1 level | Paper: sub-models are LMs, not RLMs |
| Max iterations | Varies by task | Paper: trajectories vary widely in length |
| Chunk size | Model-dependent | Practitioner default: stay well under the sub-model's context limit |
| Cost limits | Essential | Paper notes large tail costs (see Figure 3) |

The paper explicitly uses recursion depth of 1 (sub-calls go to regular LLMs, not RLMs). Other parameters like chunk size weren’t specified; implementers should tune based on their sub-model’s context window.
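One way to derive a character budget per chunk from the sub-model’s context window (practitioner guidance; the 4-characters-per-token heuristic and safety margin are assumptions, not paper figures):

def max_chunk_chars(sub_model_context_tokens, reserved_tokens=2_000,
                    chars_per_token=4, safety=0.5):
    """Rough per-chunk character budget, leaving room for the question and answer."""
    usable_tokens = (sub_model_context_tokens - reserved_tokens) * safety
    return int(usable_tokens * chars_per_token)

print(max_chunk_chars(128_000))  # ~252,000 characters for a 128K-token sub-model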

Common pitfalls (practitioner guidance*)

1. Runaway costs

Models can generate many sub-queries, and the paper reports large tail costs at the 95th percentile. Consider:

  • Hard limits on sub-model calls per query (see the sketch after this list)
  • Cost tracking and alerts
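A minimal sketch of such a hard limit, wrapping the llm_query helper from the blueprint above (practitioner guidance, not from the paper):

class SubCallBudget:
    """Wraps llm_query and raises once the per-query call budget is exhausted."""
    def __init__(self, inner, max_calls=100):
        self.inner = inner
        self.max_calls = max_calls
        self.calls = 0

    def __call__(self, context: str, question: str) -> str:
        if self.calls >= self.max_calls:
            raise RuntimeError(f"Sub-call budget of {self.max_calls} exhausted")
        self.calls += 1
        return self.inner(context, question)

# In rlm_query, expose the wrapped helper instead of the raw one:
# repl.globals['llm_query'] = SubCallBudget(llm_query, max_calls=100)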

2. Infinite loops

Models sometimes get stuck repeating strategies. Add iteration limits and detection for repeated code patterns.
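A simple repeat detector (a sketch, assuming the loop structure from the blueprint above):

seen_code = set()

def is_repeat(code: str) -> bool:
    """Return True if an identical code block was already executed this session."""
    key = code.strip()
    if key in seen_code:
        return True
    seen_code.add(key)
    return False

# Inside the root model loop: stop, or inject a nudge, when is_repeat(response.code)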

3. Code execution safety

The REPL executes arbitrary Python. Standard sandboxing practices apply.

4. Context window overflow

Even with RLM, the root model’s context accumulates code and outputs. Monitor token count.

Lessons from the paper

The paper reports several practical findings:

Model-specific prompts are necessary. The authors initially used one RLM prompt for both GPT-5 and Qwen3-Coder. Qwen3-Coder made thousands of sub-calls per query, exploding costs; adding a warning against excessive sub-calls curbed the behavior.

Strong coding ability is required. The paper evaluated frontier models; smaller models may not reliably write correct Python for context manipulation.

Sub-calling patterns vary by model. GPT-5 is conservative with sub-calls (tens per query). Qwen3-Coder makes hundreds to thousands. Neither is universally better; task characteristics determine which works.

Final answer detection needs work. The current approach uses output parsing to detect completion. Models sometimes output plans as final answers, or loop trying to verify correct answers. The paper notes this as an area for improvement.

Practical applications

The paper evaluates RLMs on research benchmarks. The applications below are potential use cases that align with the evaluated task types, not empirically validated deployments.

Code repository analysis

Aligns with: LongBench-v2 CodeQA (code repository understanding)

RLMs could query entire codebases without pre-indexing, using the same grep-and-synthesize patterns observed in the paper’s trajectory analysis.

Multi-document research

Aligns with: BrowseComp-Plus (multi-hop QA across documents)

The paper’s strong results on BrowseComp-Plus suggest RLMs could handle research synthesis across large document collections.

Data aggregation tasks

Aligns with: OOLONG (semantic aggregation)

Tasks requiring transformation and aggregation of structured data, like processing logs or survey responses, match the OOLONG task structure.

When to use RLMs (and when not to)

The paper provides guidance on when to use RLMs.

Use RLMs when:

  • Input exceeds the model’s context window
  • The task requires processing most of the input (aggregation, pairwise comparison)
  • You need to iteratively search and synthesize across a large corpus
  • Cost efficiency matters at scale (RLM can be cheaper than direct ingestion for very long inputs)

Consider base models when:

  • Inputs fit comfortably in context (the paper found base LLMs outperform RLMs on small inputs)
  • The task is simple needle-in-haystack that the base model handles well
  • Predictable latency matters more than cost (RLM latency varies widely)
  • You need deterministic behavior (RLM trajectories are less predictable)

The tradeoff point

The paper observes that “the base LM outperforms RLM in the small input context regime.” RLMs add overhead: REPL setup, code generation, sub-calls. For short inputs where the base model performs well, this overhead hurts more than it helps. The crossover point depends on the task and model, but expect RLMs to win primarily on inputs that stress or exceed context limits.

Limitations

The paper explicitly discusses these limitations:

Synchronous sub-calls

The current implementation runs sub-queries sequentially. The paper notes that “alternative strategies involving asynchronous sub-calls… can potentially significantly reduce the runtime and inference cost.”

Single-level recursion

Sub-models are standard LMs, not RLMs. The paper states: “we chose to use a max recursion depth of one… we believe that future work should investigate deeper layers of recursion.”

No explicit training

The models weren’t trained specifically for RLM operation. The paper suggests that “explicitly training models to be used as RLMs… could provide additional performance improvements” and notes that “current models are inefficient decision makers over their context.”

High cost variance

RLM trajectories vary widely in length depending on the task, leading to large tail costs (see Figure 3 in the paper). The paper notes this variance as a practical consideration for deployment.

Base models can outperform on small inputs

The paper notes: “we also observe that the base LM outperforms RLM in the small input context regime.” RLMs add overhead that hurts on inputs where the base model already performs well.


Paper: arXiv:2512.24601

Authors: Alex L. Zhang, Tim Kraska, Omar Khattab (MIT CSAIL)

Blog post: alexzhang13.github.io

Original paper: arXiv · PDF · HTML

Cite this paper

Alex L. Zhang, Tim Kraska, Omar Khattab (2025). Recursive Language Models: Processing Unlimited Context Through Code. arXiv preprint arXiv:2512.24601.