- Problem. LLM agents serialize everything to text (JSON function calls, printed outputs), causing context explosion, hallucination, and lost data fidelity on complex tasks.
- Solution. CaveAgent splits the agent into two streams: a semantic stream (LLM reasoning) and a runtime stream (persistent Python kernel). Heavy data stays in the runtime as native objects; the LLM generates code to manipulate them.
- Results. 28% fewer tokens and 38% fewer interaction steps on multi-turn tasks. On Tau2-bench (stateful workflows): +10.5% for Kimi K2 (Retail), +13.5% for Qwen3-Coder (Retail). On BFCL: +7.1% for DeepSeek-V3.2. Tested with DeepSeek, GPT, Claude, and Gemini.
- Viability. Enables complex, data-heavy agentic applications that were previously impossible or too expensive due to context limits.
The 28% token reduction translates almost directly into lower API costs for high-volume agentic workflows. For teams running thousands of agent sessions daily, that margin can be the difference between viable and prohibitively expensive.
A stateful system remembers past interactions and data changes across time. Think of how a web browser remembers you’re logged in between page loads, or how a shopping cart persists as you browse. Traditional LLM agents are “stateless”: each turn starts fresh, forcing everything to be re-explained in text. CaveAgent is “stateful”: variables, objects, and computations persist across turns without re-serialization.
Research overview
This paper targets a common agent failure: large, real-world data does not fit in a text-only workflow.
You ask for a Q3 revenue chart. The agent loads a 10,000-row DataFrame, prints it into the prompt, and hits the context limit. Values get dropped or hallucinated, and the chart fails.
This is the serialization bottleneck, where data must be converted to text to pass through the LLM. Most agent frameworks hit it eventually.
CaveAgent addresses this by splitting the system into two streams: a semantic stream that writes code and a runtime stream that keeps Python objects alive across turns. The model sees names and types, not raw data, and the runtime does the heavy lifting.
In the paper's experiments, this design cuts total tokens by 28% and interaction steps by 38.6% on multi-turn tasks, with the biggest gains in state-heavy retail workflows.
Figure: Key advantages of CaveAgent, the four capabilities that differentiate stateful runtime management.
Serialization: converting a data structure (like a DataFrame) into text so it can pass through an LLM's context window. This process is lossy (precision errors), expensive (token costs), and error-prone (hallucination risks).
The name references Plato’s cave allegory. Traditional agents see only “shadows” (text representations) of the real objects. CaveAgent lets the agent interact with the actual objects in the runtime, not just their textual projections.
The serialization problem
Agent architectures have evolved from simple ReAct loops to sophisticated code-generating systems. Each generation solved some problems but kept the same fundamental constraint: all data must flow through text.
Figure: The evolution of agentic tool use, from text parsing to stateful runtime management.
The progression from ReAct to JSON function calling to code-based agents like CodeAct improved expressiveness and reduced parsing errors. But every approach still requires the LLM to “see” data by converting it to text in the context window.
Consider how current agents handle a simple data query:
User: "What's the average price in my dataset?"
JSON Function Calling:
1. LLM generates: {"tool": "load_data"}
2. System returns: [full CSV as text in context]
3. LLM generates: {"tool": "calculate_avg"}
4. System returns: "Average: 42.5"
Problem: Step 2 dumps thousands of rows into the context window. Token cost explodes, and precision is lost in text conversion.
This text-centric design creates predictable failure patterns. The paper identifies three modes that compound in multi-turn tasks:
| Failure | Cause | Impact |
|---|---|---|
| Context Explosion | Data fills context window | Tasks fail |
| Hallucination | LLM invents values | Wrong results |
| Error Propagation | Bad calls corrupt state | Cascading failures |
LLMs have a fixed “context window” (the maximum text they can process at once). When an agent serializes large datasets, API responses, or conversation history into text, this window fills up. Once full, either the task fails or old information gets truncated, losing critical context.
Why code-based agents aren’t enough
Recent frameworks like CodeAct let agents write Python instead of JSON. This helps with complex logic, but they still hit the serialization wall:
# CodeAct approach
import pandas as pd
df = pd.read_csv(url)   # Data loaded into the runtime
print(df.head())        # Must print to "see" data
# Output goes into context window as text
The agent can only perceive data through print() statements. Everything still flows through text. CaveAgent breaks this constraint.
Dual-stream architecture
CaveAgent maintains two parallel contexts that stay synchronized throughout the conversation. Think of it as separating the “Brain” from the “Hands”: the Semantic Stream is the brain (reasoning, planning, deciding), while the Runtime Stream is the hands (executing, storing, computing). The brain never has to hold all the data; it just tells the hands what to do.
Figure: Dual-stream architecture (semantic + runtime); the LLM reasons with lightweight references while the kernel holds heavy data.
Semantic Stream (what the LLM sees):
- System prompt with function signatures (not implementations)
- Variable names and types (not values)
- Execution outputs (truncated if too long)
- Conversation history
Runtime Stream (what actually executes):
- Persistent IPython kernel
- All variables as native Python objects
- Full function implementations
- Complete execution state
Here’s a concrete example of how this works:
# Developer injects data into runtime
agent.inject("sales", large_dataframe)
# LLM only sees metadata in its context:
# "sales (DataFrame): 50K transactions"
# LLM generates code referencing the variable
# by name, without seeing the actual rows:
avg = sales["revenue"].mean()
The key insight: the LLM generates code that references variables by name. The runtime resolves those references to actual objects. The LLM does not need to hold a 10,000-row DataFrame in context; it just writes df.mean() and the runtime handles it.
The persistent runtime is a “Digital Twin” of your business environment. DataFrames mirror your databases, ML models represent your analytics capabilities, and connection objects link to your infrastructure. The agent doesn’t just “chat” about your data; it operates on a live, synchronized representation of your systems. Changes persist, state accumulates, and the twin evolves with every interaction.
Why this works
The architecture exploits a key property of Python: everything is an object. Functions, classes, DataFrames, database connections. All can be:
- Injected into the runtime namespace
- Referenced by the LLM via variable names
- Retrieved as native objects for downstream use
This transforms the LLM from a text processor into a runtime operator.
The “Retrieve” step returns native Python objects, not text. Your downstream code receives a DataFrame, a trained model, or a database cursor, not a string that needs parsing. This means the agent’s output can be a binary file, a database record, or a live API connection. For data engineers, this eliminates the “output parsing” nightmare of converting LLM text back into usable data structures.
Traditional agents are “process-oriented”: execute step 1, then step 2, passing text between each step. CaveAgent is “object-oriented”: the agent owns and manipulates persistent objects directly. This is the difference between scripting (run commands in sequence) and system architecture (manage stateful components). For senior developers, this shift explains why CaveAgent handles complexity better: it’s not automating a workflow, it’s operating a system.
Core mechanisms
Three mechanisms enable the dual-stream architecture:
Variable and function injection
Before the conversation starts, you inject Python objects directly into the runtime:
# Developer code
agent.inject("df", my_dataframe)
agent.inject("db", database_connection)
agent.inject("model", trained_ml_model)
The LLM sees only metadata:
Available Variables:
- df (DataFrame): Customer transactions, 50K rows
- db (DatabaseConnection): PostgreSQL connection
- model (RandomForestClassifier): Trained on 2024 data
The LLM can then write code like predictions = model.predict(df) without ever seeing the model weights or the DataFrame contents.
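To make the metadata side concrete, here is a minimal sketch of how an injected object could be summarized for the prompt; describe_variable and its exact wording are illustrative assumptions, not the paper's API.

import pandas as pd
def describe_variable(name, value, description=""):
    """Build the one-line metadata string the LLM sees for an injected object."""
    type_name = type(value).__name__
    # For DataFrames, surface shape information instead of contents.
    if isinstance(value, pd.DataFrame):
        detail = f"{len(value)} rows x {len(value.columns)} cols"
        description = f"{description}, {detail}" if description else detail
    return f"- {name} ({type_name}): {description}"
# The runtime keeps the object; the prompt only receives this string:
df = pd.DataFrame({"revenue": [10.0, 20.0, 30.0]})
print(describe_variable("sales", df, "Customer transactions"))
# -> - sales (DataFrame): Customer transactions, 3 rows x 1 cols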
Dynamic context synchronization
The runtime is “invisible” to the LLM by default. To inspect state, the agent must generate code:
# Agent generates this to "look" at data
print(df.describe())
print(f"Rows: {len(df)}")
Think of it like a remote control: the LLM uses code as a probe to “look” into the runtime, pulling only summaries into its context window. The LLM isn’t blind; it’s selective. This is why prompting agents to use print(df.describe()) instead of print(df) matters: you’re teaching the agent to query efficiently rather than dump everything.
This enforces selective attention. The agent pulls only what it needs into the token context, keeping the semantic stream lightweight.
If the agent accidentally prints too much (e.g., print(df) on a million rows), the system returns an error instead of flooding the context: “Output exceeded limit. Use summary methods.” This teaches the agent to query efficiently.
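A minimal sketch of that guard, assuming a simple character budget; the limit and the error wording are illustrative, not the framework's actual values.

MAX_OUTPUT_CHARS = 2000  # illustrative budget
def shape_output(stdout: str) -> str:
    """Truncate oversized execution output before it reaches the LLM context."""
    if len(stdout) <= MAX_OUTPUT_CHARS:
        return stdout
    # Return an instructive error instead of flooding the context window.
    return (f"Output exceeded limit ({len(stdout)} chars > {MAX_OUTPUT_CHARS}). "
            "Use summary methods such as df.describe() or df.head().")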
Security via automated policy guardrails
Before executing any code, CaveAgent parses it into an Abstract Syntax Tree (AST) and validates against configurable policy rules. Think of it as automated compliance checking for AI-generated code.
- ImportRule: Blocks dangerous modules (os, subprocess)
- FunctionRule: Prohibits eval(), exec()
- AttributeRule: Prevents sandbox escapes via __builtins__
Violations return structured errors to the agent, allowing self-correction without crashing the session.
AST validation acts as an automated policy layer between the AI and your systems. Every line of code is checked against your security rules before execution, not after. For enterprise deployments, configure your blocklist to include network libraries (requests, urllib), file system operations (open, pathlib), and shell execution. The agent receives a clear error message and can self-correct, without ever touching restricted APIs. This is governance-by-design, not governance-by-hope.
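As a rough illustration, such a check can be built on Python's standard ast module. The rule names mirror the paper's ImportRule, FunctionRule, and AttributeRule; the specific blocklists and the returned error strings are assumptions.

import ast
BLOCKED_MODULES = {"os", "subprocess"}                 # ImportRule
BLOCKED_CALLS = {"eval", "exec"}                       # FunctionRule
BLOCKED_ATTRIBUTES = {"__builtins__", "__globals__"}   # AttributeRule
def validate(code: str) -> list[str]:
    """Return policy violations; an empty list means the code may execute."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"ImportRule: import of '{alias.name}' is blocked")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                violations.append(f"ImportRule: import from '{node.module}' is blocked")
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in BLOCKED_CALLS:
                violations.append(f"FunctionRule: call to '{node.func.id}()' is blocked")
        elif isinstance(node, (ast.Attribute, ast.Name)):
            ident = node.attr if isinstance(node, ast.Attribute) else node.id
            if ident in BLOCKED_ATTRIBUTES:
                violations.append(f"AttributeRule: access to '{ident}' is blocked")
    return violations
print(validate("import os\nos.system('rm -rf /')"))
# -> ["ImportRule: import of 'os' is blocked"]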
Benchmark results
The results show consistent gains across every benchmark. CaveAgent matches or exceeds JSON function calling, with the biggest improvements on stateful, multi-turn tasks where serialization overhead compounds.
Figure: Tau2-bench success rate (%), CaveAgent vs JSON function calling on multi-turn tool-use tasks, averaged over 3 runs.
Tau2-bench: Multi-turn tool use
A benchmark that evaluates LLM agents on multi-turn, stateful workflows like booking changes and retail refunds. Unlike single-turn benchmarks, it measures whether agents can maintain context, update data correctly, and complete sequences of dependent actions across many conversation turns.
Tau2-bench tests agents on realistic scenarios: modifying flight reservations, processing retail refunds. These require maintaining state consistency across many turns.
| Model | Domain | JSON FC | CaveAgent | Gain |
|---|---|---|---|---|
| DeepSeek-V3.2 | Airline | 55.3% | 60.0% | +4.7% |
| DeepSeek-V3.2 | Retail | 77.2% | 81.9% | +4.7% |
| Qwen3-Coder | Retail | 41.2% | 54.7% | +13.5% |
| Gemini 3 Pro | Airline | 61.3% | 68.0% | +6.7% |
Key finding: Gains are largest in the Retail domain, which involves complex transaction state (shopping carts, refund policies). CaveAgent reduces the serialization errors that plague JSON-based agents on stateful tasks.
BFCL: Atomic function calling
The Berkeley Function Calling Leaderboard ranks LLMs on how accurately they invoke external tools via function-calling interfaces. It tests parameter precision, type correctness, and edge case handling. BFCL is a standard reference for comparing tool-use performance across models.
The Berkeley Function Calling Leaderboard tests precise tool invocation. CaveAgent matches or exceeds JSON function calling:
| Model | JSON FC | CaveAgent |
|---|---|---|
| DeepSeek-V3.2 | 86.9% | 94.0% (+7.1%) |
| Qwen3-Coder | 89.8% | 94.4% (+4.6%) |
| Claude Sonnet 4.5 | 94.4% | 94.4% (tie) |
Parallel execution insight: DeepSeek-V3.2 without special prompting scores only 53.1% on parallel function calls (its training biases toward sequential execution). CaveAgent achieves 94.0% without prompt engineering because Python code naturally supports parallel operations.
When a user request requires multiple independent tool calls (e.g., “Get weather in NYC and SF”), an efficient agent should call both simultaneously rather than sequentially. JSON-based agents need explicit parallel schemas. In Python code, this is just two lines: nyc = get_weather(“NYC”) and sf = get_weather(“SF”).
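For illustration, a sketch of how independent calls parallelize in generated Python; get_weather is a hypothetical injected tool, and the thread pool is just one way a code agent could express concurrency.

from concurrent.futures import ThreadPoolExecutor
def get_weather(city: str) -> str:
    return f"Weather for {city}"   # stand-in for a real injected API client
# Sequential form: two ordinary lines, no parallel schema required.
nyc = get_weather("NYC")
sf = get_weather("SF")
# Concurrent form: still plain Python, still no special JSON constructs.
with ThreadPoolExecutor() as pool:
    nyc, sf = pool.map(get_weather, ["NYC", "SF"])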
Stateful management tests
The paper introduces custom benchmarks for state persistence. These tests evaluate CaveAgent’s unique capabilities (variable persistence, multi-turn state tracking) that standard function-calling benchmarks don’t measure. Note that these are CaveAgent-only results since JSON-based agents cannot be directly evaluated on persistent state manipulation.
The following table shows accuracy rates for each test category:
| Test | DeepSeek-V3.2 | Qwen3-Coder | Gemini |
|---|---|---|---|
| Type Proficiency | 100% | 96.5% | 100% |
| Multi-Variable (25) | 100% | 90.9% | 100% |
| Multi-Turn (40) | 100% | 81.3% | 98.7% |
DeepSeek-V3.2 maintains perfect accuracy across all categories, including 40-turn conversations with numerical precision requirements. This validates that code-based state manipulation is reliable for long-horizon tasks.
Token efficiency
Token costs compound every turn. In multi-step workflows, conversation history grows linearly while the context window stays fixed. CaveAgent attacks this problem at its source: keep heavy data out of the context entirely.
Token efficiency translates directly to lower TCO for AI systems. CaveAgent reduces token consumption by 28% and interaction steps by 38% on multi-turn tasks. For enterprises evaluating long-term AI investments, this compounds: fewer tokens per session, fewer sessions per task, lower infrastructure costs, and reduced operational complexity. TCO analysis should factor in both direct API costs and the indirect costs of failed tasks and retry loops.
Figure: Token efficiency, a 28% reduction with CaveAgent, aggregated across the IoT, Finance, and E-commerce domains.
The paper benchmarks three domains (IoT, Finance, E-commerce) requiring interdependent tool operations. The following table summarizes the aggregate results:
| Metric | CaveAgent | JSON FC | CaveAgent vs JSON FC |
|---|---|---|---|
| Steps | 145 | 236 | -38.6% |
| Prompt Tok. | 444K | 660K | -32.7% |
| Completion Tok. | 59K | 43K | +36.3% |
| Total Tok. | 504K | 704K | -28.4% |
| Success | 100% | 94.6% | +5.4 pts |
Why completion tokens increase: Python code is more verbose than JSON schemas (for item in cart: vs {"tool": "iterate"}). But prompt tokens dominate total cost and accumulate every turn as conversation history grows. CaveAgent’s 32.7% reduction in prompt tokens far outweighs the completion token increase.
Data-intensive scenario
For visualization tasks requiring chart data generation:
| Method | Success | Tokens |
|---|---|---|
| CaveAgent | 90% | 405K |
| CodeAct | 40% | 1,000K |
| JSON FC | 30% | 662K |
CodeAct fails because it must print chart data to the context for extraction. JSON FC fails from context overflow before task completion. CaveAgent retrieves computed data directly from runtime variables.
Implementation blueprint
Ready to build your own? This section translates the paper’s implementation details into practical guidance. The core pattern is simple: inject objects, let the LLM write code, execute in a persistent kernel, retrieve results. The official implementation is available at github.com/acodercat/cave-agent.
Old way (JSON Function Calling):
- Define complex JSON schemas for each tool
- Map schemas to function implementations
- Parse text responses back into usable objects
CaveAgent way:
- Define Python classes with native methods
- Inject objects into the runtime
- Let the agent call methods directly
No schema middleware layer. No parsing. The agent operates on your objects natively.
Figure: The complete CaveAgent framework loop, from object injection to native retrieval.
Recommended stack
Based on the paper’s evaluation, the following components form a solid foundation:
| Component | Choice |
|---|---|
| Runtime | IPython kernel (Jupyter-compatible) |
| LLM | DeepSeek-V3.2 (best results) |
| Sandbox | Docker container |
| Security | AST validation |
For production deployments, use kernel pooling to eliminate the 1-2 second startup latency of initializing new Python environments. Pre-warm a pool of idle kernels that can be assigned to incoming requests instantly. This is the same pattern used by Jupyter Hub and cloud notebook services. Without pooling, cold start latency can exceed your LLM inference time.
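A minimal sketch of such a pool, assuming jupyter_client's KernelManager as the kernel backend; sizing, health checks, and reset-between-sessions logic are omitted.

import queue
from jupyter_client.manager import KernelManager
class KernelPool:
    """Pre-warm idle IPython kernels so sessions skip the cold-start latency."""
    def __init__(self, size: int = 4):
        self._idle: queue.Queue = queue.Queue()
        for _ in range(size):
            km = KernelManager()
            km.start_kernel()
            self._idle.put(km)
    def acquire(self) -> KernelManager:
        return self._idle.get()        # blocks until a warm kernel is free
    def release(self, km: KernelManager) -> None:
        self._idle.put(km)
    def shutdown(self) -> None:
        while not self._idle.empty():
            self._idle.get_nowait().shutdown_kernel()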
Core workflow
The interaction loop follows this pattern (a condensed code sketch appears after the list):
1. INJECT: Load objects into runtime namespace
2. DESCRIBE: Generate metadata for LLM context
3. GENERATE: LLM produces Python code
4. VALIDATE: AST security check
5. EXECUTE: Run in persistent kernel
6. SHAPE: Truncate output if needed
7. FEEDBACK: Return observation to LLM
8. REPEAT: Until task complete
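A condensed sketch of that loop; generate_code, execute, shape_output, validate, and task_complete are stand-in names for illustration, not the framework's actual interface.

def run_task(agent, llm, task: str, max_turns: int = 20):
    """Drive one task through the describe -> generate -> validate -> execute loop."""
    context = [agent.describe_namespace(), task]           # DESCRIBE (INJECT happened earlier)
    for _ in range(max_turns):
        code = llm.generate_code(context)                  # GENERATE
        violations = validate(code)                        # VALIDATE (see the AST sketch above)
        if violations:
            context.append(f"Policy error: {violations}")  # structured error enables self-correction
            continue
        result = agent.execute(code)                       # EXECUTE in the persistent kernel
        observation = shape_output(result.stdout)          # SHAPE
        context.append(observation)                        # FEEDBACK
        if result.task_complete:                           # REPEAT until the task is done
            return observation
    return None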
Key parameters
The paper uses model-specific temperature settings:
| Model | Temperature |
|---|---|
| DeepSeek-V3.2 | 0.2 |
| Qwen3-Coder | 0.2 |
| Kimi K2 | 0.6 |
| GPT-5.1 | 1.0 |
Output length limits and max turn counts are configurable; the paper’s stateful benchmarks use 20-turn scenarios.
Data structures
If you’re building a CaveAgent-style system, these are the core abstractions you’ll need:
interface InjectableVariable {
name: string;
value: any;
description: string;
type: string;
}
interface RuntimeState {
namespace: Map<string, any>;
history: ExecutionResult[];
}
interface ExecutionResult {
code: string;
stdout: string;
stderr: string;
success: boolean;
}
Migration from existing frameworks
You don’t need to rewrite your tools. If you’re using LangChain, AutoGen, or similar frameworks with JSON function calling, wrap your existing functions in Python classes and inject them:
# Existing LangChain tool
from langchain_core.tools import tool
@tool
def search_db(query: str) -> str:
    """Run a query against the database."""
    return db.execute(query)
# Wrap for CaveAgent
class DBTool:
    def search(self, q: str):
        return db.execute(q)
agent.inject("db_tool", DBTool())
The agent now calls db_tool.search("...") instead of a JSON schema. Your underlying implementation stays the same.
Common pitfalls
Based on the paper’s discussion and the nature of code-executing agents, watch for these issues:
1. Over-printing. Agents trained on CodeAct habits will print() everything. Use prompt guidance: “Access variables directly; only print summaries.”
2. Security gaps. AST validation catches obvious attacks but not all risks. Run in sandboxed containers. Never inject objects with sensitive methods.
3. State confusion. If the agent doesn’t realize a variable exists, it may re-create it. Keep variable descriptions updated after each turn.
4. Memory leaks. Long sessions accumulate variables. Implement periodic cleanup or let the agent explicitly delete unused objects; a minimal cleanup sketch follows.
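A minimal sketch of the cleanup idea from pitfall 4; the keep-list policy and the agent.namespace attribute are assumptions for illustration.

def cleanup_namespace(namespace: dict, keep: set) -> list:
    """Delete every variable not on the keep-list; return the names removed."""
    removed = [name for name in namespace if name not in keep and not name.startswith("_")]
    for name in removed:
        del namespace[name]
    return removed
# Example: after a long session, retain only what the next turn still needs.
# cleanup_namespace(agent.namespace, keep={"sales", "model"})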
Minimal example
The following snippet shows the basic injection-chat-retrieve pattern:
from caveagent import Agent
agent = Agent(model="deepseek-v3.2")
# Inject data
agent.inject(
"sales",
sales_df,
desc="Q1-Q4 sales, 50K rows"
)
# Chat
response = agent.chat(
"Average transaction by region?"
)
# Agent writes: result = sales.groupby...
# Retrieve native object
result = agent.retrieve("result")
Practical applications
The architecture suits specific use cases:
Data analysis pipelines
Alignment: Data Query and Analysis benchmarks (100% vs 80% baseline)
Load datasets once, run multiple analyses without re-serialization. The agent can build derived DataFrames, train models, and generate visualizations, all referencing persistent objects.
IoT and smart home control
Alignment: Multi-turn stateful benchmark (100% accuracy over 40 turns)
Device states persist in runtime. “Turn off all lights except the kitchen” doesn’t require re-querying every device state; the agent updates the persistent home_state object.
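To make that concrete, a toy sketch of the code the agent might generate against a persistent home_state object; the dictionary layout is an assumption for illustration.

# Injected once; persists across turns in the runtime.
home_state = {"kitchen_light": "on", "living_room_light": "on", "bedroom_light": "on"}
# "Turn off all lights except the kitchen" becomes one in-place update,
# with no re-serialization of device states into the prompt.
for device in home_state:
    if device.endswith("_light") and device != "kitchen_light":
        home_state[device] = "off"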
Financial calculations
Alignment: Financial Account scenario (100% accuracy on numerical precision)
Account balances, transaction histories, and calculations maintain exact precision. No floating-point errors from text serialization.
CaveAgent provides deterministic accuracy because calculations happen in Python (a math engine) rather than the LLM (a prediction engine). When your agent computes balance = income - expenses, the result is mathematically exact, not a statistical guess. For fintech and compliance-sensitive applications, this is the difference between “usually right” and “provably correct.”
Multi-agent coordination
Alignment: Runtime-Mediated Multi-Agent Coordination (paper Section E)
CaveAgent’s shared runtime enables coordination patterns that would require complex message-passing protocols in traditional multi-agent systems.
Figure: Multi-agent coordination via a shared runtime (town simulation); agents share state through persistent variables.
Multiple agents share a runtime namespace. Agent A updates shared_state["weather"] = "rainy"; Agent B sees it immediately without message passing.
Supply chain example: An Inventory Agent updates a shared stock_levels dictionary when shipments arrive. A Procurement Agent immediately sees the change and adjusts purchase orders, without a single chat message being sent. A Pricing Agent reads the same stock_levels to trigger dynamic pricing. State flows replace message passing, eliminating latency and ambiguity.
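A toy sketch of that pattern, with a plain dictionary standing in for the shared runtime namespace; the agent classes are hypothetical.

shared_state = {"stock_levels": {"widget": 120}}
class InventoryAgent:
    def receive_shipment(self, item: str, qty: int) -> None:
        levels = shared_state["stock_levels"]
        levels[item] = levels.get(item, 0) + qty
class ProcurementAgent:
    def needs_reorder(self, item: str, threshold: int = 100) -> bool:
        # Reads the same persistent object; no chat message was exchanged.
        return shared_state["stock_levels"].get(item, 0) < threshold
InventoryAgent().receive_shipment("widget", 30)
print(ProcurementAgent().needs_reorder("widget"))  # False: stock is now 150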
This enables:
- Supervisor agents injecting context into worker runtimes
- Swarm coordination through shared variables
- Implicit synchronization without explicit protocols
When to use CaveAgent
Use CaveAgent when:
- Tasks involve complex data structures. DataFrames, graphs, trained models. Anything that loses fidelity when serialized to text.
- Multi-turn state consistency matters. Shopping carts, device states, account balances. Where errors compound across turns.
- Token costs are significant. Long conversations with large data. The 28% token reduction compounds.
- You need object handoff. The output isn't text but a Python object for downstream processing (visualization, validation, further computation).
- Parallel tool calls are common. Code naturally expresses parallelism; JSON schemas require explicit parallel constructs.
Consider alternatives when:
- Tasks are single-turn. No state to persist. The overhead of maintaining a runtime isn't justified.
- Security is paramount. Code execution introduces attack surface. If you can't sandbox properly, JSON function calling is safer.
- The LLM struggles with code. Smaller models may generate buggy Python. CaveAgent amplifies coding ability; it does not compensate for its absence.
- Latency is critical. The runtime adds startup time. For real-time chat, the overhead may be noticeable.
Limitations
Code generation quality
CaveAgent is only as good as the LLM’s code generation. Models that hallucinate syntax or use wrong APIs will fail. The paper shows best results with DeepSeek-V3.2 and Qwen3-Coder, both strong code models.
CaveAgent’s architecture enables a powerful training signal: you can programmatically verify if an agent’s runtime state is correct without human labels. Did the DataFrame end up with the right values? Is the account balance accurate? This is the foundation for Reinforcement Learning with Verifiable Rewards (RLVR), the same technique behind reasoning models like DeepSeek-R1. CaveAgent could enable automated training pipelines where agents learn from execution outcomes, not human feedback.
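A sketch of what such a verifiable reward could look like: compare the final runtime state to a programmatic check, with no human label involved. The balance check is an assumed example, not the paper's benchmark.

import math
def reward_from_runtime(namespace: dict) -> float:
    """Return 1.0 if the agent left the runtime in the expected state, else 0.0."""
    balance = namespace.get("balance")
    expected = 12_500.75 - 3_199.50   # ground truth computed independently of the agent
    if isinstance(balance, float) and math.isclose(balance, expected):
        return 1.0
    return 0.0
# Rewards derived from execution outcomes could feed an RLVR-style training loop.
print(reward_from_runtime({"balance": 9301.25}))  # 1.0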
Sandbox complexity
Running arbitrary Python requires robust sandboxing. The AST validation catches common attacks, but determined adversaries may find bypasses. Production deployments need container isolation, resource limits, and network restrictions.
Cold start overhead
Each session initializes a Python kernel. For short interactions, this setup time may exceed the task itself. Consider kernel pooling for high-throughput applications.
Debugging opacity
When things go wrong, the dual-stream architecture complicates debugging. The LLM’s reasoning is in one stream; the actual execution state is in another. Logging and observability require capturing both.
Model compatibility
While framework-agnostic in principle, the prompting and code patterns are tuned for specific models. Adapting to new LLMs may require prompt engineering.
Paper: arXiv:2601.01569. Authors: Maohao Ran, Zhenglin Wan, Cooper Lin, et al. (HKUST, HKU, NUS, NTU).
Cite this paper
Maohao Ran, Zhenglin Wan, Cooper Lin, et al. (2026). CaveAgent: Transforming LLMs into Stateful Runtime Operators. arXiv 2026.