- Problem. LLM agents serialize everything to text (JSON function calls, printed outputs), causing context explosion, hallucination, and lost data fidelity on complex tasks.
- Solution. CaveAgent splits the agent into two streams: a semantic stream (LLM reasoning) and a runtime stream (persistent Python kernel). Heavy data stays in the runtime as native objects; the LLM generates code to manipulate them.
- Results. 28% fewer tokens and 38% fewer interaction steps on multi-turn tasks. On Tau2-bench (stateful workflows): +10.5% for Kimi K2 (Retail), +13.5% for Qwen3-Coder (Retail). On BFCL: +7.1% for DeepSeek-V3.2. Tested with DeepSeek, GPT, Claude, and Gemini.
- Viability. Enables complex, data-heavy agentic applications that were previously impossible or too expensive due to context limits.
The 28% token reduction translates almost directly into lower API costs for high-volume agentic workflows. For teams running thousands of agent sessions daily, that margin can be the difference between viable and prohibitively expensive.
A stateful system remembers past interactions and data changes across time. Think of how a web browser remembers you’re logged in between page loads, or how a shopping cart persists as you browse. Traditional LLM agents are “stateless”: each turn starts fresh, forcing everything to be re-explained in text. CaveAgent is “stateful”: variables, objects, and computations persist across turns without re-serialization.
Research overview
This paper targets a common agent failure: large, real-world data does not fit in a text-only workflow.
You ask for a Q3 revenue chart. The agent loads a 10,000-row DataFrame, prints it into the prompt, and hits the context limit. Values get dropped or hallucinated, and the chart fails.
This is the serialization bottleneck, where data must be converted to text to pass through the LLM. Most agent frameworks hit it eventually.
CaveAgent addresses this by splitting the system into two streams: a semantic stream that writes code and a runtime stream that keeps Python objects alive across turns. The model sees names and types, not raw data, and the runtime does the heavy lifting.
In the paper's experiments, this design cuts total tokens by 28% and interaction steps by 38.6% on multi-turn tasks, with the biggest gains in state-heavy retail workflows.
Figure: Key advantages of CaveAgent, the four capabilities that differentiate stateful runtime management.
Serialization: converting a data structure (like a DataFrame) into text so it can pass through an LLM's context window. This process is lossy (precision errors), expensive (token costs), and error-prone (hallucination risks).
The name references Plato’s cave allegory. Traditional agents see only “shadows” (text representations) of the real objects. CaveAgent lets the agent interact with the actual objects in the runtime, not just their textual projections.
The serialization problem
Agent architectures have evolved from simple ReAct loops to sophisticated code-generating systems. Each generation solved some problems but kept the same fundamental constraint: all data must flow through text.
Figure: The evolution of agentic tool use, from text parsing to stateful runtime management.
The progression from ReAct to JSON function calling to code-based agents like CodeAct improved expressiveness and reduced parsing errors. But every approach still requires the LLM to “see” data by converting it to text in the context window.
Consider how current agents handle a simple data query:
User: "What's the average price in my dataset?"
JSON Function Calling:
1. LLM generates: {"tool": "load_data"}
2. System returns: [full CSV as text in context]
3. LLM generates: {"tool": "calculate_avg"}
4. System returns: "Average: 42.5"
Problem: Step 2 dumps thousands of rows into the context window. Token cost explodes, and precision is lost in text conversion.
This text-centric design creates predictable failure patterns. The paper identifies three modes that compound in multi-turn tasks:
| Failure | Cause | Impact |
|---|---|---|
| Context Explosion | Data fills context window | Tasks fail |
| Hallucination | LLM invents values | Wrong results |
| Error Propagation | Bad calls corrupt state | Cascading failures |
LLMs have a fixed “context window” (the maximum text they can process at once). When an agent serializes large datasets, API responses, or conversation history into text, this window fills up. Once full, either the task fails or old information gets truncated, losing critical context.
Why code-based agents aren’t enough
Recent frameworks like CodeAct let agents write Python instead of JSON. This helps with complex logic, but they still hit the serialization wall:
# CodeAct approach
import pandas as pd
df = pd.read_csv(url)   # Data loaded into the runtime
print(df.head())        # Must print to "see" data
# Output goes into context window as text
The agent can only perceive data through print() statements. Everything still flows through text. CaveAgent breaks this constraint.
Dual-stream architecture
CaveAgent maintains two parallel contexts that stay synchronized throughout the conversation. Think of it as separating the “Brain” from the “Hands”: the Semantic Stream is the brain (reasoning, planning, deciding), while the Runtime Stream is the hands (executing, storing, computing). The brain never has to hold all the data; it just tells the hands what to do.
Figure: Dual-stream architecture (semantic + runtime); the LLM reasons with lightweight references while the kernel holds heavy data.
Semantic Stream (what the LLM sees):
- System prompt with function signatures (not implementations)
- Variable names and types (not values)
- Execution outputs (truncated if too long)
- Conversation history
Runtime Stream (what actually executes):
- Persistent IPython kernel
- All variables as native Python objects
- Full function implementations
- Complete execution state
Here’s a concrete example of how this works:
# Developer injects data into runtime
agent.inject("sales", large_dataframe)
# LLM only sees metadata in its context:
# "sales (DataFrame): 50K transactions"
# LLM generates code referencing the variable
# by name, without seeing the actual rows:
avg = sales["revenue"].mean()
The key insight: the LLM generates code that references variables by name. The runtime resolves those references to actual objects. The LLM does not need to hold a 10,000-row DataFrame in context; it just writes df.mean() and the runtime handles it.
The persistent runtime is a “Digital Twin” of your business environment. DataFrames mirror your databases, ML models represent your analytics capabilities, and connection objects link to your infrastructure. The agent doesn’t just “chat” about your data; it operates on a live, synchronized representation of your systems. Changes persist, state accumulates, and the twin evolves with every interaction.
Why this works
The architecture exploits a key property of Python: everything is an object. Functions, classes, DataFrames, database connections. All can be:
- Injected into the runtime namespace
- Referenced by the LLM via variable names
- Retrieved as native objects for downstream use
This transforms the LLM from a text processor into a runtime operator.
The “Retrieve” step returns native Python objects, not text. Your downstream code receives a DataFrame, a trained model, or a database cursor, not a string that needs parsing. This means the agent’s output can be a binary file, a database record, or a live API connection. For data engineers, this eliminates the “output parsing” nightmare of converting LLM text back into usable data structures.
Traditional agents are “process-oriented”: execute step 1, then step 2, passing text between each step. CaveAgent is “object-oriented”: the agent owns and manipulates persistent objects directly. This is the difference between scripting (run commands in sequence) and system architecture (manage stateful components). For senior developers, this shift explains why CaveAgent handles complexity better: it’s not automating a workflow, it’s operating a system.
Core mechanisms
Three mechanisms enable the dual-stream architecture:
Variable and function injection
Before the conversation starts, you inject Python objects directly into the runtime:
# Developer code
agent.inject("df", my_dataframe)
agent.inject("db", database_connection)
agent.inject("model", trained_ml_model)
The LLM sees only metadata:
Available Variables:
- df (DataFrame): Customer transactions, 50K rows
- db (DatabaseConnection): PostgreSQL connection
- model (RandomForestClassifier): Trained on 2024 data
The LLM can then write code like predictions = model.predict(df) without ever seeing the model weights or the DataFrame contents.
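To make the metadata side concrete, here is a minimal sketch of how an injected object could be summarized for the prompt; describe_variable and its exact wording are illustrative assumptions, not the paper's API.

import pandas as pd
def describe_variable(name, value, description=""):
    """Build the one-line metadata string the LLM sees for an injected object."""
    type_name = type(value).__name__
    # For DataFrames, surface shape information instead of contents.
    if isinstance(value, pd.DataFrame):
        detail = f"{len(value)} rows x {len(value.columns)} cols"
        description = f"{description}, {detail}" if description else detail
    return f"- {name} ({type_name}): {description}"
# The runtime keeps the object; the prompt only receives this string:
df = pd.DataFrame({"revenue": [10.0, 20.0, 30.0]})
print(describe_variable("sales", df, "Customer transactions"))
# -> - sales (DataFrame): Customer transactions, 3 rows x 1 cols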
Dynamic context synchronization
The runtime is “invisible” to the LLM by default. To inspect state, the agent must generate code:
# Agent generates this to "look" at data
print(df.describe())
print(f"Rows: {len(df)}")
Think of it like a remote control: the LLM uses code as a probe to “look” into the runtime, pulling only summaries into its context window. The LLM isn’t blind; it’s selective. This is why prompting agents to use print(df.describe()) instead of print(df) matters: you’re teaching the agent to query efficiently rather than dump everything.
This enforces selective attention. The agent pulls only what it needs into the token context, keeping the semantic stream lightweight.
If the agent accidentally prints too much (e.g., print(df) on a million rows), the system returns an error instead of flooding the context: “Output exceeded limit. Use summary methods.” This teaches the agent to query efficiently.
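A minimal sketch of that guard, assuming a simple character budget; the limit and the error wording are illustrative, not the framework's actual values.

MAX_OUTPUT_CHARS = 2000  # illustrative budget
def shape_output(stdout: str) -> str:
    """Truncate oversized execution output before it reaches the LLM context."""
    if len(stdout) <= MAX_OUTPUT_CHARS:
        return stdout
    # Return an instructive error instead of flooding the context window.
    return (f"Output exceeded limit ({len(stdout)} chars > {MAX_OUTPUT_CHARS}). "
            "Use summary methods such as df.describe() or df.head().")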
Security via automated policy guardrails
Before executing any code, CaveAgent parses it into an Abstract Syntax Tree (AST) and validates against configurable policy rules. Think of it as automated compliance checking for AI-generated code.
- ImportRule: Blocks dangerous modules (os, subprocess)
- FunctionRule: Prohibits eval(), exec()
- AttributeRule: Prevents sandbox escapes via __builtins__
Violations return structured errors to the agent, allowing self-correction without crashing the session.
AST validation acts as an automated policy layer between the AI and your systems. Every line of code is checked against your security rules before execution, not after. For enterprise deployments, configure your blocklist to include network libraries (requests, urllib), file system operations (open, pathlib), and shell execution. The agent receives a clear error message and can self-correct, without ever touching restricted APIs. This is governance-by-design, not governance-by-hope.
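As a rough illustration, such a check can be built on Python's standard ast module. The rule names mirror the paper's ImportRule, FunctionRule, and AttributeRule; the specific blocklists and the returned error strings are assumptions.

import ast
BLOCKED_MODULES = {"os", "subprocess"}                 # ImportRule
BLOCKED_CALLS = {"eval", "exec"}                       # FunctionRule
BLOCKED_ATTRIBUTES = {"__builtins__", "__globals__"}   # AttributeRule
def validate(code: str) -> list[str]:
    """Return policy violations; an empty list means the code may execute."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"ImportRule: import of '{alias.name}' is blocked")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                violations.append(f"ImportRule: import from '{node.module}' is blocked")
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in BLOCKED_CALLS:
                violations.append(f"FunctionRule: call to '{node.func.id}()' is blocked")
        elif isinstance(node, (ast.Attribute, ast.Name)):
            ident = node.attr if isinstance(node, ast.Attribute) else node.id
            if ident in BLOCKED_ATTRIBUTES:
                violations.append(f"AttributeRule: access to '{ident}' is blocked")
    return violations
print(validate("import os\nos.system('rm -rf /')"))
# -> ["ImportRule: import of 'os' is blocked"]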
Benchmark results
The results show consistent gains across every benchmark. CaveAgent matches or exceeds JSON function calling, with the biggest improvements on stateful, multi-turn tasks where serialization overhead compounds.
Figure: Tau2-bench success rate (%), CaveAgent vs JSON function calling on multi-turn tool-use tasks, averaged over 3 runs.
Tau2-bench: Multi-turn tool use
A benchmark that evaluates LLM agents on multi-turn, stateful workflows like booking changes and retail refunds. Unlike single-turn benchmarks, it measures whether agents can maintain context, update data correctly, and complete sequences of dependent actions across many conversation turns.
Tau2-bench tests agents on realistic scenarios: modifying flight reservations, processing retail refunds. These require maintaining state consistency across many turns.
| Model | Domain | JSON FC | CaveAgent | Gain |
|---|---|---|---|---|
| DeepSeek-V3.2 | Airline | 55.3% | 60.0% | +4.7% |
| DeepSeek-V3.2 | Retail | 77.2% | 81.9% | +4.7% |
| Qwen3-Coder | Retail | 41.2% | 54.7% | +13.5% |
| Gemini 3 Pro | Airline | 61.3% | 68.0% | +6.7% |
Key finding: Gains are largest in the Retail domain, which involves complex transaction state (shopping carts, refund policies). CaveAgent reduces the serialization errors that plague JSON-based agents on stateful tasks.
BFCL: Atomic function calling
The Berkeley Function Calling Leaderboard ranks LLMs on how accurately they invoke external tools via function-calling interfaces. It tests parameter precision, type correctness, and edge case handling. BFCL is a standard reference for comparing tool-use performance across models.
The Berkeley Function Calling Leaderboard tests precise tool invocation. CaveAgent matches or exceeds JSON function calling:
| Model | JSON FC | CaveAgent |
|---|---|---|
| DeepSeek-V3.2 | 86.9% | 94.0% (+7.1%) |
| Qwen3-Coder | 89.8% | 94.4% (+4.6%) |
| Claude Sonnet 4.5 | 94.4% | 94.4% (tie) |
Parallel execution insight: DeepSeek-V3.2 without special prompting scores only 53.1% on parallel function calls (its training biases toward sequential execution). CaveAgent achieves 94.0% without prompt engineering because Python code naturally supports parallel operations.
When a user request requires multiple independent tool calls (e.g., “Get weather in NYC and SF”), an efficient agent should call both simultaneously rather than sequentially. JSON-based agents need explicit parallel schemas. In Python code, this is just two lines: nyc = get_weather(“NYC”) and sf = get_weather(“SF”).
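For illustration, a sketch of how independent calls parallelize in generated Python; get_weather is a hypothetical injected tool, and the thread pool is just one way a code agent could express concurrency.

from concurrent.futures import ThreadPoolExecutor
def get_weather(city: str) -> str:
    return f"Weather for {city}"   # stand-in for a real injected API client
# Sequential form: two ordinary lines, no parallel schema required.
nyc = get_weather("NYC")
sf = get_weather("SF")
# Concurrent form: still plain Python, still no special JSON constructs.
with ThreadPoolExecutor() as pool:
    nyc, sf = pool.map(get_weather, ["NYC", "SF"])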
Stateful management tests
The paper introduces custom benchmarks for state persistence. These tests evaluate CaveAgent’s unique capabilities (variable persistence, multi-turn state tracking) that standard function-calling benchmarks don’t measure. Note that these are CaveAgent-only results since JSON-based agents cannot be directly evaluated on persistent state manipulation.
The following table shows accuracy rates for each test category:
| Test | DeepSeek-V3.2 | Qwen3-Coder | Gemini |
|---|---|---|---|
| Type Proficiency | 100% | 96.5% | 100% |
| Multi-Variable (25) | 100% | 90.9% | 100% |
| Multi-Turn (40) | 100% | 81.3% | 98.7% |
DeepSeek-V3.2 maintains perfect accuracy across all categories, including 40-turn conversations with numerical precision requirements. This validates that code-based state manipulation is reliable for long-horizon tasks.
Token efficiency
Token costs compound every turn. In multi-step workflows, conversation history grows linearly while the context window stays fixed. CaveAgent attacks this problem at its source: keep heavy data out of the context entirely.
Token efficiency translates directly to lower TCO for AI systems. CaveAgent reduces token consumption by 28% and interaction steps by 38% on multi-turn tasks. For enterprises evaluating long-term AI investments, this compounds: fewer tokens per session, fewer sessions per task, lower infrastructure costs, and reduced operational complexity. TCO analysis should factor in both direct API costs and the indirect costs of failed tasks and retry loops.
Figure: Token efficiency, a 28% reduction with CaveAgent, aggregated across the IoT, Finance, and E-commerce domains.
The paper benchmarks three domains (IoT, Finance, E-commerce) requiring interdependent tool operations. The following table summarizes the aggregate results:
| Metric | CaveAgent | JSON FC | CaveAgent vs JSON FC |
|---|---|---|---|
| Steps | 145 | 236 | -38.6% |
| Prompt Tok. | 444K | 660K | -32.7% |
| Completion Tok. | 59K | 43K | +36.3% |
| Total Tok. | 504K | 704K | -28.4% |
| Success | 100% | 94.6% | +5.4 pts |
Why completion tokens increase: Python code is more verbose than JSON schemas (for item in cart: vs {"tool": "iterate"}). But prompt tokens dominate total cost and accumulate every turn as conversation history grows. CaveAgent’s 32.7% reduction in prompt tokens far outweighs the completion token increase.
Data-intensive scenario
For visualization tasks requiring chart data generation:
| Method | Success | Tokens |
|---|---|---|
| CaveAgent | 90% | 405K |
| CodeAct | 40% | 1,000K |
| JSON FC | 30% | 662K |
CodeAct fails because it must print chart data to the context for extraction. JSON FC fails from context overflow before task completion. CaveAgent retrieves computed data directly from runtime variables.
Implementation blueprint
Ready to build your own? This section translates the paper’s implementation details into practical guidance. The core pattern is simple: inject objects, let the LLM write code, execute in a persistent kernel, retrieve results. The official implementation is available at github.com/acodercat/cave-agent.
Old way (JSON Function Calling):
- Define complex JSON schemas for each tool
- Map schemas to function implementations
- Parse text responses back into usable objects
CaveAgent way:
- Define Python classes with native methods
- Inject objects into the runtime
- Let the agent call methods directly
No schema middleware layer. No parsing. The agent operates on your objects natively.
Figure: The complete CaveAgent framework loop, from object injection to native retrieval.
Recommended stack
Based on the paper’s evaluation, the following components form a solid foundation:
| Component | Choice |
|---|---|
| Runtime | IPython kernel (Jupyter-compatible) |
| LLM | DeepSeek-V3.2 (best results) |
| Sandbox | Docker container |
| Security | AST validation |
For production deployments, use kernel pooling to eliminate the 1-2 second startup latency of initializing new Python environments. Pre-warm a pool of idle kernels that can be assigned to incoming requests instantly. This is the same pattern used by Jupyter Hub and cloud notebook services. Without pooling, cold start latency can exceed your LLM inference time.
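A minimal sketch of such a pool, assuming jupyter_client's KernelManager as the kernel backend; sizing, health checks, and reset-between-sessions logic are omitted.

import queue
from jupyter_client.manager import KernelManager
class KernelPool:
    """Pre-warm idle IPython kernels so sessions skip the cold-start latency."""
    def __init__(self, size: int = 4):
        self._idle: queue.Queue = queue.Queue()
        for _ in range(size):
            km = KernelManager()
            km.start_kernel()
            self._idle.put(km)
    def acquire(self) -> KernelManager:
        return self._idle.get()        # blocks until a warm kernel is free
    def release(self, km: KernelManager) -> None:
        self._idle.put(km)
    def shutdown(self) -> None:
        while not self._idle.empty():
            self._idle.get_nowait().shutdown_kernel()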
Core workflow
The interaction loop follows this pattern (a condensed code sketch appears after the list):
1. INJECT: Load objects into runtime namespace
2. DESCRIBE: Generate metadata for LLM context
3. GENERATE: LLM produces Python code
4. VALIDATE: AST security check
5. EXECUTE: Run in persistent kernel
6. SHAPE: Truncate output if needed
7. FEEDBACK: Return observation to LLM
8. REPEAT: Until task complete
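A condensed sketch of that loop; generate_code, execute, shape_output, validate, and task_complete are stand-in names for illustration, not the framework's actual interface.

def run_task(agent, llm, task: str, max_turns: int = 20):
    """Drive one task through the describe -> generate -> validate -> execute loop."""
    context = [agent.describe_namespace(), task]           # DESCRIBE (INJECT happened earlier)
    for _ in range(max_turns):
        code = llm.generate_code(context)                  # GENERATE
        violations = validate(code)                        # VALIDATE (see the AST sketch above)
        if violations:
            context.append(f"Policy error: {violations}")  # structured error enables self-correction
            continue
        result = agent.execute(code)                       # EXECUTE in the persistent kernel
        observation = shape_output(result.stdout)          # SHAPE
        context.append(observation)                        # FEEDBACK
        if result.task_complete:                           # REPEAT until the task is done
            return observation
    return None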
Key parameters
The paper uses model-specific temperature settings:
| Model | Temperature |
|---|---|
| DeepSeek-V3.2 | 0.2 |
| Qwen3-Coder | 0.2 |
| Kimi K2 | 0.6 |
| GPT-5.1 | 1.0 |
Output length limits and max turn counts are configurable; the paper’s stateful benchmarks use 20-turn scenarios.
Data structures
If you’re building a CaveAgent-style system, these are the core abstractions you’ll need:
interface InjectableVariable {
name: string;
value: any;
description: string;
type: string;
}
interface RuntimeState {
namespace: Map<string, any>;
history: ExecutionResult[];
}
interface ExecutionResult {
code: string;
stdout: string;
stderr: string;
success: boolean;
}
Migration from existing frameworks
You don’t need to rewrite your tools. If you’re using LangChain, AutoGen, or similar frameworks with JSON function calling, wrap your existing functions in Python classes and inject them:
# Existing LangChain tool
from langchain_core.tools import tool
@tool
def search_db(query: str) -> str:
    """Run a query against the database."""
    return db.execute(query)
# Wrap for CaveAgent
class DBTool:
    def search(self, q: str):
        return db.execute(q)
agent.inject("db_tool", DBTool())
The agent now calls db_tool.search("...") instead of a JSON schema. Your underlying implementation stays the same.
Common pitfalls
Based on the paper’s discussion and the nature of code-executing agents, watch for these issues:
1. Over-printing. Agents trained on CodeAct habits will print() everything. Use prompt guidance: “Access variables directly; only print summaries.”
2. Security gaps. AST validation catches obvious attacks but not all risks. Run in sandboxed containers. Never inject objects with sensitive methods.
3. State confusion. If the agent doesn’t realize a variable exists, it may re-create it. Keep variable descriptions updated after each turn.
4. Memory leaks. Long sessions accumulate variables. Implement periodic cleanup or let the agent explicitly delete unused objects; a minimal cleanup sketch follows.
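A minimal sketch of the cleanup idea from pitfall 4; the keep-list policy and the agent.namespace attribute are assumptions for illustration.

def cleanup_namespace(namespace: dict, keep: set) -> list:
    """Delete every variable not on the keep-list; return the names removed."""
    removed = [name for name in namespace if name not in keep and not name.startswith("_")]
    for name in removed:
        del namespace[name]
    return removed
# Example: after a long session, retain only what the next turn still needs.
# cleanup_namespace(agent.namespace, keep={"sales", "model"})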
Minimal example
The following snippet shows the basic injection-chat-retrieve pattern:
from caveagent import Agent
agent = Agent(model="deepseek-v3.2")
# Inject data
agent.inject(
"sales",
sales_df,
desc="Q1-Q4 sales, 50K rows"
)
# Chat
response = agent.chat(
"Average transaction by region?"
)
# Agent writes: result = sales.groupby...
# Retrieve native object
result = agent.retrieve("result")
Practical applications
The architecture suits specific use cases:
Data analysis pipelines
Alignment: Data Query and Analysis benchmarks (100% vs 80% baseline)
Load datasets once, run multiple analyses without re-serialization. The agent can build derived DataFrames, train models, and generate visualizations, all referencing persistent objects.
IoT and smart home control
Alignment: Multi-turn stateful benchmark (100% accuracy over 40 turns)
Device states persist in runtime. “Turn off all lights except the kitchen” doesn’t require re-querying every device state; the agent updates the persistent home_state object.
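To make that concrete, a toy sketch of the code the agent might generate against a persistent home_state object; the dictionary layout is an assumption for illustration.

# Injected once; persists across turns in the runtime.
home_state = {"kitchen_light": "on", "living_room_light": "on", "bedroom_light": "on"}
# "Turn off all lights except the kitchen" becomes one in-place update,
# with no re-serialization of device states into the prompt.
for device in home_state:
    if device.endswith("_light") and device != "kitchen_light":
        home_state[device] = "off"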
Financial calculations
Alignment: Financial Account scenario (100% accuracy on numerical precision)
Account balances, transaction histories, and calculations maintain exact precision. No floating-point errors from text serialization.
CaveAgent provides deterministic accuracy because calculations happen in Python (a math engine) rather than the LLM (a prediction engine). When your agent computes balance = income - expenses, the result is mathematically exact, not a statistical guess. For fintech and compliance-sensitive applications, this is the difference between “usually right” and “provably correct.”
Multi-agent coordination
Alignment: Runtime-Mediated Multi-Agent Coordination (paper Section E)
CaveAgent’s shared runtime enables coordination patterns that would require complex message-passing protocols in traditional multi-agent systems.
Figure: Multi-agent coordination via a shared runtime (town simulation); agents share state through persistent variables.
Multiple agents share a runtime namespace. Agent A updates shared_state["weather"] = "rainy"; Agent B sees it immediately without message passing.
Supply chain example: An Inventory Agent updates a shared stock_levels dictionary when shipments arrive. A Procurement Agent immediately sees the change and adjusts purchase orders, without a single chat message being sent. A Pricing Agent reads the same stock_levels to trigger dynamic pricing. State flows replace message passing, eliminating latency and ambiguity.
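A toy sketch of that pattern, with a plain dictionary standing in for the shared runtime namespace; the agent classes are hypothetical.

shared_state = {"stock_levels": {"widget": 120}}
class InventoryAgent:
    def receive_shipment(self, item: str, qty: int) -> None:
        levels = shared_state["stock_levels"]
        levels[item] = levels.get(item, 0) + qty
class ProcurementAgent:
    def needs_reorder(self, item: str, threshold: int = 100) -> bool:
        # Reads the same persistent object; no chat message was exchanged.
        return shared_state["stock_levels"].get(item, 0) < threshold
InventoryAgent().receive_shipment("widget", 30)
print(ProcurementAgent().needs_reorder("widget"))  # False: stock is now 150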
This enables:
- Supervisor agents injecting context into worker runtimes
- Swarm coordination through shared variables
- Implicit synchronization without explicit protocols
When to use CaveAgent
Use CaveAgent when:
- Tasks involve complex data structures. DataFrames, graphs, trained models. Anything that loses fidelity when serialized to text.
- Multi-turn state consistency matters. Shopping carts, device states, account balances. Where errors compound across turns.
- Token costs are significant. Long conversations with large data. The 28% token reduction compounds.
- You need object handoff. The output isn't text but a Python object for downstream processing (visualization, validation, further computation).
- Parallel tool calls are common. Code naturally expresses parallelism; JSON schemas require explicit parallel constructs.
Consider alternatives when:
- Tasks are single-turn. No state to persist. The overhead of maintaining a runtime isn't justified.
- Security is paramount. Code execution introduces attack surface. If you can't sandbox properly, JSON function calling is safer.
- The LLM struggles with code. Smaller models may generate buggy Python. CaveAgent amplifies coding ability; it does not compensate for its absence.
- Latency is critical. The runtime adds startup time. For real-time chat, the overhead may be noticeable.
Limitations
Code generation quality
CaveAgent is only as good as the LLM’s code generation. Models that hallucinate syntax or use wrong APIs will fail. The paper shows best results with DeepSeek-V3.2 and Qwen3-Coder, both strong code models.
CaveAgent’s architecture enables a powerful training signal: you can programmatically verify if an agent’s runtime state is correct without human labels. Did the DataFrame end up with the right values? Is the account balance accurate? This is the foundation for Reinforcement Learning with Verifiable Rewards (RLVR), the same technique behind reasoning models like DeepSeek-R1. CaveAgent could enable automated training pipelines where agents learn from execution outcomes, not human feedback.
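A sketch of what such a verifiable reward could look like: compare the final runtime state to a programmatic check, with no human label involved. The balance check is an assumed example, not the paper's benchmark.

import math
def reward_from_runtime(namespace: dict) -> float:
    """Return 1.0 if the agent left the runtime in the expected state, else 0.0."""
    balance = namespace.get("balance")
    expected = 12_500.75 - 3_199.50   # ground truth computed independently of the agent
    if isinstance(balance, float) and math.isclose(balance, expected):
        return 1.0
    return 0.0
# Rewards derived from execution outcomes could feed an RLVR-style training loop.
print(reward_from_runtime({"balance": 9301.25}))  # 1.0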
Sandbox complexity
Running arbitrary Python requires robust sandboxing. The AST validation catches common attacks, but determined adversaries may find bypasses. Production deployments need container isolation, resource limits, and network restrictions.
Cold start overhead
Each session initializes a Python kernel. For short interactions, this setup time may exceed the task itself. Consider kernel pooling for high-throughput applications.
Debugging opacity
When things go wrong, the dual-stream architecture complicates debugging. The LLM’s reasoning is in one stream; the actual execution state is in another. Logging and observability require capturing both.
Model compatibility
While framework-agnostic in principle, the prompting and code patterns are tuned for specific models. Adapting to new LLMs may require prompt engineering.
Paper: arXiv:2601.01569. Authors: Maohao Ran, Zhenglin Wan, Cooper Lin, et al. (HKUST, HKU, NUS, NTU).
Cite this paper
Maohao Ran, Zhenglin Wan, Cooper Lin, et al. (2026). CaveAgent: Transforming LLMs into Stateful Runtime Operators. arXiv 2026.