- The Problem. LLM agents generate reasoning steps greedily. Each decision optimizes locally without considering future consequences. Small early mistakes compound into completely wrong trajectories.
- The Solution. MAXS looks 4 steps ahead before committing to each action. It scores candidate steps using advantage (progress), step variance (consistency), and slope variance (trend smoothness). When trajectories converge, it stops early to save compute.
- The Results. 63.5% average accuracy vs 52.9% for CoT on five benchmarks. 100x fewer tokens than MCTS while achieving higher accuracy. Lookahead alone provides 5 points of the total gain.
Research Overview
Standard LLM agents are like junior developers who rush to code the first solution that pops into their head. They generate one reasoning step, then the next, then the next—never pausing to consider whether their initial approach will pan out. MAXS is the senior developer who sketches the architecture first, simulating where each decision leads before committing.
The issue with standard agents is greedy generation. The model picks whatever seems best right now without considering where that choice leads four steps down the road. A promising-looking step might dead-end. A slightly worse-looking step might open up better paths.
At each step, the model selects the highest-probability continuation without simulating future steps. This is fast but can miss globally better solutions that require accepting a locally suboptimal choice.
Think of a driver who follows GPS turn-by-turn without looking at the whole map. She turns into a narrow alley that later forces a U-turn. A planner who studies the entire layout might take a longer street now to avoid the dead-end later. The greedy driver saves seconds per turn but may end up circling the block.
Tree-based methods like Tree of Thought (ToT) and Monte Carlo Tree Search (MCTS) address this by exploring multiple paths. But they are expensive. MCTS can consume 100x more tokens than a single chain, making it impractical for production systems.
Tree of Thought (ToT): a search strategy that expands a tree of possible reasoning steps, evaluating many branches in parallel to find a high-quality answer. It offers better global planning than greedy generation but incurs high token and latency costs.
Monte Carlo Tree Search (MCTS): an algorithm that builds a search tree by repeatedly simulating random rollouts from candidate moves, then using the outcomes to bias future selections. In LLM agents it yields thorough exploration but can consume orders of magnitude more tokens than simpler methods.
How different strategies balance lookahead depth vs. computational cost
MAXS takes a middle path. It looks ahead a fixed number of steps (4 by default), scores the candidates using multiple signals, and commits to the best option. When the lookahead trajectories start converging to similar outcomes, it stops early. The result is globally-informed decisions without exhaustive search.
The Greedy Generation Problem
Consider a math problem that requires setting up equations, substituting values, and simplifying. A greedy agent might:
- Choose a substitution that simplifies one term but makes others harder
- Commit to an algebraic manipulation that closes off a cleaner path
- Realize three steps later that it is stuck
By then, the error has propagated. Backtracking means regenerating everything. Most production systems do not backtrack at all because the user is waiting.
Concrete example of failure vs. recovery:
- CoT approach: The agent calculates 44 × 12 mentally, gets 518 (wrong; the correct value is 528), then uses that incorrect number for 5 more steps. The final answer is completely wrong, but each step "followed logically" from the previous one.
- MAXS approach: The agent simulates the calculation, sees high variance (uncertainty) in the lookahead trajectories, and recognizes it is in unstable territory. It uses a Python tool instead: 44 * 12 = 528. The correct value propagates forward.
Small errors in early reasoning steps amplify through subsequent steps. A minor deviation at step 2 can lead to completely divergent reasoning by step 10, even if each individual step follows logically from its predecessor.
The paper identifies two core problems:
Locally myopic generation. The agent cannot see beyond its current step. It has no way to evaluate whether a choice leads to productive territory or a dead end.
Trajectory instability. Once committed to a path, small variations cascade. Two runs of the same problem might produce radically different (and differently wrong) answers.
MAXS Architecture
MAXS addresses both problems through three components: lookahead rollouts, multi-signal scoring, and trajectory convergence.
Per-step operations: rollout, value estimation, and trajectory convergence
Lookahead rollouts
At each decision point, MAXS generates candidate next steps. For each candidate, it simulates N additional steps (default N=4) to see where that choice leads.
Think of a mountaineer at a fork in a cliff-side trail. Before committing to a branch, she sends a small drone ahead that flies four meters down each possible path, returning with video of the terrain, loose rocks, and any hidden ledges. She then chooses the branch whose drone footage shows the smoothest, safest ascent. MAXS does the same with reasoning paths: it scouts ahead before committing.
The lookahead is not exhaustive tree search. MAXS samples a fixed number of trajectories per candidate, enough to estimate value without combinatorial explosion.
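To make this concrete, here is a minimal sketch of what a single lookahead rollout could look like. `model.generate_step`, `model.estimate_value`, `state.apply`, and `state.is_terminal` are hypothetical placeholders for whatever agent framework you use, not APIs from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    step_values: List[float]   # value estimate after each simulated step
    final_value: float         # value estimate at the end of the rollout


def rollout(model, state, depth, tools) -> Trajectory:
    """Simulate `depth` future steps from `state` without committing to any of them."""
    values = []
    for _ in range(depth):
        step = model.generate_step(state, tools)      # sample one continuation
        state = state.apply(step)                     # advance the simulated state
        values.append(model.estimate_value(state))    # model's own progress estimate
        if state.is_terminal():                       # answer reached before the horizon
            break
    final = values[-1] if values else model.estimate_value(state)
    return Trajectory(step_values=values, final_value=final)
```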
Value estimation with three signals
Each candidate step gets scored using three signals:
Advantage (R^adv). How much does this step improve the estimated value? Computed as the exponential of the value difference between current and previous positions.
Step variance (R^step). How consistent are the lookahead trajectories? High variance suggests the candidate leads to uncertain territory. Low variance means the outcomes are predictable.
Slope variance (R^slope). How smooth is the trend across lookahead steps? Jerky trajectories indicate unstable reasoning. Smooth trajectories suggest the agent is making steady progress.
The final score combines all three:
Score = Advantage + α × StepVariance + β × SlopeVariance
Where α and β weight the variance terms. The paper finds optimal values of α=0.3 and β=0.2.
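As a rough sketch of how these signals could be computed from the rollout sketch above: step variance looks at the spread of terminal values across rollouts, while slope variance looks at how erratically the value estimates move within each rollout. The per-step `step_values` list is an assumed bookkeeping detail, and this is one possible definition of the `compute_slope_variance` helper that the implementation blueprint below calls; the paper's exact formulation may differ.

```python
import numpy as np


def compute_slope_variance(trajectories) -> float:
    """Stability signal: how smoothly do value estimates move within each rollout?"""
    slopes = []
    for t in trajectories:
        slopes.extend(np.diff(t.step_values))   # step-to-step change in estimated value
    # Low variance of slopes = steady progress; exp(-var) maps it to (0, 1], higher is better
    return float(np.exp(-np.var(slopes))) if len(slopes) > 0 else 1.0


def combined_score(values, prev_value, trajectories, alpha=0.3, beta=0.2) -> float:
    """Score = Advantage + alpha * StepVariance + beta * SlopeVariance."""
    advantage = np.exp(np.mean(values) - prev_value)   # progress over the current state
    step_certainty = np.exp(-np.var(values))           # agreement across rollouts
    stability = compute_slope_variance(trajectories)   # smoothness within rollouts
    return float(advantage + alpha * step_certainty + beta * stability)
```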
Dynamic compute spending
MAXS automatically stops "thinking" when it's confident, unlike fixed-step chains that waste money on easy problems. If the lookahead trajectories for different candidates start producing similar outcomes, further exploration adds no value.
MAXS monitors variance and halts rollouts when consistency falls below a threshold δ. This early stopping saves significant compute on "easy" decisions where the right choice is clear. Hard problems get more thinking time; easy problems get resolved quickly.
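A minimal sketch of that stopping rule, reusing the terminal values from the rollouts above; the paper's exact convergence test may differ:

```python
import numpy as np


def rollouts_converged(terminal_values, delta=0.1) -> bool:
    """Stop sampling more rollouts once their terminal values agree closely (threshold δ)."""
    return len(terminal_values) >= 2 and float(np.var(terminal_values)) < delta


# Example: three lookahead rollouts that land on nearly the same value
values = [0.81, 0.83, 0.80]
if rollouts_converged(values, delta=0.1):
    print("Converged: commit to this candidate without further rollouts")
```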
Risk-Aware Scoring
The multi-signal approach prevents MAXS from being fooled by shallow progress. Think of it as a risk management layer for reasoning:
Optimal configuration: α = 0.3 (step variance weight), β = 0.2 (slope variance weight)
| Signal | Technical Name | Business Translation | What It Measures |
|---|---|---|---|
| Advantage | R^adv | Expected Profit | Is this step better than alternatives? |
| Step Variance | R^step | Certainty | Are we confident it works? |
| Slope Variance | R^slope | Stability | Is progress sustainable or erratic? |
Why "expected profit" alone is insufficient. A step might show high immediate reward but lead to uncertain outcomes. The model is making progress but into risky territory. Purely profit-based selection would chase volatile gains.
Why certainty matters. Low step variance indicates the agent has found a stable reasoning region. The lookahead trajectories agree on the outcome, suggesting robustness. High variance means the path is sensitive to small perturbations—a red flag for production systems.
Why stability matters. Even with high certainty at the endpoint, the trajectory might be oscillating rather than steadily improving. Slope variance captures trend smoothness. A steadily improving trajectory is preferable to one that jumps around unpredictably.
The combination produces more reliable selections. The paper shows that adding risk signals (α=0.3, β=0.2) improves accuracy by 8.3 points over profit-only scoring.
Benchmark Results
MAXS was evaluated on five reasoning benchmarks using MiMo-VL-7B and Qwen2.5-VL-7B as base models.
Accuracy (%) across five reasoning benchmarks
Main results
| Dataset | CoT | ToT | MCTS | MAXS |
|---|---|---|---|---|
| MathVista | 77.2 | 73.9 | 75.3 | 85.5 |
| OlympiadBench | 41.6 | 43.0 | 26.6 | 48.5 |
| EMMA | 33.3 | 39.3 | 29.0 | 46.7 |
| TheoremQA | 46.9 | 59.3 | 40.5 | 61.0 |
| MATH | 65.7 | 69.7 | 72.7 | 75.7 |
Results shown for MiMo-VL-7B. MAXS achieves the highest accuracy on all five benchmarks.
Key observations
MathVista sees the largest gains. MAXS improves from 77.2% (CoT) to 85.5%, an 8.3 point increase. MathVista requires visual reasoning combined with mathematical computation, exactly the type of multi-step problem where lookahead helps most.
MCTS underperforms despite exhaustive search. On OlympiadBench, MCTS achieves only 26.6% compared to MAXS at 48.5%. The exhaustive search strategy dilutes focus across too many branches, while MAXS concentrates compute on promising candidates.
Smaller model shows larger relative improvements. On Qwen2.5-VL-7B (not shown in table), MAXS improves average accuracy from 35.2% (CoT) to 42.9%, a 22% relative gain.
Inference Efficiency
Raw accuracy numbers hide a critical dimension: computational cost. MAXS achieves better results with far fewer tokens.
Inference-time scaling: accuracy vs. token consumption
Token consumption comparison
| Method | Total Tokens | Accuracy |
|---|---|---|
| CoT | 2.67 × 10⁷ | 52.9% |
| Guided Decoding | 1.67 × 10⁸ | 47.9% |
| φ-Decoding | 7.66 × 10⁸ | 54.5% |
| MAXS | 9.86 × 10⁸ | 63.5% |
| ToT | 6.40 × 10¹⁰ | 57.0% |
| MCTS | 9.91 × 10¹⁰ | 48.8% |
MAXS uses roughly the same tokens as φ-Decoding but achieves 9 points higher accuracy. Compared to MCTS, MAXS uses 100x fewer tokens while scoring 15 points higher.
In dollar terms: at any given API price, MCTS spends roughly 100x more per problem on tokens than MAXS. If MAXS costs a cent per problem, MCTS costs a dollar, and gets worse results. That's the difference between a viable production system and a research curiosity.
The scaling sweet spot
The scatter plot reveals MAXS occupying an optimal position: top-left quadrant where accuracy is high and token usage is moderate. Tree-based methods (ToT, MCTS) sit in the bottom-right, paying massive token costs for mediocre results.
Recent work shows that spending more compute at inference (via longer chains, search, or verification) can improve LLM performance. But the returns diminish quickly. MAXS demonstrates that targeted lookahead is more efficient than exhaustive exploration.
What Actually Drives Performance?
Which components of MAXS are worth building, and which can be skipped to save complexity? The researchers removed each component individually to measure its ROI.
Ablation study: impact of removing each MAXS component
Component contributions
| Configuration | MiMo-VL-7B | Qwen-7B |
|---|---|---|
| MAXS (Full) | 63.46% | 42.85% |
| w/o Lookahead | 58.50% | 33.41% |
| w/o Advantage | 60.96% | 36.94% |
| w/o Step Variance | 61.35% | 37.67% |
| w/o Slope Variance | 62.41% | 38.79% |
| w/o Trajectory Convergence | 63.03% | 41.60% |
Key findings
Lookahead is critical. Removing lookahead drops accuracy by 5 points on MiMo-VL and nearly 10 points on Qwen. This confirms that seeing future consequences of current decisions is the core mechanism.
Variance signals matter more for weaker models. Removing step variance hurts Qwen-7B by 5 points but MiMo-VL-7B by only 2 points. Weaker models benefit more from the stabilizing effect of variance-based selection.
Trajectory convergence has minimal accuracy impact. The accuracy drop from removing it is small (0.43 points for MiMo-VL), but this component reduces token usage by stopping early when further lookahead would not change the decision.
Lookahead depth analysis
Comparing 4-step vs 6-step lookahead depth
The paper examines whether deeper lookahead (6 steps instead of 4) helps. The answer: barely. Six-step lookahead achieves 85.8% vs 85.5% for 4-step on MathVista, while using 49.8% more tokens. The cost-benefit ratio favors the shallower lookahead.
Tool Strategy Guide
MAXS supports tool use (code execution, web search). The data reveals clear patterns for when to prioritize each tool.
Accuracy drop when removing each tool capability
Overall impact
| Configuration | Average Accuracy | Change (pts) |
|---|---|---|
| Both Tools | 63.46% | baseline |
| No Code | 60.81% | -2.65 |
| No Search | 56.36% | -7.10 |
| No Tools | 52.06% | -11.40 |
Web search contributes more to overall accuracy than code execution. Removing search drops performance by 7.1 points, while removing code drops it by only 2.65 points.
Task-specific tool importance
The overall numbers mask task-specific dependencies. On MathVista (a visual mathematical reasoning benchmark):
| Configuration | MathVista Accuracy | Change (pts) |
|---|---|---|
| Full MAXS | 85.5% | baseline |
| No Code | 70.8% | -14.7 |
| No Search | 81.4% | -4.1 |
Here the pattern reverses. Code execution is critical (14.7 point drop without it), while search contributes less (4.1 points). Mathematical problems require symbolic computation that code excels at.
To see why, consider two concrete examples:
- Fact-retrieval question: "What is the capital of Mongolia?" The optimal answer requires a factual lookup. When MAXS has web search enabled, it queries and receives "Ulaanbaatar." Disabling search forces the model to rely on memorized knowledge, which may be incomplete or outdated.
- Symbolic math problem: "Compute the determinant of [[2,-1,0],[0,3,-2],[1,0,4]]." Solving this efficiently requires exact arithmetic. With code execution, MAXS generates a call such as np.linalg.det(...) and runs it, obtaining the exact answer (see the snippet after this list). Without code, the model must perform the arithmetic mentally, often introducing rounding errors or algebraic mistakes.
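For illustration, the kind of snippet a code tool would run for the matrix above (plain NumPy; the exact tool-call format is not specified in the paper):

```python
import numpy as np

# Determinant of the 3x3 matrix from the example above
m = np.array([[2, -1,  0],
              [0,  3, -2],
              [1,  0,  4]])
print(np.linalg.det(m))  # 26.0 (up to floating-point error)
```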
Tool selection heuristic
| Task Type | Prioritize | Expected Gain | Example |
|---|---|---|---|
| Math/Logic/Symbolic | Code Execution | +15 points | Matrix operations, equation solving |
| Knowledge/Facts | Web Search | +7 points | Current events, entity lookups |
| Mixed reasoning | Both Tools | +11 points | Multi-step problems with fact retrieval |
Practical implication: Tool availability should match task type. For a math tutoring agent, code execution is non-negotiable. For a research assistant, web search is the priority. For general-purpose agents, enable both.
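One way to encode that heuristic in an agent is a simple routing function. This is purely illustrative; the task categories and tool names below are assumptions, not from the paper.

```python
def select_tools(task_type: str) -> list:
    """Map a coarse task category to the tools worth enabling."""
    routing = {
        "math": ["code_execution"],                 # symbolic/numeric work: code is non-negotiable
        "knowledge": ["web_search"],                # fact lookups: search is the priority
        "mixed": ["code_execution", "web_search"],  # multi-step problems with fact retrieval
    }
    return routing.get(task_type, ["code_execution", "web_search"])  # default: enable both


print(select_tools("math"))   # ['code_execution']
```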
Implementation Blueprint
Core algorithm
```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class MAXSConfig:
    lookahead_depth: int = 4
    num_candidates: int = 4
    num_rollouts: int = 3
    alpha: float = 0.3                 # step variance weight
    beta: float = 0.2                  # slope variance weight
    convergence_threshold: float = 0.1


def maxs_step(
    model: LLMAgent,
    state: ReasoningState,
    tools: List[Tool],
    config: MAXSConfig,
) -> Action:
    """Select next action using lookahead + multi-signal scoring."""
    # Generate candidate next steps
    candidates = model.generate_candidates(state, n=config.num_candidates)

    # Score each candidate via lookahead
    scored = []
    for candidate in candidates:
        # Simulate N future trajectories (can be batched for efficiency)
        trajectories = [
            rollout(model, state.apply(candidate), config.lookahead_depth, tools)
            for _ in range(config.num_rollouts)
        ]
        # Extract terminal values
        values = np.array([t.final_value for t in trajectories])

        # Three scoring signals
        advantage = np.exp(values.mean() - state.value)    # expected profit
        step_var = np.exp(-values.var())                   # certainty
        slope_var = compute_slope_variance(trajectories)   # stability

        score = advantage + config.alpha * step_var + config.beta * slope_var
        scored.append((candidate, score, values.var()))

        # Dynamic compute: stop expanding further candidates once this one's
        # rollouts converge below the threshold (a high-certainty choice)
        if values.var() < config.convergence_threshold:
            break

    # Return highest-scoring candidate
    return max(scored, key=lambda x: x[1])[0]
```

Production tip: The inner loop over candidates is embarrassingly parallel. Batch all lookahead rollouts into a single model call to cut latency by 3-4x.
Recommended hyperparameters
| Parameter | Value | Notes |
|---|---|---|
| Lookahead depth | 4 | Diminishing returns beyond 4 |
| Number of candidates | 3-5 | Trade-off between diversity and cost |
| Rollouts per candidate | 2-4 | Enough for variance estimation |
| α (step variance weight) | 0.3 | Paper-optimized value |
| β (slope variance weight) | 0.2 | Paper-optimized value |
| Convergence threshold δ | 0.1 | Lower = more aggressive early stopping |
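As a usage sketch, here is how the recommended values would wire into the `MAXSConfig` from the blueprint above, assuming you already have the `model`, `state`, and `tools` objects that `maxs_step` expects:

```python
config = MAXSConfig(
    lookahead_depth=4,           # diminishing returns beyond 4
    num_candidates=4,            # within the 3-5 range above
    num_rollouts=3,              # enough for variance estimation
    alpha=0.3,                   # step variance weight
    beta=0.2,                    # slope variance weight
    convergence_threshold=0.1,   # lower = more aggressive early stopping
)
action = maxs_step(model, state, tools, config)
```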
Integration considerations
Tool execution during lookahead. The paper executes tools during lookahead rollouts. This adds cost but provides accurate value estimates. For cost-sensitive deployments, consider tool-free lookahead with tool execution only on committed steps.
Model selection. MAXS shows larger relative gains on smaller models. If you are using a 7B model, MAXS is likely worthwhile. For frontier models with strong base reasoning, the gains may be smaller.
Batching. Lookahead rollouts can be batched for efficiency. Generate all candidates, then batch their lookahead trajectories through the model in parallel.
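A sketch of that batching pattern, assuming a hypothetical `model.generate_batch` entry point that rolls out many simulated states in one call:

```python
def batched_lookahead(model, state, candidates, config, tools):
    """Score all candidates' rollouts in one batched pass instead of a nested loop."""
    # One simulated start state per (candidate, rollout) pair
    starts = [state.apply(c) for c in candidates for _ in range(config.num_rollouts)]
    # A single batched call replaces num_candidates * num_rollouts sequential calls
    trajectories = model.generate_batch(starts, depth=config.lookahead_depth, tools=tools)
    # Regroup rollouts per candidate for scoring
    k = config.num_rollouts
    return [(c, trajectories[i * k:(i + 1) * k]) for i, c in enumerate(candidates)]
```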
Limitations
Latency cost
While 100x cheaper than MCTS in tokens, MAXS still adds significant latency. Looking ahead 4 steps means 4-10x longer response times compared to single-stream generation. Each decision point requires multiple forward passes for lookahead simulation.
Where MAXS shines: Offline batch processing, complex queries where accuracy matters more than speed, high-stakes decisions (legal, medical, financial analysis).
Where MAXS struggles: Real-time chat, interactive applications where users expect sub-second responses, simple queries that don't need lookahead.
Value estimation quality
The value estimation depends on the model's ability to judge its own progress. If the base model cannot reliably estimate solution quality, the lookahead signals become noisy.
Tool-dependent gains
Most of the impressive MathVista results come with code execution enabled. Without tools, MAXS still outperforms baselines but by smaller margins. The approach works best when appropriate tools are available.
Limited evaluation scope
The benchmarks focus on mathematical and logical reasoning. Performance on open-ended generation, creative tasks, or dialogue is unexplored.
Code: GitHub
Authors: Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, Li Yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu
Cite this paper
Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, Li Yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu (2026). MAXS: The 'Measure Twice, Cut Once' Agent Architecture. arXiv 2026.