arXiv 2026 · January 14, 2026

MAXS: The 'Measure Twice, Cut Once' Agent Architecture

Jian Zhang et al.

MAXS introduces meta-adaptive exploration for LLM agents. Instead of greedy step-by-step generation, MAXS looks ahead 4 steps, scores candidates using advantage plus variance signals, and stops early when trajectories converge. The approach achieves better accuracy than exhaustive search methods while using orders of magnitude fewer tokens.

Categories: AI Agents, Reasoning, Inference Optimization

Key Findings

  1. 4-step lookahead adds 5 points: Removing lookahead drops MiMo-VL-7B from 63.46% to 58.50% average accuracy across five benchmarks.

  2. 100x more efficient than MCTS: MAXS achieves 63.46% accuracy with 9.86 × 10⁸ tokens, while MCTS uses 9.91 × 10¹⁰ tokens for only 48.8%.

  3. Variance weights matter: Adding step variance (α=0.3) and slope variance (β=0.2) to advantage scoring improves accuracy by 8.3 points.

  4. Tools are task-dependent: Web search adds 7.1 points of overall accuracy, but code execution adds 14.7 points on MathVista specifically.

  5. Early stopping saves compute: Trajectory convergence reduces token usage by halting when path consistency stabilizes.

TL;DR
  1. The Problem. LLM agents generate reasoning steps greedily. Each decision optimizes locally without considering future consequences. Small early mistakes compound into completely wrong trajectories.

  2. The Solution. MAXS looks 4 steps ahead before committing to each action. It scores candidate steps using advantage (progress), step variance (consistency), and slope variance (trend smoothness). When trajectories converge, it stops early to save compute.

  3. The Results. 63.5% average accuracy vs 52.9% for CoT on five benchmarks. 100x fewer tokens than MCTS while achieving higher accuracy. Lookahead alone provides 5 points of the total gain.

Research Overview

Standard LLM agents are like junior developers who rush to code the first solution that pops into their head. They generate one reasoning step, then the next, then the next—never pausing to consider whether their initial approach will pan out. MAXS is the senior developer who sketches the architecture first, simulating where each decision leads before committing.

The issue with standard agents is greedy generation. The model picks whatever seems best right now without considering where that choice leads four steps down the road. A promising-looking step might dead-end. A slightly worse-looking step might open up better paths.

Greedy Generation

At each step, the model selects the highest-probability continuation without simulating future steps. This is fast but can miss globally better solutions that require accepting a locally suboptimal choice.

Think of a driver who follows GPS turn-by-turn without looking at the whole map. She turns into a narrow alley that later forces a U-turn. A planner who studies the entire layout might take a longer street now to avoid the dead-end later. The greedy driver saves seconds per turn but may end up circling the block.

Tree-based methods like Tree of Thought (ToT) and Monte Carlo Tree Search (MCTS) address this by exploring multiple paths. But they are expensive. MCTS can consume 100x more tokens than a single chain, making it impractical for production systems.

Tree of Thought (ToT)

A search strategy that expands a tree of possible reasoning steps, evaluating many branches in parallel to find a high-quality answer. It offers better global planning than greedy generation but incurs high token and latency costs.

Monte Carlo Tree Search (MCTS)

An algorithm that builds a search tree by repeatedly simulating random rollouts from candidate moves, then using the outcomes to bias future selections. In LLM agents it yields thorough exploration but can consume orders of magnitude more tokens than simpler methods.

From Greedy to Meta-Adaptive Reasoning

How different strategies balance lookahead depth vs. computational cost

Strategy Comparison (Source: Figure 2)

MAXS takes a middle path. It looks ahead a fixed number of steps (4 by default), scores the candidates using multiple signals, and commits to the best option. When the lookahead trajectories start converging to similar outcomes, it stops early. The result is globally-informed decisions without exhaustive search.

The Greedy Generation Problem

Consider a math problem that requires setting up equations, substituting values, and simplifying. A greedy agent might:

  1. Choose a substitution that simplifies one term but makes others harder
  2. Commit to an algebraic manipulation that closes off a cleaner path
  3. Realize three steps later that it is stuck

By then, the error has propagated. Backtracking means regenerating everything. Most production systems do not backtrack at all because the user is waiting.

Concrete example of failure vs. recovery:

  • CoT approach: The agent calculates 44 × 12 mentally, gets 518 (wrong—it's 528), then uses that incorrect number for 5 more steps. The final answer is completely wrong, but each step "followed logically" from the previous one.

  • MAXS approach: The agent simulates the calculation, sees high variance (uncertainty) in the lookahead trajectories, and recognizes it's in unstable territory. It chooses to use a Python tool instead: 44 * 12 = 528. The correct value propagates forward.

Trajectory Instability

Small errors in early reasoning steps amplify through subsequent steps. A minor deviation at step 2 can lead to completely divergent reasoning by step 10, even if each individual step follows logically from its predecessor.

The paper identifies two core problems:

Locally myopic generation. The agent cannot see beyond its current step. It has no way to evaluate whether a choice leads to productive territory or a dead end.

Trajectory instability. Once committed to a path, small variations cascade. Two runs of the same problem might produce radically different (and differently wrong) answers.

MAXS Architecture

MAXS addresses both problems through three components: lookahead rollouts, multi-signal scoring, and trajectory convergence.

MAXS Framework: Lookahead + Multi-Signal Scoring

Per-step operations: rollout, value estimation, and trajectory convergence

Architecture Diagram (Source: Figure 3)

Lookahead rollouts

At each decision point, MAXS generates candidate next steps. For each candidate, it simulates N additional steps (default N=4) to see where that choice leads.

Think of a mountaineer at a fork in a cliff-side trail. Before committing to a branch, she sends a small drone ahead that flies four meters down each possible path, returning with video of the terrain, loose rocks, and any hidden ledges. She then chooses the branch whose drone footage shows the smoothest, safest ascent. MAXS does the same with reasoning paths: it scouts ahead before committing.

The lookahead is not exhaustive tree search. MAXS samples a fixed number of trajectories per candidate, enough to estimate value without combinatorial explosion.
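As a rough sketch of what a single rollout could look like in code (the generate_step, apply, and estimate_value methods are assumed agent interfaces, not the paper's actual API):

from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    step_values: List[float]   # value estimate after each simulated step
    final_value: float         # value estimate at the lookahead horizon

def rollout(model, state, depth: int, tools) -> Trajectory:
    """Simulate `depth` future reasoning steps from `state`, recording the
    model's own value estimate after each one."""
    step_values = []
    for _ in range(depth):
        step = model.generate_step(state, tools)          # sample one continuation
        state = state.apply(step)
        step_values.append(model.estimate_value(state))   # self-judged progress
    return Trajectory(step_values=step_values, final_value=step_values[-1])

The implementation blueprint later in this article reuses this rollout signature when scoring candidates.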

Value estimation with three signals

Each candidate step gets scored using three signals:

Advantage (R^adv). How much does this step improve the estimated value? Computed as the exponential of the value difference between current and previous positions.

Step variance (R^step). How consistent are the lookahead trajectories? High variance suggests the candidate leads to uncertain territory. Low variance means the outcomes are predictable.

Slope variance (R^slope). How smooth is the trend across lookahead steps? Jerky trajectories indicate unstable reasoning. Smooth trajectories suggest the agent is making steady progress.

The final score combines all three:

Score = Advantage + α × StepVariance + β × SlopeVariance

Where α and β weight the variance terms. The paper finds optimal values of α=0.3 and β=0.2.
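As a toy numerical illustration (the signal values below are made up for this article, not taken from the paper), the variance terms can flip the ranking of two candidates:

alpha, beta = 0.3, 0.2

# Hypothetical signal values for two candidate steps (illustrative only)
cand_a = {"advantage": 1.10, "certainty": 0.90, "stability": 0.85}
cand_b = {"advantage": 1.25, "certainty": 0.40, "stability": 0.30}

def combined_score(c):
    return c["advantage"] + alpha * c["certainty"] + beta * c["stability"]

print(combined_score(cand_a))  # 1.10 + 0.27 + 0.17 = 1.54 -> selected
print(combined_score(cand_b))  # 1.25 + 0.12 + 0.06 = 1.43 -> higher raw advantage, but riskier

Candidate B looks better on advantage alone, but its volatile lookahead trajectories cost it the selection.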

Dynamic compute spending

MAXS automatically stops "thinking" when it's confident, unlike fixed-step chains that waste money on easy problems. If the lookahead trajectories for different candidates start producing similar outcomes, further exploration adds no value.

MAXS monitors variance and halts rollouts when consistency falls below a threshold δ. This early stopping saves significant compute on "easy" decisions where the right choice is clear. Hard problems get more thinking time; easy problems get resolved quickly.
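A minimal sketch of that convergence check, assuming lookahead values arrive one rollout at a time (the function name and interface are illustrative, not from the paper):

import numpy as np

def should_stop_rollouts(values_so_far: list, delta: float = 0.1) -> bool:
    """Halt further lookahead once the sampled trajectory values already agree
    to within the convergence threshold delta."""
    if len(values_so_far) < 2:
        return False   # need at least two rollouts to judge agreement
    return float(np.var(values_so_far)) < delta

Called after each additional rollout, a check like this lets easy decisions resolve after two simulations while genuinely ambiguous ones use the full rollout budget.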

Risk-Aware Scoring

The multi-signal approach prevents MAXS from being fooled by shallow progress. Think of it as a risk management layer for reasoning:

Variance Weights Add +8.3% Accuracy

Optimal configuration: alpha=0.3 (step variance), beta=0.2 (slope variance)

Comparison Chart (Source: Figure 7)
Signal | Technical Name | Business Translation | What It Measures
Advantage | R^adv | Expected Profit | Is this step better than alternatives?
Step Variance | R^step | Certainty | Are we confident it works?
Slope Variance | R^slope | Stability | Is progress sustainable or erratic?

Why "expected profit" alone is insufficient. A step might show high immediate reward but lead to uncertain outcomes. The model is making progress but into risky territory. Purely profit-based selection would chase volatile gains.

Why certainty matters. Low step variance indicates the agent has found a stable reasoning region. The lookahead trajectories agree on the outcome, suggesting robustness. High variance means the path is sensitive to small perturbations—a red flag for production systems.

Why stability matters. Even with high certainty at the endpoint, the trajectory might be oscillating rather than steadily improving. Slope variance captures trend smoothness. A steadily improving trajectory is preferable to one that jumps around unpredictably.

The combination produces more reliable selections. The paper shows that adding risk signals (α=0.3, β=0.2) improves accuracy by 8.3 points over profit-only scoring.

Benchmark Results

MAXS was evaluated on five reasoning benchmarks using MiMo-VL-7B and Qwen2.5-VL-7B as base models.

MAXS Outperforms All Baselines

Accuracy (%) across five reasoning benchmarks

Grouped Bar Chart (Source: Table 1)

Main results

Dataset | CoT | ToT | MCTS | MAXS
MathVista | 77.2 | 73.9 | 75.3 | 85.5
OlympiadBench | 41.6 | 43.0 | 26.6 | 48.5
EMMA | 33.3 | 39.3 | 29.0 | 46.7
TheoremQA | 46.9 | 59.3 | 40.5 | 61.0
MATH | 65.7 | 69.7 | 72.7 | 75.7

Results shown for MiMo-VL-7B. MAXS achieves the highest accuracy on all five benchmarks.

Key observations

MathVista sees the largest gains. MAXS improves from 77.2% (CoT) to 85.5%, an 8.3 point increase. MathVista requires visual reasoning combined with mathematical computation, exactly the type of multi-step problem where lookahead helps most.

MCTS underperforms despite exhaustive search. On OlympiadBench, MCTS achieves only 26.6% compared to MAXS at 48.5%. The exhaustive search strategy dilutes focus across too many branches, while MAXS concentrates compute on promising candidates.

Smaller model shows larger relative improvements. On Qwen2.5-VL-7B (not shown in table), MAXS improves average accuracy from 35.2% (CoT) to 42.9%, a 22% relative gain.

Inference Efficiency

Raw accuracy numbers hide a critical dimension: computational cost. MAXS achieves better results with far fewer tokens.

MAXS: Best Accuracy-to-Cost Ratio

Inference-time scaling: accuracy vs. token consumption

Scatter Plot (Source: Figure 4)

Token consumption comparison

Method | Total Tokens | Accuracy
CoT | 2.67 × 10⁷ | 52.9%
Guided Decoding | 1.67 × 10⁸ | 47.9%
φ-Decoding | 7.66 × 10⁸ | 54.5%
MAXS | 9.86 × 10⁸ | 63.5%
ToT | 6.40 × 10¹⁰ | 57.0%
MCTS | 9.91 × 10¹⁰ | 48.8%

MAXS uses roughly the same tokens as φ-Decoding but achieves 9 points higher accuracy. Compared to MCTS, MAXS uses 100x fewer tokens while scoring 15 points higher.

In dollar terms: At typical API pricing (~$0.01 per 1K tokens), MCTS costs roughly $1.00 per problem in tokens. MAXS costs about $0.01 per problem—for better results. That's the difference between a viable production system and a research curiosity.

The scaling sweet spot

The scatter plot reveals MAXS occupying an optimal position: top-left quadrant where accuracy is high and token usage is moderate. Tree-based methods (ToT, MCTS) sit in the bottom-right, paying massive token costs for mediocre results.

Inference-Time Scaling

Recent work shows that spending more compute at inference (via longer chains, search, or verification) can improve LLM performance. But the returns diminish quickly. MAXS demonstrates that targeted lookahead is more efficient than exhaustive exploration.

What Actually Drives Performance?

Which components of MAXS are worth building, and which can be skipped to save complexity? The researchers removed each component individually to measure its ROI.

Lookahead Drives the Biggest Gains

Ablation study: impact of removing each MAXS component

Ablation Bar Chart (Source: Table 3)

Component contributions

Configuration | MiMo-VL-7B | Qwen-7B
MAXS (Full) | 63.46% | 42.85%
w/o Lookahead | 58.50% | 33.41%
w/o Advantage | 60.96% | 36.94%
w/o Step Variance | 61.35% | 37.67%
w/o Slope Variance | 62.41% | 38.79%
w/o Trajectory Convergence | 63.03% | 41.60%

Key findings

Lookahead is critical. Removing lookahead drops accuracy by 5 points on MiMo-VL and nearly 10 points on Qwen. This confirms that seeing future consequences of current decisions is the core mechanism.

Variance signals matter more for weaker models. Removing step variance hurts Qwen-7B by 5 points but MiMo-VL-7B by only 2 points. Weaker models benefit more from the stabilizing effect of variance-based selection.

Trajectory convergence has minimal accuracy impact. The accuracy drop is small (0.43 points for MiMo-VL), but this component reduces token usage by stopping early when further lookahead would not change the decision.

Lookahead depth analysis

4-Step Lookahead: Best Cost-Accuracy Balance

Comparing 4-step vs 6-step lookahead depth

Comparison Chart (Source: Figure 5)

The paper examines whether deeper lookahead (6 steps instead of 4) helps. The answer: barely. Six-step lookahead achieves 85.8% versus 85.5% for 4-step on MathVista, but uses 49.8% more tokens. The cost-benefit ratio favors the shallower setting.

Tool Strategy Guide

MAXS supports tool use (code execution, web search). The data reveals clear patterns for when to prioritize each tool.

Web Search Matters More Than Code

Accuracy drop when removing each tool capability

Horizontal Bar Chart (Source: Figure 6)

Overall impact

Configuration | Average Accuracy | Change (points)
Both Tools | 63.46% | baseline
No Code | 60.81% | -2.65
No Search | 56.36% | -7.10
No Tools | 52.06% | -11.40

Web search contributes more to overall accuracy than code execution. Removing search drops performance by 7.1 points, while removing code drops it by only 2.65 points.

Task-specific tool importance

The overall numbers mask task-specific dependencies. On MathVista (a mathematical reasoning benchmark):

Configuration | MathVista Accuracy | Change (points)
Full MAXS | 85.5% | baseline
No Code | 70.8% | -14.7
No Search | 81.4% | -4.1

Here the pattern reverses. Code execution is critical (14.7 point drop without it), while search contributes less (4.1 points). Mathematical problems require symbolic computation that code excels at.

To see why, consider two concrete examples:

  • Fact-retrieval question: "What is the capital of Mongolia?" The optimal answer requires a factual lookup. When MAXS has web search enabled, it queries and receives "Ulaanbaatar." Disabling search forces the model to rely on memorized knowledge, which may be incomplete or outdated.

  • Symbolic math problem: "Compute the determinant of [[2,-1,0],[0,3,-2],[1,0,4]]" Solving this efficiently requires exact arithmetic. With code execution, MAXS generates np.linalg.det(...) and runs it, obtaining the exact answer. Without code, the model must perform the arithmetic mentally, often yielding rounding errors or algebraic mistakes.
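For the second example, here is a minimal sketch of the kind of snippet a code-execution tool would run (illustrative, not the paper's actual tool trace):

import numpy as np

A = np.array([[2, -1, 0],
              [0, 3, -2],
              [1, 0, 4]])

# Floating-point LU determinant, rounded back to the exact integer value
det = round(float(np.linalg.det(A)))
print(det)  # 26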

Tool selection heuristic

Task Type | Prioritize | Expected Gain | Example
Math/Logic/Symbolic | Code Execution | +15 points | Matrix operations, equation solving
Knowledge/Facts | Web Search | +7 points | Current events, entity lookups
Mixed reasoning | Both Tools | +11 points | Multi-step problems with fact retrieval

Practical implication: Tool availability should match task type. For a math tutoring agent, code execution is non-negotiable. For a research assistant, web search is the priority. For general-purpose agents, enable both.
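A hedged sketch of how that matching might look in configuration code (the task labels and tool names are illustrative, not from the paper):

def select_tools(task_type: str) -> list:
    """Map a coarse task type to the tools worth enabling, per the table above."""
    routing = {
        "math": ["code_execution"],                 # symbolic/numeric work
        "knowledge": ["web_search"],                # fact retrieval
        "mixed": ["code_execution", "web_search"],  # multi-step problems
    }
    return routing.get(task_type, ["code_execution", "web_search"])  # default: enable both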

Implementation Blueprint

Core algorithm

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MAXSConfig:
    lookahead_depth: int = 4
    num_candidates: int = 4
    num_rollouts: int = 3
    alpha: float = 0.3  # step variance weight
    beta: float = 0.2   # slope variance weight
    convergence_threshold: float = 0.1

def compute_slope_variance(trajectories) -> float:
    """One plausible implementation of the R^slope stability signal:
    variance of per-step value changes across the lookahead trajectories,
    mapped through exp(-x) so smoother trends score closer to 1."""
    slopes = np.concatenate([np.diff(t.step_values) for t in trajectories])
    return float(np.exp(-slopes.var()))

def maxs_step(
    model,   # assumed agent interface: proposes steps and estimates values
    state,   # assumed reasoning state with .apply() and .value
    tools,
    config: MAXSConfig
):
    """Select the next action using lookahead rollouts + multi-signal scoring.

    Relies on the `rollout` helper sketched earlier: each trajectory exposes
    per-step value estimates (`step_values`) and a terminal `final_value`.
    """
    # Generate candidate next steps
    candidates = model.generate_candidates(state, n=config.num_candidates)

    # Score each candidate via lookahead
    scored = []
    for candidate in candidates:
        # Simulate several future trajectories from this candidate
        # (can be batched for efficiency; see the production tip below)
        trajectories = [
            rollout(model, state.apply(candidate), config.lookahead_depth, tools)
            for _ in range(config.num_rollouts)
        ]

        # Terminal value estimate of each trajectory
        values = np.array([t.final_value for t in trajectories])

        # Three scoring signals
        advantage = np.exp(values.mean() - state.value)   # R^adv: expected profit
        step_var = np.exp(-values.var())                  # R^step: certainty
        slope_var = compute_slope_variance(trajectories)  # R^slope: stability

        score = advantage + config.alpha * step_var + config.beta * slope_var
        scored.append((candidate, score, values.var()))

        # Trajectory convergence: stop expanding once rollouts already agree
        if values.var() < config.convergence_threshold:
            break

    # Commit to the highest-scoring candidate
    return max(scored, key=lambda x: x[1])[0]

Production tip: The inner loop over candidates is embarrassingly parallel. Batch all lookahead rollouts into a single model call to cut latency by 3-4x.
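A minimal sketch of that batching pattern, assuming the serving stack exposes a batched generation call (batch_generate and to_prompt are hypothetical methods, not a specific library API):

def batched_lookahead(model, state, candidates, config: MAXSConfig):
    """Issue all (candidate, rollout) lookaheads as one batched call instead of
    num_candidates * num_rollouts sequential requests."""
    prompts = [
        state.apply(candidate).to_prompt(depth=config.lookahead_depth)
        for candidate in candidates
        for _ in range(config.num_rollouts)
    ]
    trajectories = model.batch_generate(prompts)  # one batched forward pass

    # Regroup the flat result list back into per-candidate trajectory lists
    k = config.num_rollouts
    return [trajectories[i * k:(i + 1) * k] for i in range(len(candidates))]

Per-candidate scoring then proceeds exactly as in maxs_step above.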

Parameter | Value | Notes
Lookahead depth | 4 | Diminishing returns beyond 4
Number of candidates | 3-5 | Trade-off between diversity and cost
Rollouts per candidate | 2-4 | Enough for variance estimation
α (step variance weight) | 0.3 | Paper-optimized value
β (slope variance weight) | 0.2 | Paper-optimized value
Convergence threshold δ | 0.1 | Lower = more aggressive early stopping

Integration considerations

Tool execution during lookahead. The paper executes tools during lookahead rollouts. This adds cost but provides accurate value estimates. For cost-sensitive deployments, consider tool-free lookahead with tool execution only on committed steps.

Model selection. MAXS shows larger relative gains on smaller models. If you are using a 7B model, MAXS is likely worthwhile. For frontier models with strong base reasoning, the gains may be smaller.

Batching. Lookahead rollouts can be batched for efficiency. Generate all candidates, then batch their lookahead trajectories through the model in parallel.

Limitations

Latency cost

While 100x cheaper than MCTS in tokens, MAXS still adds significant latency. Looking ahead 4 steps means 4-10x longer response times compared to single-stream generation. Each decision point requires multiple forward passes for lookahead simulation.

Where MAXS shines: Offline batch processing, complex queries where accuracy matters more than speed, high-stakes decisions (legal, medical, financial analysis).

Where MAXS struggles: Real-time chat, interactive applications where users expect sub-second responses, simple queries that don't need lookahead.

Value estimation quality

The value estimation depends on the model's ability to judge its own progress. If the base model cannot reliably estimate solution quality, the lookahead signals become noisy.

Tool-dependent gains

Most of the impressive MathVista results come with code execution enabled. Without tools, MAXS still outperforms baselines but by smaller margins. The approach works best when appropriate tools are available.

Limited evaluation scope

The benchmarks focus on mathematical and logical reasoning. Performance on open-ended generation, creative tasks, or dialogue is unexplored.


Original paper: arXiv | PDF

Code: GitHub

Authors: Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, Li Yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu


Cite this paper

Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, Li Yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu (2026). MAXS: The 'Measure Twice, Cut Once' Agent Architecture. arXiv 2026.
