- The Problem. LLM agents generate reasoning steps greedily. Each decision optimizes locally without considering future consequences. Small early mistakes compound into completely wrong trajectories.
- The Solution. MAXS looks 4 steps ahead before committing to each action. It scores candidate steps using advantage (progress), step variance (consistency), and slope variance (trend smoothness). When trajectories converge, it stops early to save compute.
- The Results. 63.5% average accuracy vs 52.9% for CoT on five benchmarks. 100x fewer tokens than MCTS while achieving higher accuracy. Lookahead alone provides 5 points of the total gain.
Research Overview
Standard LLM agents are like junior developers who rush to code the first solution that pops into their head. They generate one reasoning step, then the next, then the next—never pausing to consider whether their initial approach will pan out. MAXS is the senior developer who sketches the architecture first, simulating where each decision leads before committing.
The issue with standard agents is greedy generation. The model picks whatever seems best right now without considering where that choice leads four steps down the road. A promising-looking step might dead-end. A slightly worse-looking step might open up better paths.
At each step, the model selects the highest-probability continuation without simulating future steps. This is fast but can miss globally better solutions that require accepting a locally suboptimal choice.
Think of a driver who follows GPS turn-by-turn without looking at the whole map. She turns into a narrow alley that later forces a U-turn. A planner who studies the entire layout might take a longer street now to avoid the dead-end later. The greedy driver saves seconds per turn but may end up circling the block.
Tree-based methods like Tree of Thought (ToT) and Monte Carlo Tree Search (MCTS) address this by exploring multiple paths. But they are expensive. MCTS can consume 100x more tokens than a single chain, making it impractical for production systems.
Tree of Thought (ToT): a search strategy that expands a tree of possible reasoning steps, evaluating many branches in parallel to find a high-quality answer. It offers better global planning than greedy generation but incurs high token and latency costs.
Monte Carlo Tree Search (MCTS): an algorithm that builds a search tree by repeatedly simulating random rollouts from candidate moves, then using the outcomes to bias future selections. In LLM agents it yields thorough exploration but can consume orders of magnitude more tokens than simpler methods.
How different strategies balance lookahead depth vs. computational cost
MAXS takes a middle path. It looks ahead a fixed number of steps (4 by default), scores the candidates using multiple signals, and commits to the best option. When the lookahead trajectories start converging to similar outcomes, it stops early. The result is globally-informed decisions without exhaustive search.
The Greedy Generation Problem
Consider a math problem that requires setting up equations, substituting values, and simplifying. A greedy agent might:
- Choose a substitution that simplifies one term but makes others harder
- Commit to an algebraic manipulation that closes off a cleaner path
- Realize three steps later that it is stuck
By then, the error has propagated. Backtracking means regenerating everything. Most production systems do not backtrack at all because the user is waiting.
Concrete example of failure vs. recovery:
- CoT approach: The agent calculates 44 × 12 mentally, gets 518 (wrong; the correct value is 528), then uses that incorrect number for 5 more steps. The final answer is completely wrong, but each step "followed logically" from the previous one.
- MAXS approach: The agent simulates the calculation, sees high variance (uncertainty) in the lookahead trajectories, and recognizes it is in unstable territory. It uses a Python tool instead: 44 * 12 = 528. The correct value propagates forward.
Small errors in early reasoning steps amplify through subsequent steps. A minor deviation at step 2 can lead to completely divergent reasoning by step 10, even if each individual step follows logically from its predecessor.
The paper identifies two core problems:
Locally myopic generation. The agent cannot see beyond its current step. It has no way to evaluate whether a choice leads to productive territory or a dead end.
Trajectory instability. Once committed to a path, small variations cascade. Two runs of the same problem might produce radically different (and differently wrong) answers.
MAXS Architecture
MAXS addresses both problems through three components: lookahead rollouts, multi-signal scoring, and trajectory convergence.
Per-step operations: rollout, value estimation, and trajectory convergence
Lookahead rollouts
At each decision point, MAXS generates candidate next steps. For each candidate, it simulates N additional steps (default N=4) to see where that choice leads.
Think of a mountaineer at a fork in a cliff-side trail. Before committing to a branch, she sends a small drone ahead that flies four meters down each possible path, returning with video of the terrain, loose rocks, and any hidden ledges. She then chooses the branch whose drone footage shows the smoothest, safest ascent. MAXS does the same with reasoning paths: it scouts ahead before committing.
The lookahead is not exhaustive tree search. MAXS samples a fixed number of trajectories per candidate, enough to estimate value without combinatorial explosion.
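To make this concrete, here is a minimal sketch of what a single lookahead rollout could look like. `model.generate_step`, `model.estimate_value`, `state.apply`, and `state.is_terminal` are hypothetical placeholders for whatever agent framework you use, not APIs from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    step_values: List[float]   # value estimate after each simulated step
    final_value: float         # value estimate at the end of the rollout


def rollout(model, state, depth, tools) -> Trajectory:
    """Simulate `depth` future steps from `state` without committing to any of them."""
    values = []
    for _ in range(depth):
        step = model.generate_step(state, tools)      # sample one continuation
        state = state.apply(step)                     # advance the simulated state
        values.append(model.estimate_value(state))    # model's own progress estimate
        if state.is_terminal():                       # answer reached before the horizon
            break
    final = values[-1] if values else model.estimate_value(state)
    return Trajectory(step_values=values, final_value=final)
```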
Value estimation with three signals
Each candidate step gets scored using three signals:
Advantage (R^adv). How much does this step improve the estimated value? Computed as the exponential of the value difference between current and previous positions.
Step variance (R^step). How consistent are the lookahead trajectories? High variance suggests the candidate leads to uncertain territory. Low variance means the outcomes are predictable.
Slope variance (R^slope). How smooth is the trend across lookahead steps? Jerky trajectories indicate unstable reasoning. Smooth trajectories suggest the agent is making steady progress.
The final score combines all three:
Score = Advantage + α × StepVariance + β × SlopeVariance
Where α and β weight the variance terms. The paper finds optimal values of α=0.3 and β=0.2.
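As a rough sketch of how these signals could be computed from the rollout sketch above: step variance looks at the spread of terminal values across rollouts, while slope variance looks at how erratically the value estimates move within each rollout. The per-step `step_values` list is an assumed bookkeeping detail, and this is one possible definition of the `compute_slope_variance` helper that the implementation blueprint below calls; the paper's exact formulation may differ.

```python
import numpy as np


def compute_slope_variance(trajectories) -> float:
    """Stability signal: how smoothly do value estimates move within each rollout?"""
    slopes = []
    for t in trajectories:
        slopes.extend(np.diff(t.step_values))   # step-to-step change in estimated value
    # Low variance of slopes = steady progress; exp(-var) maps it to (0, 1], higher is better
    return float(np.exp(-np.var(slopes))) if len(slopes) > 0 else 1.0


def combined_score(values, prev_value, trajectories, alpha=0.3, beta=0.2) -> float:
    """Score = Advantage + alpha * StepVariance + beta * SlopeVariance."""
    advantage = np.exp(np.mean(values) - prev_value)   # progress over the current state
    step_certainty = np.exp(-np.var(values))           # agreement across rollouts
    stability = compute_slope_variance(trajectories)   # smoothness within rollouts
    return float(advantage + alpha * step_certainty + beta * stability)
```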
Dynamic compute spending
MAXS automatically stops "thinking" when it's confident, unlike fixed-step chains that waste money on easy problems. If the lookahead trajectories for different candidates start producing similar outcomes, further exploration adds no value.
MAXS monitors variance and halts rollouts when consistency falls below a threshold δ. This early stopping saves significant compute on "easy" decisions where the right choice is clear. Hard problems get more thinking time; easy problems get resolved quickly.
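A minimal sketch of that stopping rule, reusing the terminal values from the rollouts above; the paper's exact convergence test may differ:

```python
import numpy as np


def rollouts_converged(terminal_values, delta=0.1) -> bool:
    """Stop sampling more rollouts once their terminal values agree closely (threshold δ)."""
    return len(terminal_values) >= 2 and float(np.var(terminal_values)) < delta


# Example: three lookahead rollouts that land on nearly the same value
values = [0.81, 0.83, 0.80]
if rollouts_converged(values, delta=0.1):
    print("Converged: commit to this candidate without further rollouts")
```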
Risk-Aware Scoring
The multi-signal approach prevents MAXS from being fooled by shallow progress. Think of it as a risk management layer for reasoning:
Optimal configuration: α = 0.3 (step variance weight), β = 0.2 (slope variance weight)
| Signal | Technical Name | Business Translation | What It Measures |
|---|---|---|---|
| Advantage | R^adv | Expected Profit | Is this step better than alternatives? |
| Step Variance | R^step | Certainty | Are we confident it works? |
| Slope Variance | R^slope | Stability | Is progress sustainable or erratic? |
Why "expected profit" alone is insufficient. A step might show high immediate reward but lead to uncertain outcomes. The model is making progress but into risky territory. Purely profit-based selection would chase volatile gains.
Why certainty matters. Low step variance indicates the agent has found a stable reasoning region. The lookahead trajectories agree on the outcome, suggesting robustness. High variance means the path is sensitive to small perturbations—a red flag for production systems.
Why stability matters. Even with high certainty at the endpoint, the trajectory might be oscillating rather than steadily improving. Slope variance captures trend smoothness. A steadily improving trajectory is preferable to one that jumps around unpredictably.
The combination produces more reliable selections. The paper shows that adding risk signals (α=0.3, β=0.2) improves accuracy by 8.3 points over profit-only scoring.
Benchmark Results
MAXS was evaluated on five reasoning benchmarks using MiMo-VL-7B and Qwen2.5-VL-7B as base models.
Accuracy (%) across five reasoning benchmarks
Main results
| Dataset | CoT | ToT | MCTS | MAXS |
|---|---|---|---|---|
| MathVista | 77.2 | 73.9 | 75.3 | 85.5 |
| OlympiadBench | 41.6 | 43.0 | 26.6 | 48.5 |
| EMMA | 33.3 | 39.3 | 29.0 | 46.7 |
| TheoremQA | 46.9 | 59.3 | 40.5 | 61.0 |
| MATH | 65.7 | 69.7 | 72.7 | 75.7 |
Results shown for MiMo-VL-7B. MAXS achieves the highest accuracy on all five benchmarks.
Key observations
MathVista sees the largest gains. MAXS improves from 77.2% (CoT) to 85.5%, an 8.3 point increase. MathVista requires visual reasoning combined with mathematical computation, exactly the type of multi-step problem where lookahead helps most.
MCTS underperforms despite exhaustive search. On OlympiadBench, MCTS achieves only 26.6% compared to MAXS at 48.5%. The exhaustive search strategy dilutes focus across too many branches, while MAXS concentrates compute on promising candidates.
Smaller model shows larger relative improvements. On Qwen2.5-VL-7B (not shown in table), MAXS improves average accuracy from 35.2% (CoT) to 42.9%, a 22% relative gain.
Inference Efficiency
Raw accuracy numbers hide a critical dimension: computational cost. MAXS achieves better results with far fewer tokens.
Inference-time scaling: accuracy vs. token consumption
Token consumption comparison
| Method | Total Tokens | Accuracy |
|---|---|---|
| CoT | 2.67 × 10⁷ | 52.9% |
| Guided Decoding | 1.67 × 10⁸ | 47.9% |
| φ-Decoding | 7.66 × 10⁸ | 54.5% |
| MAXS | 9.86 × 10⁸ | 63.5% |
| ToT | 6.40 × 10¹⁰ | 57.0% |
| MCTS | 9.91 × 10¹⁰ | 48.8% |
MAXS uses roughly the same tokens as φ-Decoding but achieves 9 points higher accuracy. Compared to MCTS, MAXS uses 100x fewer tokens while scoring 15 points higher.
In dollar terms: at any given API price, MCTS spends roughly 100x more per problem on tokens than MAXS. If MAXS costs a cent per problem, MCTS costs a dollar, and gets worse results. That's the difference between a viable production system and a research curiosity.
The scaling sweet spot
The scatter plot reveals MAXS occupying an optimal position: top-left quadrant where accuracy is high and token usage is moderate. Tree-based methods (ToT, MCTS) sit in the bottom-right, paying massive token costs for mediocre results.
Recent work shows that spending more compute at inference (via longer chains, search, or verification) can improve LLM performance. But the returns diminish quickly. MAXS demonstrates that targeted lookahead is more efficient than exhaustive exploration.
What Actually Drives Performance?
Which components of MAXS are worth building, and which can be skipped to save complexity? The researchers removed each component individually to measure its ROI.
Ablation study: impact of removing each MAXS component
Component contributions
| Configuration | MiMo-VL-7B | Qwen-7B |
|---|---|---|
| MAXS (Full) | 63.46% | 42.85% |
| w/o Lookahead | 58.50% | 33.41% |
| w/o Advantage | 60.96% | 36.94% |
| w/o Step Variance | 61.35% | 37.67% |
| w/o Slope Variance | 62.41% | 38.79% |
| w/o Trajectory Convergence | 63.03% | 41.60% |
Key findings
Lookahead is critical. Removing lookahead drops accuracy by 5 points on MiMo-VL and nearly 10 points on Qwen. This confirms that seeing future consequences of current decisions is the core mechanism.
Variance signals matter more for weaker models. Removing step variance hurts Qwen-7B by 5 points but MiMo-VL-7B by only 2 points. Weaker models benefit more from the stabilizing effect of variance-based selection.
Trajectory convergence has minimal accuracy impact. The accuracy drop from removing it is small (0.43 points for MiMo-VL), but this component reduces token usage by stopping early when further lookahead would not change the decision.
Lookahead depth analysis
Comparing 4-step vs 6-step lookahead depth
The paper examines whether deeper lookahead (6 steps instead of 4) helps. The answer: barely. Six-step lookahead achieves 85.8% vs 85.5% for 4-step on MathVista, while using 49.8% more tokens. The cost-benefit ratio favors the shallower lookahead.
Tool Strategy Guide
MAXS supports tool use (code execution, web search). The data reveals clear patterns for when to prioritize each tool.
Accuracy drop when removing each tool capability
Overall impact
| Configuration | Average Accuracy | Change (pts) |
|---|---|---|
| Both Tools | 63.46% | baseline |
| No Code | 60.81% | -2.65 |
| No Search | 56.36% | -7.10 |
| No Tools | 52.06% | -11.40 |
Web search contributes more to overall accuracy than code execution. Removing search drops performance by 7.1 points, while removing code drops it by only 2.65 points.
Task-specific tool importance
The overall numbers mask task-specific dependencies. On MathVista (a visual mathematical reasoning benchmark):
| Configuration | MathVista Accuracy | Change (pts) |
|---|---|---|
| Full MAXS | 85.5% | baseline |
| No Code | 70.8% | -14.7 |
| No Search | 81.4% | -4.1 |
Here the pattern reverses. Code execution is critical (14.7 point drop without it), while search contributes less (4.1 points). Mathematical problems require symbolic computation that code excels at.
To see why, consider two concrete examples:
- Fact-retrieval question: "What is the capital of Mongolia?" The optimal answer requires a factual lookup. When MAXS has web search enabled, it queries and receives "Ulaanbaatar." Disabling search forces the model to rely on memorized knowledge, which may be incomplete or outdated.
- Symbolic math problem: "Compute the determinant of [[2,-1,0],[0,3,-2],[1,0,4]]." Solving this efficiently requires exact arithmetic. With code execution, MAXS generates a call such as np.linalg.det(...) and runs it, obtaining the exact answer (see the snippet after this list). Without code, the model must perform the arithmetic mentally, often introducing rounding errors or algebraic mistakes.
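For illustration, the kind of snippet a code tool would run for the matrix above (plain NumPy; the exact tool-call format is not specified in the paper):

```python
import numpy as np

# Determinant of the 3x3 matrix from the example above
m = np.array([[2, -1,  0],
              [0,  3, -2],
              [1,  0,  4]])
print(np.linalg.det(m))  # 26.0 (up to floating-point error)
```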
Tool selection heuristic
| Task Type | Prioritize | Expected Gain | Example |
|---|---|---|---|
| Math/Logic/Symbolic | Code Execution | +15 points | Matrix operations, equation solving |
| Knowledge/Facts | Web Search | +7 points | Current events, entity lookups |
| Mixed reasoning | Both Tools | +11 points | Multi-step problems with fact retrieval |
Practical implication: Tool availability should match task type. For a math tutoring agent, code execution is non-negotiable. For a research assistant, web search is the priority. For general-purpose agents, enable both.
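One way to encode that heuristic in an agent is a simple routing function. This is purely illustrative; the task categories and tool names below are assumptions, not from the paper.

```python
def select_tools(task_type: str) -> list:
    """Map a coarse task category to the tools worth enabling."""
    routing = {
        "math": ["code_execution"],                 # symbolic/numeric work: code is non-negotiable
        "knowledge": ["web_search"],                # fact lookups: search is the priority
        "mixed": ["code_execution", "web_search"],  # multi-step problems with fact retrieval
    }
    return routing.get(task_type, ["code_execution", "web_search"])  # default: enable both


print(select_tools("math"))   # ['code_execution']
```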
Implementation Blueprint
Core algorithm
```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class MAXSConfig:
    lookahead_depth: int = 4
    num_candidates: int = 4
    num_rollouts: int = 3
    alpha: float = 0.3                 # step variance weight
    beta: float = 0.2                  # slope variance weight
    convergence_threshold: float = 0.1


def maxs_step(
    model: LLMAgent,
    state: ReasoningState,
    tools: List[Tool],
    config: MAXSConfig,
) -> Action:
    """Select next action using lookahead + multi-signal scoring."""
    # Generate candidate next steps
    candidates = model.generate_candidates(state, n=config.num_candidates)

    # Score each candidate via lookahead
    scored = []
    for candidate in candidates:
        # Simulate N future trajectories (can be batched for efficiency)
        trajectories = [
            rollout(model, state.apply(candidate), config.lookahead_depth, tools)
            for _ in range(config.num_rollouts)
        ]
        # Extract terminal values
        values = np.array([t.final_value for t in trajectories])

        # Three scoring signals
        advantage = np.exp(values.mean() - state.value)    # expected profit
        step_var = np.exp(-values.var())                   # certainty
        slope_var = compute_slope_variance(trajectories)   # stability

        score = advantage + config.alpha * step_var + config.beta * slope_var
        scored.append((candidate, score, values.var()))

        # Dynamic compute: stop expanding further candidates once this one's
        # rollouts converge below the threshold (a high-certainty choice)
        if values.var() < config.convergence_threshold:
            break

    # Return highest-scoring candidate
    return max(scored, key=lambda x: x[1])[0]
```

Production tip: The inner loop over candidates is embarrassingly parallel. Batch all lookahead rollouts into a single model call to cut latency by 3-4x.
Recommended hyperparameters
| Parameter | Value | Notes |
|---|---|---|
| Lookahead depth | 4 | Diminishing returns beyond 4 |
| Number of candidates | 3-5 | Trade-off between diversity and cost |
| Rollouts per candidate | 2-4 | Enough for variance estimation |
| α (step variance weight) | 0.3 | Paper-optimized value |
| β (slope variance weight) | 0.2 | Paper-optimized value |
| Convergence threshold δ | 0.1 | Lower = more aggressive early stopping |
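As a usage sketch, here is how the recommended values would wire into the `MAXSConfig` from the blueprint above, assuming you already have the `model`, `state`, and `tools` objects that `maxs_step` expects:

```python
config = MAXSConfig(
    lookahead_depth=4,           # diminishing returns beyond 4
    num_candidates=4,            # within the 3-5 range above
    num_rollouts=3,              # enough for variance estimation
    alpha=0.3,                   # step variance weight
    beta=0.2,                    # slope variance weight
    convergence_threshold=0.1,   # lower = more aggressive early stopping
)
action = maxs_step(model, state, tools, config)
```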
Integration considerations
Tool execution during lookahead. The paper executes tools during lookahead rollouts. This adds cost but provides accurate value estimates. For cost-sensitive deployments, consider tool-free lookahead with tool execution only on committed steps.
Model selection. MAXS shows larger relative gains on smaller models. If you are using a 7B model, MAXS is likely worthwhile. For frontier models with strong base reasoning, the gains may be smaller.
Batching. Lookahead rollouts can be batched for efficiency. Generate all candidates, then batch their lookahead trajectories through the model in parallel.
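A sketch of that batching pattern, assuming a hypothetical `model.generate_batch` entry point that rolls out many simulated states in one call:

```python
def batched_lookahead(model, state, candidates, config, tools):
    """Score all candidates' rollouts in one batched pass instead of a nested loop."""
    # One simulated start state per (candidate, rollout) pair
    starts = [state.apply(c) for c in candidates for _ in range(config.num_rollouts)]
    # A single batched call replaces num_candidates * num_rollouts sequential calls
    trajectories = model.generate_batch(starts, depth=config.lookahead_depth, tools=tools)
    # Regroup rollouts per candidate for scoring
    k = config.num_rollouts
    return [(c, trajectories[i * k:(i + 1) * k]) for i, c in enumerate(candidates)]
```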
Limitations
Latency cost
While 100x cheaper than MCTS in tokens, MAXS still adds significant latency. Looking ahead 4 steps means 4-10x longer response times compared to single-stream generation. Each decision point requires multiple forward passes for lookahead simulation.
Where MAXS shines: Offline batch processing, complex queries where accuracy matters more than speed, high-stakes decisions (legal, medical, financial analysis).
Where MAXS struggles: Real-time chat, interactive applications where users expect sub-second responses, simple queries that don't need lookahead.
Value estimation quality
The value estimation depends on the model's ability to judge its own progress. If the base model cannot reliably estimate solution quality, the lookahead signals become noisy.
Tool-dependent gains
Most of the impressive MathVista results come with code execution enabled. Without tools, MAXS still outperforms baselines but by smaller margins. The approach works best when appropriate tools are available.
Limited evaluation scope
The benchmarks focus on mathematical and logical reasoning. Performance on open-ended generation, creative tasks, or dialogue is unexplored.
Code: GitHub
Authors: Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, Li Yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu
Cite this paper
Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, Li Yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu (2026). MAXS: The 'Measure Twice, Cut Once' Agent Architecture. arXiv 2026.