- The Problem. LLM-based reward models for tool-using agents give inconsistent signals because they evaluate entire reasoning chains without distinguishing planning errors from execution errors.
- The Solution. SCRIBE introduces a mid-level abstraction layer: Skill Prototypes capture common reasoning patterns, and a Router maps each step to the appropriate prototype for structured evaluation.
- The Results. On Qwen3-4B: MATH500 89.1% → 95.8% (+6.7pp), AIME25 43.3% → 63.3% (+20.0pp), BFCL v4 33.0% → 51.3% (+18.3pp). The approach generalizes across mathematical reasoning and function calling.
Research Overview
If you have ever built a tool-using agent, you know the frustration: the agent fails, but you do not know why. Did it misunderstand the goal? Did it pick the wrong tool? Or did it make a logic error somewhere in between?
Most agent frameworks treat reasoning as a black box. You give a prompt, you get a result. If the result is wrong, you tweak the prompt and hope for the best. This is like managing a junior developer by only checking their final code, never their logic.
Tool use (or function calling) lets an LLM invoke external functions during a conversation. The model decides when to call a tool, generates the arguments, and receives the result. This enables agents to search the web, run code, query databases, or perform calculations.
SCRIBE changes this by treating reasoning not as magic, but as a structured workflow.
Instead of letting the model wander through a problem, SCRIBE forces it to use defined "Skill Prototypes": reusable reasoning patterns like "Check Constraints," "Synthesize Data," or "Verify Bounds." It effectively gives the agent a checklist for every step of the problem.
The result? An agent that does not just guess the right tools, but follows a verifiable, auditable process. This architecture boosted success rates on complex math problems by 20 percentage points and improved function-calling reliability by 18pp, proving that the secret to better agents is not just bigger models, but better supervision.
The Missing Middle
Most work on tool-using agents focuses on either high-level planning or low-level tool execution. SCRIBE identifies the overlooked middle layer.
High-level: Strategic planning and problem decomposition. "To solve this optimization problem, I'll use Lagrange multipliers."
Mid-level: Skill execution and subgoal completion. "Apply the constraint, differentiate, solve the system of equations." This is where SCRIBE operates.
Low-level: Individual tool calls and their execution. "calculator(derivative(f, x))" returning "2x + 3".
Put simply: Strategy (high) → Tactics (mid) → Execution (low). SCRIBE supervises the tactics.
Three-Level Reasoning Hierarchy
SCRIBE targets the overlooked mid-level between planning and execution
Before vs After: Where standard agents fail
Consider this math problem: "Find the largest integer c such that c < 44.75 and c satisfies the constraint."
High-level: "I need to find the largest integer below 44.75."
Low-level: calculator(floor(44.75)) returns 44
Output: "The answer is 44."
Problem: The agent skipped checking whether 44 actually satisfies the constraint. A standard PRM sees "correct final answer" and rewards this trace, missing the logical gap.
High-level: "I need to find the largest integer below 44.75."
Mid-level (Skill: Bound-Based Conclusion): "Collect bounds, check constraint satisfaction, discretize properly."
Low-level: calculator(floor(44.75)) then verify_constraint(44)
Output: "The answer is 44, verified against the constraint."
Why it works: The Router maps this step to the "Bound-Based Conclusion" prototype, which requires explicit constraint verification. The rubric catches agents that skip this step.
The problem with standard PRMs is that they evaluate entire trajectories without this hierarchical decomposition. An LLM judge sees the full reasoning chain and makes a holistic judgment. This works when trajectories are short, but falls apart for multi-step tool use where errors can occur at any level.
Architecture
SCRIBE decomposes the training process into three stages: skill-level abstraction, structured reward evaluation, and policy optimization.
SCRIBE Training Pipeline
Mid-level skill supervision with prototype-guided reasoning and GRPO optimization
Stage 1: Mid-level skill abstraction
The framework formalizes trajectories as sequences of (subgoal, skill, step) triples. Each step is a non-overlapping span of the reasoning trace.
To build this abstraction, SCRIBE:
- Decomposes trajectories into subgoals and associated skills using an LLM
- Applies hierarchical clustering (HDBSCAN for dense clusters, K-means as fallback) to group semantically similar patterns
- Distills a representative Skill Prototype for each cluster
HDBSCAN is a density-based clustering algorithm that automatically determines the number of clusters and finds clusters of varying shapes. Unlike K-means (which requires specifying the cluster count upfront), it groups tightly packed points together while marking sparse points as noise. This makes it well suited for discovering coherent Skill Prototypes from messy reasoning traces where the "right" number of skills is unknown.
Skill Prototypes aggregate usage contexts, intermediate objectives, and canonical reasoning patterns while abstracting away instance-specific details like concrete numerical values.
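Here is a minimal sketch of that discovery step, assuming skill descriptions have already been extracted by an LLM; the embedding model, cluster sizes, and the noise-based fallback rule are illustrative choices rather than the paper's settings.

import numpy as np
from sklearn.cluster import HDBSCAN, KMeans
from sentence_transformers import SentenceTransformer

# One short LLM-extracted description per (subgoal, skill, step) triple
skill_descriptions = [
    "Collect bounds and conclude the largest valid integer",
    "Verify a candidate against the modular constraint",
    "Differentiate the objective and solve the resulting system",
    "Round a continuous bound down to an integer candidate",
    "Check whether the boundary value satisfies the strict inequality",
    "Apply Lagrange multipliers to the constrained objective",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(skill_descriptions)

# Dense clusters first; paper-scale runs would use a larger min_cluster_size
labels = HDBSCAN(min_cluster_size=2).fit_predict(embeddings)

# Fallback to K-means if HDBSCAN marks most steps as noise (label -1)
if np.mean(labels == -1) > 0.5:
    k = min(400, len(skill_descriptions))  # ~400 skills reported for math
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(embeddings)

# Each cluster's members are then distilled into one Skill Prototype by an LLM
clusters = {c: [d for d, l in zip(skill_descriptions, labels) if l == c]
            for c in set(labels) if c != -1}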
Stage 2: Structured verification
Skill Prototypes transform subjective, open-ended judging into structured verification. Once a step is routed to a prototype, the LLM judge receives a precise, checklist-style rubric.
For example, instead of vaguely rewarding "logical correctness," a prototype for bound-based reasoning requires verifying:
- Did the agent properly address inequality strictness?
- Did it identify necessary bounds?
- Did it avoid common traps like "boundary leak" (failing to discretize a continuous bound)?
Each prototype defines Common Traps and maps them to specific penalty scores. This granularity ensures the reward signal reflects skill mastery rather than outcome bias.
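As a small sketch of what that penalty mapping can look like in code (the function and trap names are ours, not the paper's), a detected trap simply caps the rubric score:

def apply_trap_penalties(rubric_score: int, detected_traps: list[str],
                         trap_caps: dict[str, int]) -> int:
    """Cap the judge's 0-3 rubric score by the penalty of any detected Common Trap."""
    for trap in detected_traps:
        rubric_score = min(rubric_score, trap_caps.get(trap, rubric_score))
    return rubric_score

# Example: the judge awards 3/3, but a detected "boundary leak" caps the score at 2
apply_trap_penalties(3, ["boundary_leak"], {"boundary_leak": 2})  # -> 2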
Stage 3: Policy optimization
The model is optimized using GRPO (Group Relative Policy Optimization) based on skill-conditioned rewards. To handle distributional shifts, SCRIBE implements an adaptive refresh: every 1,000 training steps, it re-clusters accumulated trajectories to update the Skill Prototype library.
Skill Prototypes
Skill Prototypes are the core innovation. They convert noisy LLM evaluation into grounded verification.
A Skill Prototype is a learned template that describes a common reasoning pattern. It includes: when to use the skill, the expected steps, scoring rubrics, and common mistakes (traps) with their penalties. Think of it as a detailed checklist for evaluating a specific type of reasoning.
Example: Skill Prototype card
Here's what the Router actually retrieves when it identifies a "Bound-Based Conclusion" step. This is the structured data that replaces vague LLM judgment:
ID: skill_bound_conclusion_v2
Trigger: Subgoal terminates a reasoning chain by concluding from established bounds or constraints.
Expected Pattern:
- Collect all relevant bounds from prior steps
- Check whether tightness/feasibility must be addressed
- Apply appropriate discretization (continuous → integer)
Scoring Rubric:
| Score | Criteria |
|---|---|
| 3 | Rigorous conclusion, tightness addressed |
| 2 | Correct conclusion, minor slip |
| 1 | Major logical gap |
| 0 | Wrong or premature conclusion |
Common Trap: "Boundary leak" (concluding c < 44.75 without stating c ≤ 44). Auto-penalty: Score capped at 2.
Confidence Threshold: 0.85 (Router must exceed this confidence to apply this prototype; below threshold triggers fallback to generic evaluation)
This card is what the LLM judge receives instead of "evaluate whether this reasoning is correct." The judge now has concrete checkboxes: Did the agent collect bounds? Did it discretize? Did it avoid the boundary leak trap?
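One way to hold a card like this in code is a small data structure that the judge prompt can be generated from. The sketch below mirrors the card's fields, but the class and field names are our own, not a released API:

from dataclasses import dataclass

@dataclass
class SkillPrototype:
    id: str
    trigger: str
    expected_pattern: list[str]
    rubric: dict[int, str]         # score -> criteria
    trap_caps: dict[str, int]      # trap name -> maximum score if detected
    confidence_threshold: float = 0.85

BOUND_CONCLUSION = SkillPrototype(
    id="skill_bound_conclusion_v2",
    trigger="Subgoal terminates a reasoning chain by concluding from established bounds",
    expected_pattern=[
        "Collect all relevant bounds from prior steps",
        "Check whether tightness/feasibility must be addressed",
        "Apply appropriate discretization (continuous -> integer)",
    ],
    rubric={
        3: "Rigorous conclusion, tightness addressed",
        2: "Correct conclusion, minor slip",
        1: "Major logical gap",
        0: "Wrong or premature conclusion",
    },
    trap_caps={"boundary_leak": 2},
)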
Prototype discovery
The paper uses HDBSCAN clustering on skill embeddings to discover prototypes automatically. For math domains, this yielded 418 initial clusters, growing to 424 after training converged. The prototypes capture patterns like:
- Algebraic manipulation
- Constraint optimization
- Bound-based reasoning
- Numerical approximation
- Case analysis
The Router
The Router is a lightweight model (Qwen3-4B in the paper) that maps raw trajectories into structured (subgoal, skill, step) triples during RL training.
Think of the Router as a triage nurse in an emergency room. When a patient (reasoning step) arrives, the triage nurse quickly assesses symptoms and assigns them to the appropriate specialist (Skill Prototype). The nurse does not treat the patient directly. That separation is critical: the "doctor" (Policy Model) focuses entirely on treatment (reasoning), while the nurse handles classification. If the doctor also had to do triage, they might learn to game the system by classifying cases as "easy" rather than actually solving hard problems.
The Router allows the main policy model to focus on reasoning while a specialized model handles skill classification. This separation prevents the policy from gaming the reward by learning to classify rather than reason.
Router mapping example
Given this trace for "Find largest int c < 44.75 where c mod 5 = 3":
1. "Need largest int < 44.75"
2. floor(44.75) → 44
3. check(44 mod 5) → False
4. try(44 - 1) → 43
5. check(43 mod 5) → True
The Router outputs structured triples:
| Subgoal | Skill | Step |
|---|---|---|
| Identify bound | Bound Identification | Line 1 |
| Compute floor | Tool Invocation | Line 2 |
| Check constraint | Bound-Based Conclusion | Line 3 |
| Adjust candidate | Iterative Refinement | Line 4 |
| Verify solution | Bound-Based Conclusion | Line 5 |
Each skill maps to a Prototype with its own rubric. The "Bound-Based Conclusion" prototype (used twice) checks whether the agent properly discretized and verified constraints.
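A sketch of the prototype lookup with the confidence fallback noted on the card above, reusing the SkillPrototype sketch from the previous section (the function name and library keys are ours):

from typing import Optional

def retrieve_prototype(skill_label: str, router_confidence: float,
                       library: dict[str, SkillPrototype]) -> Optional[SkillPrototype]:
    """Return the matching prototype, or None to trigger generic evaluation."""
    prototype = library.get(skill_label)
    if prototype is None or router_confidence < prototype.confidence_threshold:
        return None  # below threshold: fall back to a generic, prototype-free judge
    return prototype

library = {"bound_based_conclusion": BOUND_CONCLUSION}
retrieve_prototype("bound_based_conclusion", 0.93, library)  # -> the prototype card
retrieve_prototype("bound_based_conclusion", 0.70, library)  # -> None (fallback)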
Router accuracy
The paper reports Router evaluation on held-out trajectories. Step segmentation achieves 91.2% span-level exact match, skill prediction reaches 94.7%, and prototype retrieval hits 98.6%. The high prototype retrieval accuracy means the Router reliably connects student-generated subgoals to the appropriate evaluation rubrics.
GRPO Training
SCRIBE uses GRPO (Group Relative Policy Optimization) rather than standard PPO or DPO.
GRPO samples multiple outputs for each prompt and computes rewards relative to the group. Unlike PPO (which requires a value function) or DPO (which needs preference pairs), GRPO directly optimizes the policy using group-normalized rewards. This is the same optimization technique popularized by DeepSeek-V3 and DeepSeek-R1 for reasoning models, confirming its effectiveness for complex chain-of-thought tasks.
GRPO in action
For each training prompt, the policy generates 5 candidate solutions. Each receives a combined score: outcome reward (1 if correct, 0 otherwise) plus skill-prototype scores (0-3 per step).
Suppose the raw scores are: [3.5, 2.8, 3.0, 1.9, 2.2]
GRPO normalizes by subtracting the group mean (2.68):
- Candidate 1: 3.5 → +0.82 (strongest positive update)
- Candidate 2: 2.8 → +0.12
- Candidate 3: 3.0 → +0.32
- Candidate 4: 1.9 → -0.78 (negative update)
- Candidate 5: 2.2 → -0.48
The policy gradient pushes toward candidates that score above average and away from those below. This group-relative signal eliminates the need for a separate value network (PPO) or pairwise preference data (DPO).
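The same computation as a short sketch. Note that the standard GRPO formulation also divides by the group's standard deviation; the walkthrough above only mean-centers, so that scaling is shown as an option:

import numpy as np

def group_advantages(rewards, scale_by_std=False):
    """Group-relative advantages: subtract the group mean, optionally scale by std."""
    rewards = np.asarray(rewards, dtype=float)
    advantages = rewards - rewards.mean()
    if scale_by_std:
        advantages = advantages / (rewards.std() + 1e-8)
    return advantages

print(group_advantages([3.5, 2.8, 3.0, 1.9, 2.2]))
# ≈ [ 0.82  0.12  0.32 -0.78 -0.48]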
Reward weighting
The paper finds optimal performance with 0.3 process-level (mid-level skill scores) and 0.7 outcome-level (final answer correctness) reward weighting.
Process-level: Scores from individual Skill Prototypes (e.g., "how well did the agent perform this bound-checking step?"). Evaluates intermediate reasoning quality.
Outcome-level: Whether the final answer is correct (1 or 0). Captures end-to-end success but provides sparse signal.
Pure process supervision over-constrains the model (some valid reasoning paths look unconventional). Pure outcome supervision gives no signal on why an answer was wrong. The 0.3/0.7 blend balances granular feedback with end-goal focus.
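A sketch of the blend. How per-step prototype scores are aggregated before blending is not spelled out above, so the averaging and 0-3 → 0-1 rescaling below are our assumptions:

def combined_reward(step_scores: list[int], outcome_correct: bool,
                    process_weight: float = 0.3) -> float:
    """Blend per-step prototype scores (0-3) with final-answer correctness."""
    process = (sum(step_scores) / len(step_scores)) / 3.0 if step_scores else 0.0
    outcome = 1.0 if outcome_correct else 0.0
    return process_weight * process + (1.0 - process_weight) * outcome

combined_reward([3, 2, 3], True)   # ≈ 0.3 * 0.89 + 0.7 * 1.0 ≈ 0.97
combined_reward([3, 3, 3], False)  # = 0.3: rigorous steps, wrong final answer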
Co-evolution dynamics
A key finding: improving mid-level execution stability precedes and enables high-level planning improvements, despite no explicit planning supervision.
The "learning to drive" analogy: You cannot effectively plan a route across town (high-level) until you have mastered the subconscious skills of steering and braking (mid-level). SCRIBE observes the same pattern: mid-level skill mastery comes first, then high-level planning improves as a consequence.
Phase 1 (Novice): "Turn wheel 15°, check mirror, ease brake..."
→ High cognitive load, errors frequent
Phase 2 (Expert): "Take Highway 101 to avoid traffic"
→ Steering is automatic, bandwidth freed for planning
The paper measures this through:
- Mid-level uncertainty: Entropy of skill predictions (decreases during training)
- Plan selection ability: How well the model chooses correct approaches
- Plan separability: How distinguishable correct vs incorrect plans are
Mid-level uncertainty drops first, followed by gains in plan selection. The levels co-evolve without direct optimization of planning.
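For reference, a minimal sketch of the mid-level uncertainty metric as described (entropy of the Router's skill distribution for a step):

import math

def skill_entropy(probs: list[float]) -> float:
    """Entropy (nats) of the Router's skill distribution for one step."""
    return -sum(p * math.log(p) for p in probs if p > 0)

skill_entropy([0.90, 0.05, 0.05])        # ≈ 0.39: confident skill prediction
skill_entropy([0.25, 0.25, 0.25, 0.25])  # ≈ 1.39: maximally uncertain over 4 skills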
Benchmark Results
SCRIBE achieves consistent improvements across mathematical reasoning and function calling using only ~10k training problems. For context, many agent fine-tuning approaches require hundreds of thousands or millions of examples. This data efficiency matters for teams building custom agents on proprietary tasks where labeled data is expensive.
SCRIBE Benchmark Results on Qwen3-4B
Mid-level supervision yields consistent gains across math reasoning and tool use
Mathematical reasoning (Qwen3-4B)
| Benchmark | Base → SCRIBE | Gain |
|---|---|---|
| MATH500 | 89.1 → 95.8 | +6.7pp |
| AIME25 | 43.3 → 63.3 | +20.0pp |
The AIME25 improvement (+20 percentage points) is particularly notable. AIME problems require multi-step reasoning where mid-level skill coordination matters most. SCRIBE also outperforms intermediate baselines like PRM (51.7) and EGPO (48.3).
Tool use (BFCL v4)
| Category | Base → SCRIBE | Gain |
|---|---|---|
| Overall | 33.0 → 51.3 | +18.3pp |
| Multi-step | 19.4 → 33.3 | +13.9pp |
| Single-step | 46.6 → 69.3 | +22.7pp |
SCRIBE outperforms strong baselines including step-level PRM and EGPO across all categories. The multi-step improvements demonstrate that structured mid-level supervision helps with complex tool orchestration.
Smaller models
On LLaMA-3.2-3B, SCRIBE improves MATH500 from 48.3% (PRM baseline) to 63.4%, and BFCL Overall from 24.8% to 30.8%. This suggests the approach can work on smaller models, though only LLaMA-3.2-3B was explicitly tested beyond Qwen3-4B.
Ablation Studies
Complementarity with low-level optimization
SCRIBE combines additively with FunRL (a low-level tool optimization method):
| Method | BFCL Overall |
|---|---|
| Baseline | 22.0% |
| FunRL only | 25.2% |
| SCRIBE only | 29.1% |
| FunRL + SCRIBE | 33.4% |
The gains stack because mid-level skill coordination and low-level tool reliability are orthogonal optimization targets.
Reward weighting ablation
| Process Weight | MATH500 | AIME25 |
|---|---|---|
| 0.0 (outcome only) | 93.4 | 55.0 |
| 0.1 | 94.2 | 58.3 |
| 0.3 | 95.8 | 63.3 |
| 0.5 | 95.0 | 61.7 |
| 0.7 | 94.4 | 58.3 |
The sweet spot is 0.3 process / 0.7 outcome. Too much process supervision over-constrains the model; too little loses the mid-level signal.
Implementation Blueprint
You are building a "grading system" for your agent. Instead of vaguely asking "was this reasoning good?", you create specific checklists (Skill Prototypes) and a classifier (Router) that matches each reasoning step to the right checklist. Then you train your agent using those structured grades.
What you need
| Component | Choice | What it does |
|---|---|---|
| Policy Model | Qwen3-4B | The agent you are training |
| Router | Qwen3-4B | Classifies reasoning steps |
| Clustering | HDBSCAN | Groups similar steps into skills |
| Optimization | GRPO | Trains using relative rewards |
| Judge | LLM-based | Scores steps against rubrics |
Key numbers
- Prototype refresh: Every 1,000 training steps (skills evolve as the model improves)
- Reward blend: 30% step-by-step scores, 70% final answer correctness
- Initial skills: Expect ~400 distinct patterns for math domains
- Scoring: 0-3 scale for each skill (0 = wrong, 3 = rigorous)
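The same numbers, collected into a configuration sketch (the key names are ours):

SCRIBE_CONFIG = {
    "prototype_refresh_steps": 1_000,    # re-cluster the skill library
    "reward_process_weight": 0.3,        # mid-level skill scores
    "reward_outcome_weight": 0.7,        # final-answer correctness
    "expected_initial_prototypes": 400,  # math domain; grows slightly during training
    "step_score_range": (0, 3),          # per-skill rubric scale
    "grpo_group_size": 5,                # candidates per prompt (see GRPO section)
}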
Step-by-step workflow
Step 1: Generate example reasoning traces. Run your base model on ~500 problems. Save the full chain-of-thought outputs, including any tool calls. These are your "seed trajectories." Think of this as collecting homework assignments before you create the grading rubric.
Step 2: Break each trace into labeled chunks. Use an LLM to split each trace into individual reasoning steps. For each step, identify: what subgoal it accomplishes, what skill it uses (e.g., "bound checking"), and the exact text span. This is like a teacher marking where each part of a student's work begins and ends.
Step 3: Discover recurring patterns (Skill Prototypes). Feed all the labeled steps into a clustering algorithm. Steps that look similar (e.g., all "check if value satisfies constraint" steps) get grouped together. Each cluster becomes a Skill Prototype with its own grading rubric. You are building the answer key.
Step 4: Train the Router. The Router is a small model that learns to classify new reasoning steps into the skill categories you discovered. When your agent writes "let me verify 44 mod 5", the Router should recognize this as a "constraint verification" step.
Step 5: Train with structured feedback. Run GRPO. For each problem, generate multiple candidate solutions. The Router classifies each step, the LLM judge scores it against the matching rubric, and the combined scores determine which candidates get reinforced. Better reasoning patterns get higher rewards.
Step 6: Update the skill library periodically. As the model improves, its reasoning patterns change. Re-cluster every 1,000 steps to keep your Skill Prototypes aligned with what the model actually does.
Code example: Step 2 (decomposition)
The decomposition step is where you convert raw reasoning into structured data. Here is one way to do it with Pydantic and structured outputs:

from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

class ReasoningStep(BaseModel):
    subgoal: str  # What is being accomplished
    skill: Literal[
        "algebraic_manipulation",
        "bound_checking",
        "case_analysis",
        "tool_invocation",
        "constraint_verification",
    ]
    step: str  # Exact text span from the trace

class TraceDecomposition(BaseModel):
    steps: list[ReasoningStep]

# Parse one raw chain-of-thought string from Step 1 into structured steps.
# The instructor library offers a similar interface to OpenAI structured outputs.
client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    response_format=TraceDecomposition,
    messages=[{
        "role": "user",
        "content": f"Decompose this reasoning trace into steps: {trace}",
    }],
)
decomposition = response.choices[0].message.parsed

The LLM output is then clustered to discover Skill Prototypes. Start with a few hundred seed trajectories to bootstrap the prototype library.
What you need to run this
- RL training setup: A framework that supports GRPO (or you can adapt PPO). Libraries like TRL or custom PyTorch loops work.
- Clustering pipeline: scikit-learn's HDBSCAN is sufficient. Run offline between training phases.
- LLM API access: For the judge that scores steps against rubrics. Budget for ~5x your training token volume.
- Two models in memory: The policy and the Router run simultaneously during training. Plan GPU memory accordingly.
Common mistakes
The Router slows everything down. Every reasoning step needs a Router call during training. If you are already GPU-bound, this becomes a bottleneck. Fix: batch Router calls across trajectories, run it asynchronously, or shrink the Router to 0.5B parameters with INT8 quantization.
Your skill categories are too specific or too vague. Clustering hyperparameters matter. If you set min_cluster_size too high, you get generic skills like "do math." Too low, and you get hyper-specific skills like "multiply by 7." Start with defaults, then tune on a validation set.
Garbage in, garbage out. Your Skill Prototypes come from your seed trajectories. If your base model writes bad reasoning, you will discover "bad reasoning patterns" and build rubrics that reward them. Use a stronger model (or human examples) to generate seeds, even if your target model is smaller.
The agent learns to game the classifier. If your policy model can influence which skill gets assigned, it may learn to "look like" an easy skill to get higher scores. The paper prevents this by keeping the Router separate. If you merge them for convenience, you break the system.
Business Implications
SCRIBE's technical improvements translate directly to operational benefits for teams deploying tool-using agents.
Reduced token costs
When agents fail, they retry. Each retry costs tokens. The +18pp improvement on multi-step BFCL tasks means agents get stuck in loops less often.
Concrete example: A customer service agent handling 10,000 queries/day with 3 tool calls per query:
- Before SCRIBE: 30% retry rate → 9,000 extra API calls/day
- After SCRIBE: 15% retry rate → 4,500 extra API calls/day
- Savings: ~4,500 fewer retry calls/day → ~1.6M fewer API calls/year
Assuming roughly 1,000 tokens per call at $0.01 per 1K tokens, that is about $16K/year in reduced API costs for a single high-volume agent. But the real cost driver is avoiding "human-in-the-loop" fallback: those 4,500 daily errors that previously escalated to human agents represent ~75 hours/day of support staff time (at 1 minute per ticket). At $25/hour, that is $680K/year in labor costs avoided.
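The arithmetic behind those figures, with the per-call token count made explicit as an assumption:

# Back-of-the-envelope savings from the example above; tokens per call,
# price, and retry rates are illustrative assumptions, not measured values.
queries_per_day, calls_per_query = 10_000, 3
retry_before, retry_after = 0.30, 0.15

extra_calls_before = queries_per_day * calls_per_query * retry_before  # 9,000/day
extra_calls_after = queries_per_day * calls_per_query * retry_after    # 4,500/day
saved_calls_per_year = (extra_calls_before - extra_calls_after) * 365  # ~1.64M

tokens_per_call, price_per_1k_tokens = 1_000, 0.01
api_savings = saved_calls_per_year * tokens_per_call / 1_000 * price_per_1k_tokens
print(round(api_savings))  # ≈ 16,425 -> roughly $16K/year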
Faster debugging
Standard agent failures produce opaque errors: "The agent gave a wrong answer." SCRIBE's structured evaluation pinpoints where the failure occurred.
Concrete example:
- Before SCRIBE: Engineers spend 2-3 hours per incident tracing logs, guessing whether the fault was planning, skill execution, or tool invocation
- After SCRIBE: The Skill Prototype audit log shows "Bound-Based Conclusion scored 1/3 due to boundary leak" → investigation takes 10-15 minutes
- Savings: ~80% reduction in debugging time per incident
Auditability for regulated industries
Financial services, healthcare, and legal applications require explainable AI decisions. Skill Prototypes create an audit trail: "The agent used the Constraint Optimization skill with score 3/3, then the Bound-Based Conclusion skill with score 2/3 due to a boundary leak." This structured log is far more useful for compliance teams than "the model produced output X."
Lower barrier to custom agents
The ~10k training examples requirement (vs. millions for some approaches) means smaller teams can train domain-specific agents on proprietary data.
Concrete example:
- Standard RLHF/fine-tuning: Requires 500K-1M annotated examples → weeks of annotation, $50K+ labeling cost
- SCRIBE approach: ~10K curated examples → ~1 day of annotation, ~$500 labeling cost
- Savings: ~99% reduction in data requirements
A legal tech startup with 5,000 annotated contract analysis examples could feasibly fine-tune their own SCRIBE-style agent, rather than relying solely on prompt engineering with a general-purpose model.
When to consider SCRIBE-style training
The approach makes sense when:
- Your agent performs multi-step reasoning (3+ tool calls per task)
- Reliability matters more than raw speed (customer-facing, regulated)
- You have access to ~10k labeled examples in your domain
- Debugging agent failures is currently painful
It is likely overkill for simple single-turn agents or applications where occasional failures are acceptable.
Limitations
The paper acknowledges several constraints.
Reliance on clustering heuristics
Skill Prototype discovery depends on clustering quality. Poor clusters lead to poor rubrics. The paper uses HDBSCAN with K-means fallback, but optimal clustering for new domains may require tuning.
Evaluation scope
Primary evaluation is on Qwen3-4B and LLaMA-3.2-3B. Transfer to larger models (70B+) is not demonstrated. The mid-level abstraction may need adjustment for models with different reasoning patterns.
Domain specificity
Testing focuses on mathematical reasoning and structured function calling. Open-ended generation tasks (creative writing, open-domain QA) are not evaluated. The skill prototype approach may be less applicable to domains without clear reasoning patterns.
No public code
As of publication, code is not released. Implementation requires reproducing the pipeline from paper descriptions.
The Bottom Line
SCRIBE proves that we do not need smarter models; we need better teachers. The bottleneck in tool-using agents is not raw capability but structured feedback. For practitioners, the move is clear: stop prompting for "better reasoning" and start architecting "better evaluation." Mid-level supervision is the missing piece.
Paper: arXiv:2601.03555. Authors: Yuxuan Jiang, Francis Ferraro (University of Maryland, Baltimore County).
Cite this paper
Yuxuan Jiang and Francis Ferraro (2026). SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models. arXiv preprint arXiv:2601.03555.