- Problem. Building high-quality agents requires extensive manual configuration, and deployed agents struggle to adapt without expensive fine-tuning.
- Solution. Youtu-Agent automates agent generation through YAML-based configuration and provides two optimization paths: in-context learning (no gradients) and scalable RL training.
- Results. 71.47% on WebWalkerQA and 72.8% on GAIA using DeepSeek-V3, 81%+ tool synthesis success, and +5.4% AIME improvements for just $18.
Research overview
Building an LLM agent today is a craftsman’s job. You select tools, write integration code, craft prompts, and iterate until something works. Then your deployed agent hits a new scenario and fails because it can’t adapt without retraining.
An LLM agent is a system where a language model acts autonomously to complete tasks. Unlike a chatbot that just responds to prompts, an agent can use tools (search the web, run code, click buttons), plan multi-step actions, and adapt based on feedback from its environment.
Youtu-Agent addresses both problems:
- Automated generation reduces configuration overhead by generating tools, prompts, and complete agent configurations from task descriptions.
- Continuous optimization lets agents improve through experience, either through in-context learning (no parameter updates) or full reinforcement learning.
The paper demonstrates these capabilities using DeepSeek-V3 and Qwen2.5, though the framework concepts are model-agnostic.
The Practice module’s core idea is Training-free GRPO: Group Relative Policy Optimization without gradient updates. The agent performs multiple rollouts per task, an evaluator scores each trajectory, and the system distills a “semantic group advantage” by comparing successful and failed attempts. The result functions like a “textual LoRA”: learned experiences that guide reasoning at inference time without modifying model weights.
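A minimal sketch of this loop, with hypothetical `rollout`, `score`, and `distill` callables standing in for components the paper does not publish:

```python
# Sketch of Training-free GRPO. `rollout(task, experiences)` runs the agent once,
# `score(task, traj)` is the evaluator, and `distill(task, good, bad)` asks an LLM
# to summarize why the good rollouts beat the bad ones. No weights change: the
# "learning" is a growing list of textual experiences injected at inference time.

def training_free_grpo(tasks, rollout, score, distill, group_size=4, epochs=3):
    experiences = []                                   # the "textual LoRA"
    for _ in range(epochs):
        for task in tasks:
            # 1. Sample a group of rollouts for the same task.
            group = [rollout(task, experiences) for _ in range(group_size)]
            # 2. The evaluator scores every trajectory in the group.
            scored = [(traj, score(task, traj)) for traj in group]
            best = max(s for _, s in scored)
            good = [t for t, s in scored if s == best]
            bad = [t for t, s in scored if s < best]
            # 3. Distill a "semantic group advantage": a natural-language lesson
            #    extracted by comparing successful vs. failed attempts.
            if good and bad:
                experiences.append(distill(task, good, bad))
    return experiences  # prepended to the agent's prompt at deployment time
```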
YAML is a human-readable data format that looks like a structured outline. Instead of writing Python code to configure an agent, you describe what you want in plain text. This makes agent configurations shareable, version-controllable, and automatable. A machine can read the same YAML file and generate working code from it.
The agent development problem
The paper identifies two bottlenecks in current agent development: High Configuration Costs and Static Capabilities.
To understand where Youtu-Agent fits, let’s compare it to existing frameworks:
| Feature | Youtu-Agent | Others |
|---|---|---|
| Tool Creation | Auto-synthesis | Manual |
| Optimization | Practice + RL | Manual tuning |
| Config | YAML | Python code |
Most developers choose a framework like AutoGen for orchestration but are left on their own for tool implementation and optimization. Youtu-Agent automates those specific pain points.
Youtu-Agent architecture
The framework uses a three-layer modular design that separates concerns and enables component reuse across different agents.
Figure: Youtu-Agent three-layer architecture, a modular design enabling automated generation and continuous optimization.
Environment Layer: The foundation providing execution context. Typical backends include Playwright for browser automation, E2B for sandboxed code execution, and shell environments for system commands.
Tools Layer: Atomic and composite operations behind stable interfaces. Tools fall into three categories: environment-related (clicking elements, running commands), standalone utilities (math, text processing), and MCP integrations for external services.
Agent Layer: The LLM-driven planner that orchestrates task completion through a perceive-reason-act loop.
MCP is a standardized protocol for connecting LLMs to external tools and data sources. Think of it like USB for AI: instead of writing custom integrations for every tool, you expose capabilities through MCP and any compatible agent can use them. This allows Youtu-Agent to integrate synthesized tools seamlessly.
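To make the layering concrete, here is a minimal, hypothetical sketch of the Agent layer’s perceive-reason-act loop over tools behind a stable interface; the class and method names are illustrative, not the framework’s API.

```python
# Hypothetical perceive-reason-act loop. `planner` stands in for the LLM
# (e.g. DeepSeek-V3 behind an API); each Tool wraps a browser action, shell
# command, standalone utility, or MCP-exposed service behind the same interface.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[..., str]          # stable interface over any backend

@dataclass
class Agent:
    planner: Callable                # returns the next action given task + history
    tools: dict[str, Tool]
    history: list = field(default_factory=list)

    def solve(self, task: str, max_steps: int = 20) -> str:
        for _ in range(max_steps):
            # Reason: the LLM decides the next action from task + history.
            action = self.planner(task, self.history, list(self.tools))
            if action["type"] == "final_answer":
                return action["content"]
            # Act: invoke the chosen tool; Perceive: record its observation.
            observation = self.tools[action["tool"]].run(**action["args"])
            self.history.append((action, observation))
        return "max steps reached without a final answer"
```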
The Context Manager: Preventing Overflow
LLMs have a fixed “context window” (the amount of text they can process at once). When an agent works on a long task, its history of actions, observations, and tool outputs can exceed this limit. Without management, either the agent crashes or old information gets blindly truncated, losing critical context.
Agents accumulate history that can quickly overflow context windows. The Context Manager is smarter than a simple rolling window; a minimal sketch follows the example below.
Example: In a web search task, the agent visits 5 pages.
- Without Manager: The context fills with thousands of lines of raw HTML from all 5 pages.
- With Manager: It keeps the initial user query and the final extracted answer, but prunes the raw HTML and intermediate navigation steps of the pages it has finished processing. This keeps the “working memory” clean and costs low.
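A minimal sketch of that pruning behavior, assuming each history entry is tagged with its kind and whether its source page has been fully processed (the tagging scheme is an assumption, not the paper’s design):

```python
# Keep the user query and extracted answers; drop raw HTML from finished pages.

def prune_context(history, max_chars=60_000):
    kept = []
    for entry in history:
        if entry["kind"] == "raw_html" and entry.get("page_done"):
            # Replace bulky observations with a one-line marker.
            kept.append({"kind": "note", "text": f"[pruned HTML of {entry['url']}]"})
        else:
            kept.append(entry)   # user query, extracted answers, recent steps
    # Fallback: if still too long, drop the oldest intermediate steps,
    # always preserving index 0 (the initial user query).
    while sum(len(str(e)) for e in kept) > max_chars and len(kept) > 2:
        kept.pop(1)
    return kept
```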
Automated agent generation
The framework provides two generation paradigms: Workflow Mode (for routine tasks) and Meta-Agent Mode (for complex requests).
Workflow Mode follows a deterministic four-stage pipeline: intent clarification, tool retrieval/synthesis, prompt engineering, and configuration assembly. It achieves 100% configuration validity because the pipeline is fixed.
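A hypothetical sketch of that fixed pipeline; the stage functions are placeholders supplied by the caller, and the point is simply that the order never varies and always ends in schema validation:

```python
# Workflow Mode: four stages in a fixed order, ending in schema validation.

def workflow_generate(task_description, stages):
    intent = stages["clarify_intent"](task_description)        # 1. intent clarification
    tools = stages["retrieve_or_synthesize"](intent)           # 2. tool retrieval / synthesis
    prompt = stages["engineer_prompt"](intent, tools)          # 3. prompt engineering
    config = stages["assemble_config"](intent, tools, prompt)  # 4. configuration assembly
    stages["validate_schema"](config)                          # deterministic pipeline, so CV = 100%
    return config
```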
Meta-Agent Mode deploys a higher-level “architect” agent equipped with four capabilities:
- search_tool: Find existing tools in the library
- create_tool: Synthesize new Python tools on demand
- ask_user: Clarify ambiguous requirements through multi-turn dialog
- create_agent_config: Assemble the final YAML configuration
This flexibility lets Meta-Agent handle non-standard requests that would confuse the rigid Workflow pipeline.
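A hypothetical sketch of the architect loop, with the `architect` planner and the capability implementations as stand-ins for components the paper does not publish:

```python
# Meta-Agent Mode: an "architect" LLM chooses, turn by turn, which of its
# four capabilities to invoke until it can emit a final configuration.

def meta_agent_generate(request, architect, capabilities, max_turns=20):
    context = [("user", request)]
    for _ in range(max_turns):
        # The architect returns e.g. {"name": "create_tool", "args": {...}}.
        action = architect(context)
        name, args = action["name"], action["args"]
        result = capabilities[name](**args)   # search_tool / create_tool /
                                              # ask_user / create_agent_config
        if name == "create_agent_config":
            return result                     # final YAML configuration
        context.append((name, result))
    raise RuntimeError("architect did not produce a configuration")
```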
From Prompt to Config: A Real Example
To understand the value, look at what the user inputs versus what the system builds.
User Input:
“Create a research agent that can search for daily papers on ‘agentic AI’, summarize them, and save the PDFs to a local folder.”
System Output (Automated):
- Identifies intent: Need search, summarization, file I/O.
- Checks library: Finds `search_tool`.
- Synthesizes missing tool: Writes a new Python function `fetch_daily_papers` with `requests` and `BeautifulSoup`.
- Generates Config:
```yaml
agent:
  name: Papers_Analyzer_Agent
  instructions: |
    Research assistant for daily
    AI paper analysis...
  toolkits:
    search:
      activated_tools: ["search"]
    arxiv:
      activated_tools: ["download_papers"]
    fetch_daily_papers: {}  # Synthesized
```
Tool synthesis success rate is 81%+, meaning 4 out of 5 times the generated Python code runs correctly on the first try.
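For illustration, here is a hypothetical reconstruction of what a synthesized `fetch_daily_papers` tool might look like. The paper only names the tool and its libraries, so the arXiv API endpoint, parameters, and return format below are assumptions, not the paper’s code:

```python
import requests
from bs4 import BeautifulSoup  # parsing the Atom feed as XML requires lxml

def fetch_daily_papers(query: str = "agentic AI", limit: int = 10) -> list[dict]:
    """Return the most recent arXiv papers matching `query` as title/URL dicts."""
    resp = requests.get(
        "http://export.arxiv.org/api/query",
        params={
            "search_query": f'all:"{query}"',
            "sortBy": "submittedDate",
            "sortOrder": "descending",
            "max_results": limit,
        },
        timeout=30,
    )
    resp.raise_for_status()
    feed = BeautifulSoup(resp.text, "xml")
    return [
        {"title": entry.find("title").get_text(strip=True),
         "url": entry.find("id").get_text(strip=True)}
        for entry in feed.find_all("entry")
    ]
```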
The Practice module
How do you improve an agent without fine-tuning? The Practice module uses Training-free GRPO.
The “Game Tape” Analogy
Think of this like a basketball team watching game tapes to improve.
- Fine-tuning is like sending players to the gym to build muscle. It changes their physical capability (model weights), but it’s expensive and takes time.
- The Practice Module is like analyzing past plays. The team watches what worked and what didn’t to adjust their strategy. They don’t change their bodies, but their playbook (contextual memory) gets smarter.
The $18 vs $10,000 Gap
The cost difference is massive, but so is the performance difference. This is a trade-off, not a free lunch.
Figure: the $18 vs $10,000 gap. The Practice module costs roughly 500x less but yields smaller gains (+5% vs +35%).
What you get for $18:
- Training Set: 100 problems from DAPO-Math-17K
- Learning Cycles: 3 epochs with group comparisons
- Result: +5.4% improvement on AIME 2025 benchmarks
DAPO-Math-17K is a curated dataset of 17,000 mathematical problems designed for training AI reasoning capabilities. The “DAPO” approach emphasizes diverse problem types and difficulty levels, making it effective for teaching agents to generalize across mathematical domains.
The paper explicitly states “approximately $18 learning costs” for the Practice module, making this one of the most cost-effective agent improvement methods available.
Agent RL training
For applications requiring significant and lasting performance improvement, the Agent RL module provides end-to-end reinforcement learning.
Agent-Lightning is a connector framework that bridges agent systems with distributed RL training infrastructure like VeRL. Youtu-Agent integrates with Agent-Lightning to enable scalable training across GPU clusters. The paper achieves a 40% speedup compared to the official Agent-Lightning v0.2.2 release.
Infrastructure Reality Check: While the Practice Module runs on a single laptop (calling APIs), the Agent RL Module is enterprise-grade. The paper’s experiments used up to 128 GPUs to achieve stable training. This path is for research labs and large tech companies, not individual developers.
Stability solutions
In RL training, “entropy” measures how random the model’s outputs are. Entropy explosion occurs when training destabilizes and the model’s policy “collapses” into repetitive, nonsensical outputs. Instead of learning useful behaviors, it gets stuck generating the same garbage tokens over and over.
To prevent entropy explosion, the paper implements three stability fixes:
- Filter invalid and anomalous tool calls from training data (sketched after this list)
- Remove batch shuffling and reduce off-policy update iterations
- Correct bias of advantage estimation in turn-level GRPO training
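A minimal sketch of the first fix, assuming trajectories record their tool calls with parse status and return codes; the field names are illustrative, not the paper’s schema:

```python
def filter_tool_call_trajectories(trajectories):
    """Drop rollouts with invalid or anomalous tool calls before RL updates."""
    clean = []
    for traj in trajectories:
        calls = traj["tool_calls"]
        if not calls:
            continue                                  # no tool use at all
        if any(not c["parsed_ok"] for c in calls):
            continue                                  # malformed call arguments
        if any(c["status"] == "error" for c in calls):
            continue                                  # tool raised or timed out
        if len(calls) > 50:
            continue                                  # anomalous call loops
        clean.append(traj)
    return clean
```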
RL results (Qwen2.5-7B-Instruct)
The Agent RL module was evaluated on Qwen2.5-7B-Instruct across math and search tasks:
The American Invitational Mathematics Examination is a prestigious high school math competition. Problems require multi-step reasoning and creative problem-solving. Using AIME as a benchmark tests whether AI agents can handle genuinely difficult mathematical reasoning, not just textbook exercises.
Math benchmarks:
- AIME 2024: 10% → 45% (+35%)
- AIME 2025: 9% → 31% (+22%)
Search benchmarks (7 datasets):
| Dataset | Before | After | Gain |
|---|---|---|---|
| NaturalQuestions | 24% | 45% | +21% |
| PopQA | 16% | 35% | +19% |
| TriviaQA | 37% | 54% | +17% |
| HotpotQA | 21% | 38% | +17% |
| Bamboogle | 23% | 36% | +13% |
| 2WikiMultiHop | 22% | 32% | +10% |
| MuSiQue | 6% | 14% | +8% |
Figure: search task improvements from Agent RL. Qwen2.5-7B-Instruct before/after RL training on 7 QA benchmarks.
The consistent gains across diverse QA tasks (single-hop and multi-hop) suggest the RL training generalizes beyond the specific training distribution.
Benchmark results
The paper evaluates Youtu-Agent across four dimensions: general agent benchmarks, automated generation quality, Practice module effectiveness, and Agent RL training results.
Figure: benchmark performance. Youtu-Agent achieves competitive results using only open-source models.
General agent benchmarks
The paper validates Youtu-Agent on standard benchmarks:
| Benchmark | Task Type | Result |
|---|---|---|
| WebWalkerQA | Multi-step web navigation | 71.47% |
| GAIA (text-only) | Real-world QA with tools | 72.8% |
WebWalkerQA (680 questions) evaluates multi-step web navigation and question answering. The agent must search, crawl, and synthesize information from real websites.
GAIA (466 questions) tests reasoning, web browsing, and tool-use proficiency. Youtu-Agent handles file attachments through document parsing and code execution tools.
Automated generation evaluation
The paper introduces AgentGen-80, a benchmark of 80 diverse task descriptions for evaluating automated agent generation:
| Mode | Configuration Validity (CV) | Tool Executability (TE) | Task Completion (TC) |
|---|---|---|---|
| Workflow | 100% | 81% | 65% |
| Meta-Agent | 99% | 83% | 69% |
Configuration Validity (CV): Whether the YAML is structurally valid. Workflow mode achieves 100% due to its deterministic pipeline.
Tool Executability (TE): Whether synthesized Python tools compile and run. Both modes land in the 81-83% range.
Task Completion (TC): End-to-end success. Meta-Agent mode edges ahead due to its ability to clarify ambiguous requirements.
Two optimization paths compared
Figure: two paths to optimization. The Practice module for low-cost improvement, Agent RL for maximum performance.
The Practice module and Agent RL serve different use cases:
| Aspect | Practice Module | Agent RL |
|---|---|---|
| Cost | ~$18 | $10,000+ |
| Samples needed | 100 | 10,000+ |
| AIME improvement | +2.7% to +5.4% | +22% to +35% |
| Infrastructure | API calls only | 128 GPU cluster |
| Best for | Quick iteration | Production systems |
Implementation blueprint
This section combines findings from the paper with practitioner guidance.
Note: As of publication, Youtu-Agent is described in the paper but no public release is available. The workflow below is a hypothetical reconstruction based on the paper’s architecture.
Hypothetical workflow
Based on the paper’s description, a Youtu-Agent workflow would look like:
```bash
# 1. Install framework (hypothetical)
pip install youtu-agent

# 2. Generate agent from a prompt
youtu generate "Research agent for daily AI papers"
# -> Generates config.yaml + synthesized tools

# 3. Practice (optional, ~$18)
youtu practice --config config.yaml --samples 100
# -> Updates config with learned experiences

# 4. Deploy
youtu run --config config.yaml
```
Recommended stack
| Component | Options | Notes |
|---|---|---|
| Environment | Playwright, E2B, Shell | Paper uses Playwright for web tasks, E2B for sandboxed code |
| Base Model | DeepSeek-V3 | Best price/performance ratio for this framework |
| Tool Protocol | MCP | Enables standardized tool integration |
| RL Framework | VeRL | Only needed if you have a GPU cluster |
Key parameters (from paper)
| Parameter | Value | Context |
|---|---|---|
| Practice epochs | 3 | Training-free GRPO learning cycles |
| Group size | 55 | Rollouts per problem during practice |
| Training temp | 0.7 | Higher diversity during learning |
| Inference temp | 0.3 | Lower variance during deployment |
| Training samples | 100 | Minimum for Practice module |
Common pitfalls
1. Hallucinated APIs in tool synthesis
With ~81% success rate, roughly 1 in 5 synthesized tools fail. Common causes include the LLM importing functions that don’t exist or using wrong argument signatures. Mitigation: always sandbox synthesized tools and wrap calls in try/except (see the sketch below).
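A minimal defensive wrapper, sketched under the assumption that the synthesized tool is a plain Python callable; real isolation (e.g. an E2B sandbox or container) is still advisable for untrusted code, since this only catches runtime failures:

```python
import traceback
from concurrent.futures import ThreadPoolExecutor, TimeoutError as ExecTimeout

def safe_tool_call(tool, *args, timeout=30, **kwargs):
    """Run a synthesized tool, returning an error string instead of crashing the agent."""
    try:
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(tool, *args, **kwargs).result(timeout=timeout)
    except ExecTimeout:
        return f"[tool error] {tool.__name__} timed out after {timeout}s"
    except Exception:
        return f"[tool error] {tool.__name__} raised:\n{traceback.format_exc()}"
```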
2. Context overflow in long tasks
Without the Context Manager, agents on multi-page web tasks will exceed context limits. Make sure to configure pruning for your specific task type.
3. Over-iteration in Practice
The Practice module shows diminishing returns after ~3 epochs. More iterations can lead to overfitting on the training problems without generalizing.
Practical applications
The benchmarks suggest several real-world use cases that align with the evaluated capabilities.
Research automation
Aligns with: WebWalkerQA (web navigation), GAIA (tool-use reasoning)
Automated daily paper collection, summarization, and PDF archival. The case study in the paper demonstrates exactly this workflow with a synthesized fetch_daily_papers tool.
Code-assisted problem solving
Aligns with: AIME benchmarks (mathematical reasoning with code interpreter)
Complex analytical tasks where the agent can write and execute Python code to verify solutions. The +35% improvement from RL training on AIME suggests strong potential for technical problem-solving.
Multi-step web automation
Aligns with: WebWalkerQA (71.47% accuracy on 680 questions)
Tasks requiring navigation across multiple websites, form filling, data extraction, and synthesis. The Context Manager’s ability to prune stale HTML makes long-horizon web tasks feasible.
Enterprise QA with attachments
Aligns with: GAIA text-only subset (72.8% accuracy)
Question answering over documents, PDFs, and structured data where the agent needs to parse files and reason across multiple sources.
Desktop automation (Tip)
The paper introduces Tip, an on-device desktop assistant built on Youtu-Agent. Key capabilities:
- Runs Youtu-Agent configs for bash and file operations
- Automatically captures screen context and surfaces user intent
- GUI automation with reusable “skills” and workflows
- Local model support for privacy-sensitive environments
This demonstrates practical deployment beyond benchmarks, targeting real desktop productivity tasks.
When to use Youtu-Agent (and when not to)
Use Youtu-Agent when:
- You need automated tool creation. Writing custom Python integrations manually is your bottleneck. The 81%+ tool synthesis success rate means 4 out of 5 tools work on first try.
- You want low-cost agent improvement. The Practice module’s $18 / 100 samples approach beats $10,000+ fine-tuning costs for modest gains (+2.7% to +5.4%).
- Your tasks involve web navigation or code execution. The benchmarks show strongest results on WebWalkerQA and AIME.
- You want YAML-based configuration. Declarative configs are easier to version, share, and auto-generate than Python orchestration code.
Consider alternatives when:
- Your agents don’t need continuous improvement. If your use case is static (same tools, same prompts), the optimization modules add unnecessary complexity.
- You need enterprise RL training but lack infrastructure. The Agent RL module requires up to 128 GPUs for the results shown. Without that scale, stick to the Practice module.
- Tool reliability is critical. With ~81% synthesis success, roughly 1 in 5 tools fail on first try. This may be unacceptable for production systems handling sensitive tasks.
- You prefer Python-native orchestration. If YAML configs feel limiting, frameworks like LangGraph offer more programmatic control.
Limitations
Tool Synthesis Failure Modes
With ~81% success, roughly 1 in 5 tool syntheses fail. The failures aren’t random. Common causes include:
- Hallucinated APIs: The model imports a library function that doesn’t exist or uses the wrong arguments.
- Syntax Errors: In complex logic, the generated Python code may not parse.
- Security Risks: Without a sandbox, synthesized code could delete files or access restricted network paths. Always review generated tools.
Practice module ceiling
In-context learning has limits. “Watching game tape” helps you play smarter, but it won’t turn a junior varsity player into LeBron James. For fundamental capability shifts, you still need full model training (RL or SFT).
Supervised Fine-Tuning trains a model on labeled examples of correct behavior. Unlike RL (which learns from rewards), SFT directly teaches the model “when you see X, output Y.” It’s faster but requires high-quality training data.
Infrastructure requirements for Agent RL
The paper’s RL experiments used up to 128 GPUs. While the Practice module runs on a single machine calling APIs, the Agent RL module demands enterprise-grade infrastructure. The 40% training speedup and the stability fixes assume a scale of compute that most teams don’t have access to.
Tested primarily on specific models
The paper’s experiments use DeepSeek-V3 and Qwen2.5. While the framework concepts are model-agnostic, the prompts, tool synthesis templates, and optimization parameters are tuned for these models. Adapting to other models may require additional calibration.
Paper: arXiv:2512.24615
Authors: Youtu-Agent Team (Tencent Youtu Lab, Fudan University, Xiamen University)
Cite this paper
Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu et al. (2025). Youtu-Agent: Scaling Agent Productivity with Automated Generation and Continuous Optimization. arXiv:2512.24615.