- The Core Insight. Scientific discovery requires iterating on what to optimize, not just how to optimize. SAGA automates the objective evolution loop that makes human scientists effective.
- Bi-Level Architecture. The outer loop (LLM agents) analyzes results and refines objectives; the inner loop (optimizer) maximizes the current objectives. This separation prevents reward hacking by catching and correcting systematic failures.
- Real Results. 15 novel stable magnet structures (vs. MatterGen's 11), up to a 176% improvement in DNA enhancer design, and antibiotic candidates that balance activity with drug-likeness where baselines fail on one or both.
Research Overview
Most AI-driven science research has focused on a deceptively simple question: given an objective function, how do we find optimal solutions? But this framing misses something fundamental. Scientific discovery requires iterating on what to optimize, not just how to optimize.
SAGA (Scientific Autonomous Goal-evolving Agent) addresses this gap with a bi-level framework that automates the evolution of objectives themselves. Instead of treating objective design as a one-time human decision, SAGA makes it a dynamic, autonomous discovery process.
Think of it like a Manager and Worker relationship:
- The Inner Loop (Worker) tries to solve the problem as fast as possible given specific instructions
- The Outer Loop (Manager) reviews the results and changes the instructions if they're not producing what's actually needed
This division mirrors how human scientists work. SAGA automates both roles.
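To make the analogy concrete, here is a toy sketch of that division of labor. Everything in it (the random search, the quadratic penalty, the "extremeness" check) is an illustrative stand-in, not SAGA's actual code:

```python
# Toy illustration of the Manager/Worker split (not SAGA's actual code):
# the Worker maximizes whatever score it is handed over a toy design space,
# and the Manager revises that score when the winners look pathological,
# a miniature version of catching and correcting reward hacking.
import random

def worker(objective, n_samples=5000, dim=3):
    """Inner loop: blindly maximize the current objective."""
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        x = [random.uniform(-10, 10) for _ in range(dim)]
        s = objective(x)
        if s > best_score:
            best, best_score = x, s
    return best, best_score

def manager(rounds=4):
    """Outer loop: inspect results and change the instructions if needed."""
    penalty = 0.0
    for _ in range(rounds):
        objective = lambda x, p=penalty: sum(x) - p * sum(v * v for v in x)
        best, _ = worker(objective)
        if max(abs(v) for v in best) > 5.0:  # "diagnosis": solutions are extreme
            penalty += 1.0                   # "revision": penalize extremeness
        else:
            break                            # objective now matches the intent
    return best, penalty

print(manager())
```

The Worker never learns what "extreme" means; it just maximizes. The Manager never searches; it only inspects outcomes and edits the objective.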
How Scientific Discovery Works Today
The standard workflow for AI-assisted scientific discovery:
- Define - A scientist specifies an objective function ("maximize binding affinity to target protein")
- Optimize - An AI system searches for solutions that score high on that objective
- Review - The scientist examines top candidates, often finding systematic problems
- Reformulate - Based on what went wrong, the scientist manually adjusts the objective
- Repeat - The cycle continues until results are satisfactory or time runs out
Each iteration requires expert review. A drug discovery campaign might need dozens of objective refinements over months. The scientist's bandwidth becomes the limiting factor, not the optimizer's speed.
This is where most AI-for-science efforts stop. They build faster optimizers, better search algorithms, more accurate predictors. But the objective refinement loop stays manual.
For grand challenges in science, objective functions are only imperfect proxies for what we actually want. A drug needs to be effective, but also synthesizable, stable, and safe. Optimizing any single metric often leads to reward hacking: solutions that score well but fail in practice. SAGA's key contribution is automating the objective refinement loop that human scientists perform intuitively.
The Problem with Fixed Objectives
Consider antibiotic design. You might optimize for activity against a target bacterium. But a model optimizing purely for activity will discover molecules that are:
- Chemically unrealistic
- Impossible to synthesize
- Metabolically unstable
- Toxic to human cells
Human scientists iterate: "The activity is good, but we need to add constraints for drug-likeness." Then: "Now it's drug-like, but we need synthesizability." This objective evolution is where the real scientific insight happens. It's exactly what SAGA automates.
Key Results at a Glance
| Domain | SAGA Result | vs. Baseline |
|---|---|---|
| Antibiotic Design | Best activity + drug-likeness balance | Baselines fail on one or both |
| Magnet Discovery | 15 novel stable structures | MatterGen: 11 structures |
| DNA Enhancers | Up to 176% improvement | At least 48% better specificity |
| Chemical Processes | Balanced multi-objective | Baseline RL reward-hacks |
Why This Paper Matters
Closing the Objective Evolution Gap
There has been unprecedented interest in developing agents that expand the boundary of scientific discovery by optimizing quantitative objective functions. But these objectives are only imperfect proxies for what scientists actually want. Automating objective function design is a central, yet unmet requirement for scientific discovery agents.
SAGA closes this gap. It mimics the two modes of scientific thinking:
- Thinking fast (inner loop): Explore all reachable solutions given specific objectives
- Thinking slow (outer loop): Evolve objectives and preferences based on full optimization results
This is how human scientists work. They don't just run optimizers. They analyze results, identify gaps in their objective formulation, and refine their goals. SAGA makes this process autonomous.
Beyond Reward Hacking
A persistent problem in AI optimization is reward hacking: finding solutions that score well on the objective but fail in practice. SAGA addresses this by treating objectives as hypotheses to be tested and refined, not fixed targets.
When SAGA's analyzer examines optimization results and finds that high-scoring solutions have systematic problems (e.g., chemically unrealistic molecules), it proposes new objectives to address those gaps. This dynamic refinement is fundamentally different from static multi-objective optimization.
The SAGA Framework
SAGA employs a bi-level architecture with four key components:
[Figure: SAGA bi-level framework. The outer loop evolves objectives; the inner loop optimizes.]
Outer Loop: Objective Evolution
1. Planner: Receives the Analyzer's diagnostic report and proposes new or refined objectives. It reads what failed, identifies which constraints are missing, and formulates specific additions: "add a synthesizability penalty" or "include metabolic stability." The Planner doesn't look at raw data; it interprets the Analyzer's findings and decides what to change.
2. Implementer: Converts proposed objectives into executable scoring functions by writing actual Python code. For chemistry tasks, it wraps tools like RDKit for molecular property calculations or docking simulators for binding affinity. For materials, it interfaces with DFT (density functional theory) calculators. The Implementer researches available computational tools, writes the integration code, and deploys it in Docker containers. This turns abstract goals like "make it stable" into concrete mathematical scores the optimizer can maximize (a minimal sketch of such a scoring function appears below).
3. Analyzer: Examines optimization results and produces a structured diagnosis: What worked? What failed systematically? Where is reward hacking occurring? Its output (a detailed report of gaps between high-scoring solutions and practical requirements) becomes the direct input to the Planner for the next iteration.
Inner Loop: Solution Optimization
4. Optimizer: Employs established algorithms (genetic algorithms, reinforcement learning, LLM-based evolution) to maximize the currently specified objectives. This is the "thinking fast" component that efficiently explores the solution space.
The key insight is separation of concerns. The inner loop doesn't need to understand scientific goals. It just optimizes whatever objectives it's given. The outer loop doesn't need to search solution spaces. It just analyzes results and refines objectives. This modularity allows each component to be specialized and effective.
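To make the Implementer's role concrete, here is a hedged sketch of the kind of scoring function it might emit for the antibiotic task, assuming RDKit is installed. The activity predictor and the weights are hypothetical placeholders, not taken from the paper:

```python
# Sketch of an Implementer-generated scoring function (illustrative only;
# the activity model and the weights are hypothetical placeholders).
from rdkit import Chem
from rdkit.Chem import QED

def predict_activity(mol) -> float:
    """Placeholder for a learned activity model (e.g., vs. K. pneumoniae)."""
    return 0.5  # stand-in value; a real Implementer would wrap a trained model

def composite_score(smiles: str,
                    w_activity: float = 1.0,
                    w_druglike: float = 0.5) -> float:
    """Combine predicted activity with an RDKit drug-likeness term (QED)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                 # unparsable molecules score worst
        return float("-inf")
    activity = predict_activity(mol)
    druglike = QED.qed(mol)         # 0..1 quantitative estimate of drug-likeness
    return w_activity * activity + w_druglike * druglike

print(composite_score("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin as a smoke test
```

The point is only the shape of the artifact: a plain function mapping a candidate to a scalar that the inner loop can maximize, which the Implementer generates and containerizes automatically.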
The Objective Evolution Cycle
- Initialize: Start with basic objectives derived from the scientific goal
- Optimize: Inner loop (Optimizer) finds solutions that maximize current objectives
- Analyze: Analyzer examines results, produces diagnostic report of systematic failures
- Plan: Planner reads the Analyzer's report, proposes new or modified objectives
- Implement: Implementer converts new objectives into executable scoring functions
- Repeat: Continue until objectives stabilize or resources are exhausted
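A structural sketch of this cycle, with hypothetical class and method names standing in for SAGA's actual interfaces, might look like the following:

```python
# Structural skeleton of the objective-evolution cycle (hypothetical names;
# SAGA's actual interfaces may differ).
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    ok: bool
    notes: list = field(default_factory=list)  # e.g. "high scorers fail QED"

class Optimizer:                                # inner loop ("thinking fast")
    def run(self, objective, budget=1000):
        # stand-in: a GA / RL / LLM-evolution search would run here
        return [{"design": "candidate", "score": objective("candidate")}]

class Analyzer:                                 # outer loop: diagnose results
    def diagnose(self, results) -> Diagnosis:
        return Diagnosis(ok=False, notes=["placeholder systematic failure"])

class Planner:                                  # outer loop: propose revisions
    def propose(self, diagnosis):
        return ["add constraint addressing: " + n for n in diagnosis.notes]

class Implementer:                              # outer loop: objectives -> code
    def build(self, proposals):
        return lambda design: 0.0               # stand-in scoring function

def saga_cycle(rounds=3):
    opt, ana, plan, impl = Optimizer(), Analyzer(), Planner(), Implementer()
    objective = lambda design: 0.0              # 1. Initialize
    for _ in range(rounds):
        results = opt.run(objective)            # 2. Optimize
        diagnosis = ana.diagnose(results)       # 3. Analyze
        if diagnosis.ok:
            break                               # objectives have stabilized
        proposals = plan.propose(diagnosis)     # 4. Plan
        objective = impl.build(proposals)       # 5. Implement, then repeat
    return results

saga_cycle()
```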
Three Automation Levels
SAGA operates at three levels, allowing scientists to choose their degree of involvement:
[Figure: Three automation levels, from human-guided to fully autonomous discovery.]
Co-Pilot Mode
Scientists collaborate with the planner and analyzer. The implementer and optimizer run autonomously.
Best for: Early-stage exploration where domain expertise is crucial for objective formulation.
Semi-Pilot Mode
Human feedback is limited to the analyzer stage. The planner operates independently based on analyzer outputs.
Best for: Scaling up discovery once the objective space is reasonably understood.
Autopilot Mode
All four modules operate fully autonomously without human intervention.
Best for: High-throughput screening across large design spaces where human review of every iteration is impractical.
The paper finds that co-pilot mode often achieves the best results on novel problems, while autopilot mode excels at scaling known approaches. Semi-pilot offers a middle ground where human insight guides analysis without bottlenecking the iteration cycle.
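A minimal way to picture the three modes is as a mapping of which stages receive human review (an illustrative mapping, not SAGA's configuration format):

```python
# Which stages a human reviews in each mode (illustrative mapping only).
HUMAN_REVIEW = {
    "co-pilot":   {"planner": True,  "analyzer": True,  "implementer": False, "optimizer": False},
    "semi-pilot": {"planner": False, "analyzer": True,  "implementer": False, "optimizer": False},
    "autopilot":  {"planner": False, "analyzer": False, "implementer": False, "optimizer": False},
}

def needs_human(mode: str, stage: str) -> bool:
    """Return True if the given stage waits for human feedback in this mode."""
    return HUMAN_REVIEW[mode][stage]

assert needs_human("co-pilot", "planner") and not needs_human("autopilot", "analyzer")
```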
Results Across Four Domains
[Figure: SAGA results across scientific domains; performance improvement over baselines (100% = baseline).]
Antibiotic Design
Target: Design antibiotics effective against drug-resistant Klebsiella pneumoniae.
Challenge: Baselines either fail to optimize activity OR achieve high activity with chemically unrealistic molecules. The AlphaEvolve baseline shows a "catastrophic drop in medicinal chemistry quality" despite high activity scores. It found molecules that would theoretically kill bacteria but couldn't exist as stable compounds, would never survive metabolism, or would be toxic to human cells. High scores, useless candidates.
SAGA's Approach: Dynamically adds objectives like synthesizability penalties and metabolic stability filters based on analyzing population-level trends. When the analyzer detects that high-activity molecules fail drug-likeness thresholds, it proposes additional objectives.
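The population-level check described above might look roughly like this; the field names, the QED cutoff, and the failure-rate threshold are all hypothetical:

```python
# Sketch of the kind of population-level check the Analyzer might run
# (field names and thresholds are hypothetical).
def diagnose_druglikeness(top_candidates, qed_cutoff=0.5, max_fail_rate=0.3):
    """Flag a systematic gap when too many high-activity molecules fail QED."""
    failures = [c for c in top_candidates if c["qed"] < qed_cutoff]
    fail_rate = len(failures) / max(len(top_candidates), 1)
    if fail_rate > max_fail_rate:
        return {"issue": "high-activity molecules are not drug-like",
                "fail_rate": fail_rate,
                "proposal": "add a drug-likeness term (e.g. QED) to the objective"}
    return None

# Example: 3 of 4 top scorers fail the cutoff, so a new objective is proposed.
report = diagnose_druglikeness([
    {"smiles": "...", "activity": 0.9, "qed": 0.2},
    {"smiles": "...", "activity": 0.8, "qed": 0.3},
    {"smiles": "...", "activity": 0.8, "qed": 0.7},
    {"smiles": "...", "activity": 0.7, "qed": 0.1},
])
print(report)
```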
Result: SAGA achieves the best balance between biological activity and drug-likeness. Discovered molecules occupy diverse chemical regions distinct from over 500 known antibiotics, suggesting genuinely novel candidates rather than minor variations.
Inorganic Materials Design
Target: Discover stable structures for permanent magnets with low supply-chain risk, and superhard materials.
Magnet Design Results:
- Co-pilot mode: 15 novel stable structures within 200 DFT calculations
- MatterGen baseline (Microsoft Research's leading generative model for materials): 11 structures
- This 36% improvement over a leading industry model comes from SAGA's ability to refine stability objectives based on DFT feedback
Superhard Materials Results:
- All SAGA modes outperform TextGrad baseline across five normalized metrics
- Over 90% of proposed crystals contain light elements (boron, carbon, nitrogen, oxygen), aligning with experimental evidence that these elements are essential for hardness
- The analyzer learned to emphasize light element composition after seeing that heavier-element candidates failed hardness thresholds
Functional DNA Sequence Design
Target: Design cell-type-specific enhancers for HepG2 liver cells.
Results:
- Surpasses baselines by 19% to 176% on average, depending on the baseline
- MPRA specificity improvement: at least 48%
- Motif enrichment improvement: at least 47%
- Recovers multiple liver-specific transcription factor motifs
Why SAGA Won: The analyzer identified that baseline objectives optimized for enhancer activity without penalizing off-target activation in other cell types. SAGA added specificity objectives that the baselines lacked.
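One plausible form of such a specificity objective is sketched below; the off-target cell types, the activity values, and the on-minus-worst-off-target margin are illustrative assumptions, not the paper's exact formulation:

```python
# Sketch of a cell-type specificity objective (illustrative form only).
def specificity_score(predicted_activity: dict, target: str = "HepG2") -> float:
    """Reward on-target enhancer activity minus the worst off-target activity."""
    on_target = predicted_activity[target]
    off_target = max(v for cell, v in predicted_activity.items() if cell != target)
    return on_target - off_target

# A sequence strongly active in HepG2 but quiet elsewhere scores well.
print(specificity_score({"HepG2": 0.92, "K562": 0.15, "GM12878": 0.22}))
```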
Chemical Process Design
Target: Design optimal flowsheets for chemical separation processes.
Problem Discovered: Baseline RL optimizing only for product purity leads to unnecessarily complex flowsheets. The optimizer learns to add unit operations that have no separation effect but don't hurt purity. This is a classic form of reward hacking.
SAGA's Solution: Autonomously adds objectives for capital costs and material flow intensity. This prevents the reward hacking by penalizing unnecessary complexity.
Result: Significantly improved process performance while maintaining near-ideal product purity. The final designs are both effective and economically practical.
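A rough sketch of the kind of composite reward that closes this loophole follows; the weights, the cost figures, and the unit-count penalty are hypothetical placeholders:

```python
# Sketch of a composite flowsheet reward that discourages reward hacking
# (weights and cost model are hypothetical placeholders).
def flowsheet_reward(purity: float, n_units: int, capital_cost: float,
                     w_units: float = 0.02, w_cost: float = 1e-7) -> float:
    """Purity alone lets 'free' extra units slip in; the penalties remove that."""
    return purity - w_units * n_units - w_cost * capital_cost

# Two flowsheets with near-identical purity: the simpler, cheaper one now wins.
print(flowsheet_reward(purity=0.99, n_units=4, capital_cost=2.0e6))
print(flowsheet_reward(purity=0.99, n_units=9, capital_cost=5.5e6))
```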
Business Implications
For Pharmaceutical Companies
Accelerated Lead Optimization: SAGA's ability to balance multiple drug-likeness objectives simultaneously could significantly reduce the time from hit to lead compound. The antibiotic results suggest it can find candidates that pass multiple filters in early stages rather than failing late in development.
Reduced Wet Lab Iterations: By refining objectives computationally, SAGA can eliminate many candidates that would otherwise consume wet lab resources before being rejected for drug-likeness issues.
For Materials Science R&D
Faster Discovery Cycles: The 36% improvement over MatterGen in stable structure discovery translates directly to faster time-to-candidate for new materials.
Supply Chain Awareness: SAGA's ability to incorporate supply chain risk objectives (demonstrated in magnet design) addresses a growing concern for materials-dependent industries.
For Biotech Startups
Accessible High-Throughput Design: The three automation levels mean smaller teams can leverage SAGA in co-pilot mode with their domain expertise, scaling to semi-pilot or autopilot as they build confidence in the objective space.
IP Differentiation: SAGA's ability to find genuinely novel chemical scaffolds (as shown in antibiotic design) could strengthen patent positions compared to derivative approaches.
For Scientific AI Tooling
Framework for Agent Design: SAGA's bi-level architecture provides a template for building scientific agents in other domains. The separation of objective evolution from solution optimization is broadly applicable.
Benchmark for Future Work: The four-domain evaluation establishes benchmarks that future scientific AI systems can be measured against.
Limitations
Computational Verification Dependency
SAGA relies on computationally verifiable objectives. For problems where results cannot be validated computationally (e.g., certain biological assays), SAGA needs extensions:
- Human-in-the-loop scoring: Scientists provide feedback on candidates
- Lab-in-the-loop integration: Autonomous labs run experiments to evaluate candidates
Predefined Design Spaces
Current high-level goals predefine the design space. SAGA cannot flexibly adjust the search space from abstract goals alone. If the initial design space excludes the optimal region, SAGA won't find it.
Iteration Costs
The outer loop involves LLM calls for planning and analysis, plus potentially expensive objective implementation. For very fast inner-loop optimizers, the outer-loop overhead may dominate. The paper doesn't provide detailed cost breakdowns.
That said, LLM inference costs have dropped by orders of magnitude over the past two years and continue to fall. As this trend continues, the outer-loop overhead becomes less significant, making SAGA increasingly viable for cost-sensitive applications.
Generalization Unknown
Results are demonstrated on four carefully chosen domains. Whether SAGA's objective evolution strategy transfers to fundamentally different scientific problems remains to be established.
Conclusion
SAGA changes how AI-driven scientific discovery works. By automating the evolution of objectives (not just the optimization of solutions), it addresses a fundamental bottleneck that has limited previous approaches.
Key Takeaways:
- Objective evolution is the key: Iterating on what to optimize matters more than faster optimization of fixed objectives
- Bi-level architecture works: Separating "thinking fast" (optimization) from "thinking slow" (objective refinement) enables effective automation
- Three automation levels: Co-pilot, semi-pilot, and autopilot modes allow appropriate human involvement for different scenarios
- Cross-domain validation: Results across antibiotics, materials, DNA, and chemical processes demonstrate broad applicability
- Addresses reward hacking: Dynamic objective refinement catches and corrects systematic optimization failures
For practitioners, SAGA suggests a new approach to scientific AI: focus less on building better optimizers and more on automating the objective refinement loop that makes human scientists effective.
Original paper: arXiv ・ PDF ・ HTML
Authors: Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, Jan G. Rittig, Kunyang Sun, Yikun Zhang, Zhangde Song, Bo Zhou, Cassandra Masschelein, Yingze Wang, Haorui Wang, Haojun Jia, Chao Zhang, Hongyu Zhao, Martin Ester, Teresa Head-Gordon, Carla P. Gomes, Huan Sun, Chenru Duan, Philippe Schwaller, Wengong Jin
Institutions: Cornell University, Ohio State University, Yale University, MIT, EPFL, and others
Cite this paper
Yuanqi Du, Botao Yu, Wengong Jin (2025). SAGA: Autonomous Goal-Evolving Agents for Scientific Discovery. arXiv 2025.