- The Problem. Multi-turn agent training wastes signal: when a 10-turn conversation fails, current methods assign the same penalty to all turns, even if only turn 7 was wrong.
- The Solution. SWEET-RL trains a critic that sees the ground-truth solution during training, enabling it to assign step-level rewards. The agent never sees the solution, keeping deployment realistic.
- The Results. Llama-3.1-8B matches GPT-4o (40.4% success) and beats O1-Mini (30.3%) on collaborative coding. 6% absolute gain over the previous best method. Official code from Meta AI.
Illustrative cost comparison at 10M tokens/month:
- Before (GPT-4o API): 10M tokens × $0.01/token = $100K/month
- After (8B model + SWEET-RL): 10M tokens × $0.001/token = $10K/month
- Result: $90K saved monthly (a 90% reduction) at the same 40.4% task success rate
Research Overview
If you have trained a multi-turn agent, you know this frustration: the agent has a productive 9-turn conversation, then says something wrong in turn 10, and the whole trajectory gets marked as failure. All 9 good turns? Penalized. The training signal is wasted.
This is the credit assignment problem for multi-turn agents. Current methods like DPO compare entire trajectories: this 10-turn conversation succeeded, that one failed. But they cannot pinpoint which specific turn made the difference. It is like grading a student's exam by looking only at the final answer, not the work.
Direct Preference Optimization (DPO) is a popular method for fine-tuning LLMs on preference data. Instead of training a separate reward model, DPO directly optimizes the policy using pairs of "chosen" (better) and "rejected" (worse) responses. It is simpler than RLHF but struggles with multi-turn credit assignment.
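As a refresher, here is a minimal sketch of the DPO loss on a single chosen/rejected pair. It assumes you already have summed log-probabilities of each response under the policy and a frozen reference model; the function and argument names are illustrative, not from the SWEET-RL repo.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    """
    # Implicit reward of each response: beta * (log pi_policy - log pi_ref)
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected rewards
    return -F.logsigmoid(chosen_reward - rejected_reward)

# Example with dummy log-probabilities
loss = dpo_loss(torch.tensor(-12.3), torch.tensor(-15.9),
                torch.tensor(-13.0), torch.tensor(-14.8))
```

Because the comparison happens at the level of whole responses (or whole trajectories, in the multi-turn case), the loss by itself cannot say which turn caused the failure.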
SWEET-RL (Step-WisE Evaluation from Training-time information) solves this with an asymmetric actor-critic design. The key insight: during training, you often have access to the "right answer" (reference code, target output, solution). Why not let the critic see it?
The critic sees both the conversation history AND the solution. This lets it judge whether each turn moves toward or away from the goal. The actor (the agent you deploy) never sees the solution, keeping everything realistic.
The result? Llama-3.1-8B trained with SWEET-RL matches GPT-4o on collaborative coding tasks (40.4% vs 40.4%) and beats O1-Mini (30.3%). A 6% absolute improvement over Multi-Turn DPO, the previous best method.
The Credit Assignment Problem
Consider this 5-turn coding collaboration where the agent makes good progress but introduces a bug in turn 4:
Before vs After: How Training Signal Changes
| Turn | Action | Traditional (Multi-Turn DPO) | SWEET-RL |
|---|---|---|---|
| 1 | Correct function signature | -1 (penalized) | +0.8 (rewarded) |
| 2 | Core logic implemented | -1 (penalized) | +0.9 (rewarded) |
| 3 | Edge cases handled | -1 (penalized) | +0.7 (rewarded) |
| 4 | Bug introduced | -1 (penalized) | -0.6 (penalized) |
| 5 | Failed to fix bug | -1 (penalized) | -0.8 (penalized) |
| Result | Tests fail | All turns punished equally | Good turns reinforced, bad turns corrected |
With trajectory-level feedback (Traditional), turns 1-3 are penalized despite being correct. The model learns "this whole approach was wrong" instead of "the bug was introduced in turn 4." You need far more examples to distinguish good patterns from bad ones when the signal is diluted this way.
Why step-level feedback matters:
- Faster learning: Every turn provides useful gradient, not just the final outcome
- Better generalization: Good patterns are reinforced even in failed trajectories
- Clearer debugging: You can trace exactly which steps the model struggles with
SWEET-RL Architecture
SWEET-RL operates in two stages:
Stage 1: Train the Critic
The critic learns to assign advantages (how good is this action?) to each turn. It trains on preference pairs from offline trajectories:
- Take two trajectories for the same task
- Label the one with higher cumulative reward as "chosen"
- Train the critic to prefer chosen trajectories using Bradley-Terry objective
The Bradley-Terry objective is a probabilistic model for pairwise comparisons. Given two items, it predicts the probability that one is preferred over the other based on their latent scores. The loss encourages the model to assign a higher score to the "chosen" trajectory; it is the same objective used to train RLHF reward models.
The key: the critic sees training-time information (the reference solution) that the actor never sees.
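A minimal sketch of the Bradley-Terry preference loss used to train the critic (illustrative code, not the official implementation; where the scalar trajectory scores come from is abstracted away here):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_score: torch.Tensor,
                       rejected_score: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry objective: P(chosen > rejected) = sigmoid(s_c - s_r).

    Minimizing the negative log-likelihood pushes the critic to score
    the chosen trajectory higher than the rejected one.
    """
    return -F.logsigmoid(chosen_score - rejected_score)

# Illustrative usage: scores come from the critic, which sees both the
# conversation and the ground-truth solution (training-time information).
chosen_score = torch.tensor(1.7)    # critic score for higher-reward trajectory
rejected_score = torch.tensor(0.4)  # critic score for lower-reward trajectory
loss = bradley_terry_loss(chosen_score, rejected_score)
```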
Figure: SWEET-RL's asymmetric actor-critic architecture. The critic uses privileged information to assign step-level rewards. During training, conversation turns flow to both the Critic (which also receives the ground-truth solution) and the Actor (which only sees the history). The Critic produces step-level advantage scores that guide the Actor's learning; at deployment, only the Actor runs and the Critic is discarded.
Stage 2: Train the Actor
With the trained critic, optimize the actor policy:
- At each turn, sample 16 candidate responses
- Rank them by critic advantage scores
- Top 50% become "chosen", bottom 50% become "rejected"
- Apply DPO loss on these turn-level preferences
This converts the trajectory-level signal into step-level signal, enabling much more efficient learning.
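A minimal sketch of how turn-level preferences might be built from critic scores. The `actor.sample` and `critic.advantage` helpers are hypothetical stand-ins; the actual pipeline lives in the official repo.

```python
def build_turn_preferences(actor, critic, history, solution,
                           num_candidates=16):
    """Sample candidates for one turn, rank them with the critic,
    and split into chosen/rejected halves for turn-level DPO."""
    candidates = [actor.sample(history) for _ in range(num_candidates)]

    # The critic sees the history AND the ground-truth solution;
    # the actor only ever sees the history.
    scored = sorted(candidates,
                    key=lambda c: critic.advantage(history, c, solution),
                    reverse=True)

    half = num_candidates // 2
    chosen, rejected = scored[:half], scored[half:]

    # Pair each chosen response with a rejected one for the DPO loss.
    return list(zip(chosen, rejected))
```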
Critic Parameterization
The advantage function uses a clever parameterization:
A(observation, action, context) =
  (1/L) * sum_i log( π_trained(a_i | prefix) / π_reference(a_i | prefix) )
Where L is the number of tokens in the action (for normalization), a_i is its i-th token, and "prefix" is everything the model has seen before that token. This is the mean per-token log probability ratio between the trained and reference models.
Without the 1/L normalization, agents degenerate to generating minimal responses (shorter = higher advantage). The paper shows success rate drops from 40.4% to 3.6% without normalization.
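A minimal sketch of this length-normalized advantage, assuming both models expose per-token log-probabilities for the action tokens (helper and variable names are illustrative):

```python
import torch

def advantage(trained_token_logps: torch.Tensor,
              ref_token_logps: torch.Tensor) -> torch.Tensor:
    """Mean per-token log-probability ratio between the trained critic
    and the frozen reference model for one action.

    Dividing by the number of tokens (taking the mean) is the length
    normalization; without it, shorter responses get inflated scores.
    """
    L = trained_token_logps.shape[0]
    return (trained_token_logps - ref_token_logps).sum() / L

# Example with dummy per-token log-probabilities for a 5-token action
trained = torch.tensor([-1.2, -0.8, -2.1, -0.5, -1.0])
ref = torch.tensor([-1.5, -1.1, -2.0, -0.9, -1.3])
adv = advantage(trained, ref)
```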
The Asymmetric Trick
Think of SWEET-RL as a teacher (critic) with an answer key grading a student's (actor's) work. The teacher sees the correct solution and can tell the student "Your approach in step 3 is leading you astray" without revealing the answer. The student learns which steps work and which don't, without ever seeing the answer key. At test time, the student solves problems independently.
The core innovation is asymmetric information between actor and critic:
| Component | Sees History | Sees Solution | Purpose |
|---|---|---|---|
| Actor | Yes | No | Generate responses |
| Critic | Yes | Yes | Judge quality |
Why this works:
- The critic can make informed judgments because it knows where the conversation should go
- The actor learns from these judgments without ever seeing the solution
- At deployment, the actor works without any privileged information
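Concretely, the asymmetry comes down to what goes into each model's prompt. A minimal sketch (the prompt layout is illustrative, not the paper's exact format): the critic's input appends the reference solution, while the actor's input contains only the conversation.

```python
def build_actor_prompt(history: list[str]) -> str:
    """The deployed agent only ever sees the conversation history."""
    return "\n".join(history)

def build_critic_prompt(history: list[str], solution: str) -> str:
    """During training, the critic also receives the ground-truth
    solution (training-time information) so it can judge each turn."""
    return "\n".join(history) + f"\n\n[REFERENCE SOLUTION]\n{solution}"

history = ["User: Write a function that reverses a string.",
           "Agent: def reverse(s): return s[::-1]"]
solution = "def reverse(s):\n    return s[::-1]"

actor_input = build_actor_prompt(history)             # no privileged info
critic_input = build_critic_prompt(history, solution) # privileged info
```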
Training-Time Information by Industry
The "privileged information" the critic sees varies by domain. Here is what it looks like across industries:
| Industry | Training-Time Info (Critic Sees) | Success Metric | Typical Data Source |
|---|---|---|---|
| Software Dev | Reference code, test cases | Unit test pass rate | GitHub repos, internal codebase |
| Customer Support | Expert resolution transcript | Resolution rate, CSAT | Ticket history, QA samples |
| Legal Tech | Verified contract clauses | Compliance score | Reviewed documents |
| Healthcare | Clinical guidelines, diagnosis | Accuracy vs specialist | Medical records (de-identified) |
| E-commerce | Purchase completion path | Conversion rate | Transaction logs |
| Education | Correct solution steps | Quiz performance | Tutor session recordings |
The key requirement: you must have verifiable ground truth during training. At deployment, the agent never sees this information.
ColBench Benchmark
The paper introduces ColBench, a benchmark for multi-turn collaborative agents:
Tasks
| Task | Description | Turns | Metric |
|---|---|---|---|
| Backend Programming | Write Python functions through conversation | 10 | Unit test pass rate |
| Frontend Design | Create HTML pages through conversation | 10 | CLIP similarity to reference |
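The backend metric (unit-test pass rate) can be checked mechanically. Here is a minimal sketch of such a check, assuming each task asks for a single function named `solution` (illustrative only; ColBench's actual harness ships with the repository, and a real harness should sandbox execution rather than call exec()).

```python
def passes_unit_tests(candidate_code: str, test_cases: list[tuple]) -> bool:
    """Execute candidate code and check it against (args, expected) pairs."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)          # define the function
        fn = namespace["solution"]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False

# Example: a task whose reference behavior is doubling a number
code = "def solution(x):\n    return 2 * x"
tests = [((1,), 2), ((5,), 10)]
print(passes_unit_tests(code, tests))  # True
```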
Scale
- Training: 10K+ backend tasks, 10K frontend tasks
- Testing: 1K backend, 500 frontend (manually inspected)
- Offline trajectories: 15K backend, 6K frontend
Why ColBench?
- Sufficient complexity: Tasks require multi-step reasoning, not single-turn solutions
- Minimal engineering overhead: LLM simulators with ground-truth artifacts, no complex environments
- Procedural generation: Prevents overfitting to specific examples
The benchmark is open-source and included in the official repository.
Benchmark Results
Figure: ColBench results for Llama-3.1-8B-Instruct. SWEET-RL achieves a 6% gain over Multi-Turn DPO, matching GPT-4o.
Main Results (Llama-3.1-8B-Instruct)
| Method | Backend Success | Frontend Win Rate |
|---|---|---|
| Zero-Shot | 22.4% | 33.8% |
| Rejection Fine-Tuning | 28.2% | 38.6% |
| Multi-Turn DPO | 34.4% | 42.8% |
| SWEET-RL | 40.4% | 48.2% |
SWEET-RL achieves 6% absolute improvement over Multi-Turn DPO.
Comparison to Larger Models
| Model | Backend Success |
|---|---|
| O1-Mini | 30.3% |
| Llama-3.1-70B Zero-Shot | 35.0% |
| GPT-4o | 40.4% |
| Llama-8B + SWEET-RL | 40.4% |
An 8B model matches GPT-4o and beats O1-Mini. The 8B model with SWEET-RL also matches the 70B model with zero-shot prompting.
Off-Policy Learning (70B Model)
Off-policy learning is a training setup where the data comes from a different policy (e.g., a smaller model) than the one being optimized. It lets you improve a target model using trajectories collected elsewhere, which is critical in production when you want to reuse existing logs.
SWEET-RL's off-policy capability matters for production: collect trajectories with your fast, cheap 8B model in production, then use that data to train your premium 70B model. You get cheap data collection at scale plus a high-quality model for critical tasks. Imitation-based approaches such as rejection fine-tuning handle this poorly, as the results below show.
SWEET-RL works even when training a larger model on data from a smaller model:
| Method | Success Rate |
|---|---|
| Zero-Shot | 35.0% |
| Rejection Fine-Tuning | 31.9% (worse!) |
| Multi-Turn DPO | 41.8% |
| SWEET-RL | 45.6% |
Rejection Fine-Tuning actually hurts performance because it forces the 70B model to imitate suboptimal 8B trajectories word-by-word. SWEET-RL avoids this by using step-level rewards instead of direct imitation.
Ablation Studies
What Makes SWEET-RL Work?
| Configuration | Success Rate |
|---|---|
| SWEET-RL (full) | 40.4% |
| Without training-time info | 31.2% |
| With regression head | 36.2% |
| Without normalization | 3.6% |
Three critical components:
- Training-time information: Without it, the critic cannot judge step quality. 9 percentage points lost.
- Log probability parameterization: Using a regression head instead loses 4 points. LLMs generalize better with their native output format.
- Length normalization: Without it, the model degenerates completely.
Credit Assignment Comparison
The paper compares different ways to assign step-level rewards:
| Method | Best-of-16 Success |
|---|---|
| SWEET-RL | Best scaling |
| LLM-as-Judge | Limited improvement |
| Value function head | Poor generalization |
| No training-time info | Significantly degraded |
LLM-as-Judge (using another LLM to rate responses) gets distracted by length and format rather than actual quality. Value function heads (standard in RL) generalize poorly to new tasks.
Data Scaling
SWEET-RL requires more initial data to train a reliable critic:
- At 3K samples: Multi-Turn DPO outperforms SWEET-RL
- At 6K+ samples: SWEET-RL catches up and surpasses
- At 15K samples: SWEET-RL significantly ahead
If you have limited data, Multi-Turn DPO may be a better choice. With sufficient data (6K+), SWEET-RL is clearly superior.
Implementation Blueprint
When to Use SWEET-RL
Use SWEET-RL when you have:
- Multi-turn tasks (3+ turns of interaction)
- Ground-truth artifacts for training (reference code, target outputs, solution paths)
- Sufficient data (6K+ trajectory pairs, more is better)
- Clear success criteria (unit tests, similarity scores, verifiable outcomes)
Do not use SWEET-RL for:
- Single-turn tasks (use regular DPO)
- Tasks without ground truth (use RLHF or LLM-as-Judge)
- Limited data scenarios (< 3K examples, use Multi-Turn DPO)
Data Qualification Checklist
Before starting SWEET-RL implementation, assess your readiness:
Data Volume Threshold
- Below 3K trajectories: use Multi-Turn DPO instead
- 3K–6K trajectories: SWEET-RL may underperform DPO
- 6K+ trajectories: SWEET-RL recommended (15K+ optimal)
Required Checkpoints
[✓] Answer Key: Ground-truth solution exists for every task (reference code, expert transcript, target output)
[✓] Multi-Turn: Tasks average 3+ conversation turns
[✓] Success Metric: Automated verification available (unit tests, similarity scores, pass/fail labels)
[✓] Outcome Mix: Dataset includes both successful (30-50%) and failed (50-70%) trajectories
[✓] Role Labels: Conversation turns clearly marked (user/assistant/system)
Interpretation: If you cannot check all boxes, consider Multi-Turn DPO (which needs only success labels) or standard fine-tuning. Missing the 6K+ volume threshold is the most common blocker.
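The thresholds above can be wired into a quick readiness check before committing to SWEET-RL. This is a sketch based on this article's rules of thumb, not an official tool:

```python
def recommend_method(num_trajectories: int, has_ground_truth: bool,
                     avg_turns: float, has_auto_metric: bool) -> str:
    """Map the qualification checklist to a training-method suggestion."""
    if not has_ground_truth or not has_auto_metric:
        return "RLHF / LLM-as-Judge (no verifiable ground truth)"
    if avg_turns < 3:
        return "Regular DPO (single- or few-turn task)"
    if num_trajectories < 3_000:
        return "Multi-Turn DPO (too little data for a reliable critic)"
    if num_trajectories < 6_000:
        return "Multi-Turn DPO, revisit SWEET-RL once you pass 6K trajectories"
    return "SWEET-RL (15K+ trajectories is optimal)"

print(recommend_method(8_000, True, 7.5, True))  # -> "SWEET-RL ..."
```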
Tech Stack
| Component | Recommended | Notes |
|---|---|---|
| Base model | Llama-3.1-8B-Instruct | Paper's primary model |
| Critic model | Same architecture as actor | Or VLM for multimodal (Qwen2-VL-7B) |
| Training framework | Official repo | Includes ColBench |
| Compute | 8x A100 GPUs | Estimated for 8B model |
Step-by-Step Workflow
Phase 1: Data Collection
1. Define your task with clear success criteria
2. Collect 10K+ task instances with ground truth
3. Generate trajectories using your base model
4. Filter for both successful and failed attempts
5. Target: 15K+ trajectories (mix of outcomes)
Phase 2: Critic Training
1. Construct preference pairs:
- Same task, different trajectories
- Higher cumulative reward = chosen
2. Train critic with Bradley-Terry objective
3. Include training-time info (ground truth) as input
4. Validate on held-out preference pairs
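A quick way to run step 4 is to measure preference accuracy on held-out pairs, i.e., how often the critic scores the chosen trajectory above the rejected one. A sketch with a hypothetical `critic.score` helper:

```python
def preference_accuracy(critic, held_out_pairs) -> float:
    """Fraction of held-out pairs where the critic ranks 'chosen' higher.

    Each pair is (task, chosen_trajectory, rejected_trajectory); the critic
    also receives the task's ground truth as training-time information.
    """
    correct = 0
    for task, chosen, rejected in held_out_pairs:
        if critic.score(task, chosen) > critic.score(task, rejected):
            correct += 1
    return correct / max(len(held_out_pairs), 1)
```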
Phase 3: Actor Training
1. For each turn in each trajectory:
- Sample 16 candidate responses
- Score with trained critic
- Top 8 = chosen, bottom 8 = rejected
2. Apply DPO loss on turn-level preferences
3. Repeat for multiple epochs
Key Parameters
| Parameter | Value | Notes |
|---|---|---|
| Candidate samples per turn | 16 | For Best-of-N during training |
| Chosen/rejected split | 50/50 | Top half vs bottom half |
| Min training trajectories | 6,000 | Below this, use Multi-Turn DPO |
| Max turns per trajectory | 10 | ColBench default |
| Length normalization | Required | Divide advantage by token count |
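The table above translates directly into a training configuration. A sketch; the field names are illustrative, not the official repo's config schema:

```python
from dataclasses import dataclass

@dataclass
class SweetRLConfig:
    """Key hyperparameters from the paper / ColBench setup."""
    candidates_per_turn: int = 16      # Best-of-N sampling during training
    chosen_fraction: float = 0.5       # top half chosen, bottom half rejected
    min_trajectories: int = 6_000      # below this, prefer Multi-Turn DPO
    max_turns: int = 10                # ColBench default
    length_normalize: bool = True      # divide advantage by token count

config = SweetRLConfig()
```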
Adaptation to Your Domain
Customer Support Bot:
- Ground truth: Expert response transcripts
- Success metric: Resolution rate + customer satisfaction
- Turns: Typically 5-15
Code Assistant:
- Ground truth: Correct code implementation
- Success metric: Unit test pass rate
- Turns: Typically 3-10
Tutoring System:
- Ground truth: Ideal explanation path
- Success metric: Student quiz performance
- Turns: Typically 5-20
Limitations
- Data requirements: SWEET-RL needs more data than Multi-Turn DPO to train an effective critic. At 3K samples, it underperforms.
- Ground truth dependency: You need training-time information (reference solutions). For tasks without clear ground truth, SWEET-RL does not apply.
- Offline only: The paper tests offline RL (learning from collected data). Online learning with human feedback remains expensive.
- Domains tested: Only validated on artifact creation (code, webpages). Generalization to other domains (dialogue, planning) is untested.
- Compute cost: Training both critic and actor requires roughly 2x the compute of standard fine-tuning.
If you lack 6K+ trajectories, bootstrap your dataset: use GPT-4 to generate initial trajectories for your tasks, then label success/failure with your automated tests or human review. This gives you the data volume needed to train the SWEET-RL critic. As your own model improves, gradually replace GPT-4 trajectories with self-generated ones.
The Bottom Line
For ML engineers building multi-turn agents:
SWEET-RL is the current best method if you have:
- 6K+ training trajectories
- Ground truth artifacts for training
- Multi-turn tasks (3+ turns)
The 6% improvement over Multi-Turn DPO is significant, and matching GPT-4o with an 8B model means major cost savings at deployment.
For engineering managers:
Budget 2-3 weeks for integration. The official Facebook Research code provides a solid starting point. ROI calculation: if your 8B model can match GPT-4o quality, you save ~10x on inference costs at scale.
For researchers:
The asymmetric actor-critic design (critic sees privileged information, actor does not) is a generalizable pattern. The key insight is that training-time information should inform credit assignment, not just final outcomes.
Official resources:
- Paper: arxiv.org/abs/2503.15478
- Code: github.com/facebookresearch/sweet_rl
- Benchmark: ColBench (included in repo)
Cite this paper
Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li (2025). SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks. arXiv preprint arXiv:2503.15478.