- The Problem. Multi-turn agent training wastes signal: when a 10-turn conversation fails, current methods assign the same penalty to all turns, even if only turn 7 was wrong.
- The Solution. SWEET-RL trains a critic that sees the ground-truth solution during training, enabling it to assign step-level rewards. The agent never sees the solution, keeping deployment realistic.
- The Results. Llama-3.1-8B matches GPT-4o (40.4% success) and beats O1-Mini (30.3%) on collaborative coding. 6% absolute gain over the previous best method. Official code from Meta AI.
Illustrative cost comparison at 10M tokens/month:
- Before (GPT-4o API): 10M tokens × $0.01/token = $100K/month
- After (8B model + SWEET-RL): 10M tokens × $0.001/token = $10K/month
- Result: $90K saved monthly (a 90% reduction) at the same 40.4% task success rate
Research Overview
If you have trained a multi-turn agent, you know this frustration: the agent has a productive 9-turn conversation, then says something wrong in turn 10, and the whole trajectory gets marked as failure. All 9 good turns? Penalized. The training signal is wasted.
This is the credit assignment problem for multi-turn agents. Current methods like DPO compare entire trajectories: this 10-turn conversation succeeded, that one failed. But they cannot pinpoint which specific turn made the difference. It is like grading a student's exam by looking only at the final answer, not the work.
Direct Preference Optimization (DPO) is a popular method for fine-tuning LLMs on preference data. Instead of training a separate reward model, DPO directly optimizes the policy using pairs of "chosen" (better) and "rejected" (worse) responses. It is simpler than RLHF but struggles with multi-turn credit assignment.
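As a refresher, here is a minimal sketch of the DPO loss on a single chosen/rejected pair. It assumes you already have summed log-probabilities of each response under the policy and a frozen reference model; the function and argument names are illustrative, not from the SWEET-RL repo.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    """
    # Implicit reward of each response: beta * (log pi_policy - log pi_ref)
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected rewards
    return -F.logsigmoid(chosen_reward - rejected_reward)

# Example with dummy log-probabilities
loss = dpo_loss(torch.tensor(-12.3), torch.tensor(-15.9),
                torch.tensor(-13.0), torch.tensor(-14.8))
```

Because the comparison happens at the level of whole responses (or whole trajectories, in the multi-turn case), the loss by itself cannot say which turn caused the failure.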
SWEET-RL (Step-WisE Evaluation from Training-time information) solves this with an asymmetric actor-critic design. The key insight: during training, you often have access to the "right answer" (reference code, target output, solution). Why not let the critic see it?
The critic sees both the conversation history AND the solution. This lets it judge whether each turn moves toward or away from the goal. The actor (the agent you deploy) never sees the solution, keeping everything realistic.
The result? Llama-3.1-8B trained with SWEET-RL matches GPT-4o on collaborative coding tasks (40.4% vs 40.4%) and beats O1-Mini (30.3%). A 6% absolute improvement over Multi-Turn DPO, the previous best method.
The Credit Assignment Problem
Consider this 5-turn coding collaboration where the agent makes good progress but introduces a bug in turn 4:
Before vs After: How Training Signal Changes
| Turn | Action | Traditional (Multi-Turn DPO) | SWEET-RL |
|---|---|---|---|
| 1 | Correct function signature | -1 (penalized) | +0.8 (rewarded) |
| 2 | Core logic implemented | -1 (penalized) | +0.9 (rewarded) |
| 3 | Edge cases handled | -1 (penalized) | +0.7 (rewarded) |
| 4 | Bug introduced | -1 (penalized) | -0.6 (penalized) |
| 5 | Failed to fix bug | -1 (penalized) | -0.8 (penalized) |
| Result | Tests fail | All turns punished equally | Good turns reinforced, bad turns corrected |
With trajectory-level feedback (Traditional), turns 1-3 are penalized despite being correct. The model learns "this whole approach was wrong" instead of "the bug was introduced in turn 4." You need far more examples to distinguish good patterns from bad ones when the signal is diluted this way.
Why step-level feedback matters:
- Faster learning: Every turn provides useful gradient, not just the final outcome
- Better generalization: Good patterns are reinforced even in failed trajectories
- Clearer debugging: You can trace exactly which steps the model struggles with
SWEET-RL Architecture
SWEET-RL operates in two stages:
Stage 1: Train the Critic
The critic learns to assign advantages (how good is this action?) to each turn. It trains on preference pairs from offline trajectories:
- Take two trajectories for the same task
- Label the one with higher cumulative reward as "chosen"
- Train the critic to prefer chosen trajectories using Bradley-Terry objective
The Bradley-Terry objective is a probabilistic model for pairwise comparisons. Given two items, it predicts the probability that one is preferred over the other based on their latent scores. The loss encourages the model to assign a higher score to the "chosen" trajectory; it is the same objective used to train RLHF reward models.
The key: the critic sees training-time information (the reference solution) that the actor never sees.
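A minimal sketch of the Bradley-Terry preference loss used to train the critic (illustrative code, not the official implementation; where the scalar trajectory scores come from is abstracted away here):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_score: torch.Tensor,
                       rejected_score: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry objective: P(chosen > rejected) = sigmoid(s_c - s_r).

    Minimizing the negative log-likelihood pushes the critic to score
    the chosen trajectory higher than the rejected one.
    """
    return -F.logsigmoid(chosen_score - rejected_score)

# Illustrative usage: scores come from the critic, which sees both the
# conversation and the ground-truth solution (training-time information).
chosen_score = torch.tensor(1.7)    # critic score for higher-reward trajectory
rejected_score = torch.tensor(0.4)  # critic score for lower-reward trajectory
loss = bradley_terry_loss(chosen_score, rejected_score)
```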
Figure: SWEET-RL's asymmetric actor-critic architecture. The critic uses privileged information to assign step-level rewards. During training, conversation turns flow to both the Critic (which also receives the ground-truth solution) and the Actor (which only sees the history). The Critic produces step-level advantage scores that guide the Actor's learning; at deployment, only the Actor runs and the Critic is discarded.
Stage 2: Train the Actor
With the trained critic, optimize the actor policy:
- At each turn, sample 16 candidate responses
- Rank them by critic advantage scores
- Top 50% become "chosen", bottom 50% become "rejected"
- Apply DPO loss on these turn-level preferences
This converts the trajectory-level signal into step-level signal, enabling much more efficient learning.
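A minimal sketch of how turn-level preferences might be built from critic scores. The `actor.sample` and `critic.advantage` helpers are hypothetical stand-ins; the actual pipeline lives in the official repo.

```python
def build_turn_preferences(actor, critic, history, solution,
                           num_candidates=16):
    """Sample candidates for one turn, rank them with the critic,
    and split into chosen/rejected halves for turn-level DPO."""
    candidates = [actor.sample(history) for _ in range(num_candidates)]

    # The critic sees the history AND the ground-truth solution;
    # the actor only ever sees the history.
    scored = sorted(candidates,
                    key=lambda c: critic.advantage(history, c, solution),
                    reverse=True)

    half = num_candidates // 2
    chosen, rejected = scored[:half], scored[half:]

    # Pair each chosen response with a rejected one for the DPO loss.
    return list(zip(chosen, rejected))
```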
Critic Parameterization
The advantage function uses a clever parameterization:
A(observation, action, context) =
  (1/L) * sum_i log( π_trained(a_i | prefix) / π_reference(a_i | prefix) )
Where L is the number of tokens in the action (for normalization), a_i is its i-th token, and "prefix" is everything the model has seen before that token. This is the mean per-token log probability ratio between the trained and reference models.
Without the 1/L normalization, agents degenerate to generating minimal responses (shorter = higher advantage). The paper shows success rate drops from 40.4% to 3.6% without normalization.
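A minimal sketch of this length-normalized advantage, assuming both models expose per-token log-probabilities for the action tokens (helper and variable names are illustrative):

```python
import torch

def advantage(trained_token_logps: torch.Tensor,
              ref_token_logps: torch.Tensor) -> torch.Tensor:
    """Mean per-token log-probability ratio between the trained critic
    and the frozen reference model for one action.

    Dividing by the number of tokens (taking the mean) is the length
    normalization; without it, shorter responses get inflated scores.
    """
    L = trained_token_logps.shape[0]
    return (trained_token_logps - ref_token_logps).sum() / L

# Example with dummy per-token log-probabilities for a 5-token action
trained = torch.tensor([-1.2, -0.8, -2.1, -0.5, -1.0])
ref = torch.tensor([-1.5, -1.1, -2.0, -0.9, -1.3])
adv = advantage(trained, ref)
```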
The Asymmetric Trick
Think of SWEET-RL as a teacher (critic) with an answer key grading a student's (actor's) work. The teacher sees the correct solution and can tell the student "Your approach in step 3 is leading you astray" without revealing the answer. The student learns which steps work and which don't, without ever seeing the answer key. At test time, the student solves problems independently.
The core innovation is asymmetric information between actor and critic:
| Component | Sees History | Sees Solution | Purpose |
|---|---|---|---|
| Actor | Yes | No | Generate responses |
| Critic | Yes | Yes | Judge quality |
Why this works:
- The critic can make informed judgments because it knows where the conversation should go
- The actor learns from these judgments without ever seeing the solution
- At deployment, the actor works without any privileged information
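Concretely, the asymmetry comes down to what goes into each model's prompt. A minimal sketch (the prompt layout is illustrative, not the paper's exact format): the critic's input appends the reference solution, while the actor's input contains only the conversation.

```python
def build_actor_prompt(history: list[str]) -> str:
    """The deployed agent only ever sees the conversation history."""
    return "\n".join(history)

def build_critic_prompt(history: list[str], solution: str) -> str:
    """During training, the critic also receives the ground-truth
    solution (training-time information) so it can judge each turn."""
    return "\n".join(history) + f"\n\n[REFERENCE SOLUTION]\n{solution}"

history = ["User: Write a function that reverses a string.",
           "Agent: def reverse(s): return s[::-1]"]
solution = "def reverse(s):\n    return s[::-1]"

actor_input = build_actor_prompt(history)             # no privileged info
critic_input = build_critic_prompt(history, solution) # privileged info
```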
Training-Time Information by Industry
The "privileged information" the critic sees varies by domain. Here is what it looks like across industries:
| Industry | Training-Time Info (Critic Sees) | Success Metric | Typical Data Source |
|---|---|---|---|
| Software Dev | Reference code, test cases | Unit test pass rate | GitHub repos, internal codebase |
| Customer Support | Expert resolution transcript | Resolution rate, CSAT | Ticket history, QA samples |
| Legal Tech | Verified contract clauses | Compliance score | Reviewed documents |
| Healthcare | Clinical guidelines, diagnosis | Accuracy vs specialist | Medical records (de-identified) |
| E-commerce | Purchase completion path | Conversion rate | Transaction logs |
| Education | Correct solution steps | Quiz performance | Tutor session recordings |
The key requirement: you must have verifiable ground truth during training. At deployment, the agent never sees this information.
ColBench Benchmark
The paper introduces ColBench, a benchmark for multi-turn collaborative agents:
Tasks
| Task | Description | Turns | Metric |
|---|---|---|---|
| Backend Programming | Write Python functions through conversation | 10 | Unit test pass rate |
| Frontend Design | Create HTML pages through conversation | 10 | CLIP similarity to reference |
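The backend metric (unit-test pass rate) can be checked mechanically. Here is a minimal sketch of such a check, assuming each task asks for a single function named `solution` (illustrative only; ColBench's actual harness ships with the repository, and a real harness should sandbox execution rather than call exec()).

```python
def passes_unit_tests(candidate_code: str, test_cases: list[tuple]) -> bool:
    """Execute candidate code and check it against (args, expected) pairs."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)          # define the function
        fn = namespace["solution"]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False

# Example: a task whose reference behavior is doubling a number
code = "def solution(x):\n    return 2 * x"
tests = [((1,), 2), ((5,), 10)]
print(passes_unit_tests(code, tests))  # True
```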
Scale
- Training: 10K+ backend tasks, 10K frontend tasks
- Testing: 1K backend, 500 frontend (manually inspected)
- Offline trajectories: 15K backend, 6K frontend
Why ColBench?
- Sufficient complexity: Tasks require multi-step reasoning, not single-turn solutions
- Minimal engineering overhead: LLM simulators with ground-truth artifacts, no complex environments
- Procedural generation: Prevents overfitting to specific examples
The benchmark is open-source and included in the official repository.
Benchmark Results
Figure: ColBench results for Llama-3.1-8B-Instruct. SWEET-RL achieves a 6% gain over Multi-Turn DPO, matching GPT-4o.
Main Results (Llama-3.1-8B-Instruct)
| Method | Backend Success | Frontend Win Rate |
|---|---|---|
| Zero-Shot | 22.4% | 33.8% |
| Rejection Fine-Tuning | 28.2% | 38.6% |
| Multi-Turn DPO | 34.4% | 42.8% |
| SWEET-RL | 40.4% | 48.2% |
SWEET-RL achieves 6% absolute improvement over Multi-Turn DPO.
Comparison to Larger Models
| Model | Backend Success |
|---|---|
| O1-Mini | 30.3% |
| Llama-3.1-70B Zero-Shot | 35.0% |
| GPT-4o | 40.4% |
| Llama-8B + SWEET-RL | 40.4% |
An 8B model matches GPT-4o and beats O1-Mini. The 8B model with SWEET-RL also matches the 70B model with zero-shot prompting.
Off-Policy Learning (70B Model)
Off-policy learning is a training setup where the data comes from a different policy (e.g., a smaller model) than the one being optimized. It lets you improve a target model using trajectories collected elsewhere, which is critical in production when you want to reuse existing logs.
SWEET-RL's off-policy capability matters for production: collect trajectories with your fast, cheap 8B model in production, then use that data to train your premium 70B model. You get cheap data collection at scale plus a high-quality model for critical tasks. Imitation-based approaches such as rejection fine-tuning handle this poorly, as the results below show.
SWEET-RL works even when training a larger model on data from a smaller model:
| Method | Success Rate |
|---|---|
| Zero-Shot | 35.0% |
| Rejection Fine-Tuning | 31.9% (worse!) |
| Multi-Turn DPO | 41.8% |
| SWEET-RL | 45.6% |
Rejection Fine-Tuning actually hurts performance because it forces the 70B model to imitate suboptimal 8B trajectories word-by-word. SWEET-RL avoids this by using step-level rewards instead of direct imitation.
Ablation Studies
What Makes SWEET-RL Work?
| Configuration | Success Rate |
|---|---|
| SWEET-RL (full) | 40.4% |
| Without training-time info | 31.2% |
| With regression head | 36.2% |
| Without normalization | 3.6% |
Three critical components:
- Training-time information: Without it, the critic cannot judge step quality. 9 percentage points lost.
- Log probability parameterization: Using a regression head instead loses 4 points. LLMs generalize better with their native output format.
- Length normalization: Without it, the model degenerates completely.
Credit Assignment Comparison
The paper compares different ways to assign step-level rewards:
| Method | Best-of-16 Success |
|---|---|
| SWEET-RL | Best scaling |
| LLM-as-Judge | Limited improvement |
| Value function head | Poor generalization |
| No training-time info | Significantly degraded |
LLM-as-Judge (using another LLM to rate responses) gets distracted by length and format rather than actual quality. Value function heads (standard in RL) generalize poorly to new tasks.
Data Scaling
SWEET-RL requires more initial data to train a reliable critic:
- At 3K samples: Multi-Turn DPO outperforms SWEET-RL
- At 6K+ samples: SWEET-RL catches up and surpasses
- At 15K samples: SWEET-RL significantly ahead
If you have limited data, Multi-Turn DPO may be a better choice. With sufficient data (6K+), SWEET-RL is clearly superior.
Implementation Blueprint
When to Use SWEET-RL
Use SWEET-RL when you have:
- Multi-turn tasks (3+ turns of interaction)
- Ground-truth artifacts for training (reference code, target outputs, solution paths)
- Sufficient data (6K+ trajectory pairs, more is better)
- Clear success criteria (unit tests, similarity scores, verifiable outcomes)
Do not use SWEET-RL for:
- Single-turn tasks (use regular DPO)
- Tasks without ground truth (use RLHF or LLM-as-Judge)
- Limited data scenarios (< 3K examples, use Multi-Turn DPO)
Data Qualification Checklist
Before starting SWEET-RL implementation, assess your readiness:
Data Volume Threshold
- Below 3K trajectories: use Multi-Turn DPO instead
- 3K–6K trajectories: SWEET-RL may underperform DPO
- 6K+ trajectories: SWEET-RL recommended (15K+ optimal)
Required Checkpoints
[✓] Answer Key: Ground-truth solution exists for every task (reference code, expert transcript, target output)
[✓] Multi-Turn: Tasks average 3+ conversation turns
[✓] Success Metric: Automated verification available (unit tests, similarity scores, pass/fail labels)
[✓] Outcome Mix: Dataset includes both successful (30-50%) and failed (50-70%) trajectories
[✓] Role Labels: Conversation turns clearly marked (user/assistant/system)
Interpretation: If you cannot check all boxes, consider Multi-Turn DPO (which needs only success labels) or standard fine-tuning. Missing the 6K+ volume threshold is the most common blocker.
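The thresholds above can be wired into a quick readiness check before committing to SWEET-RL. This is a sketch based on this article's rules of thumb, not an official tool:

```python
def recommend_method(num_trajectories: int, has_ground_truth: bool,
                     avg_turns: float, has_auto_metric: bool) -> str:
    """Map the qualification checklist to a training-method suggestion."""
    if not has_ground_truth or not has_auto_metric:
        return "RLHF / LLM-as-Judge (no verifiable ground truth)"
    if avg_turns < 3:
        return "Regular DPO (single- or few-turn task)"
    if num_trajectories < 3_000:
        return "Multi-Turn DPO (too little data for a reliable critic)"
    if num_trajectories < 6_000:
        return "Multi-Turn DPO, revisit SWEET-RL once you pass 6K trajectories"
    return "SWEET-RL (15K+ trajectories is optimal)"

print(recommend_method(8_000, True, 7.5, True))  # -> "SWEET-RL ..."
```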
Tech Stack
| Component | Recommended | Notes |
|---|---|---|
| Base model | Llama-3.1-8B-Instruct | Paper's primary model |
| Critic model | Same architecture as actor | Or VLM for multimodal (Qwen2-VL-7B) |
| Training framework | Official repo | Includes ColBench |
| Compute | 8x A100 GPUs | Estimated for 8B model |
Step-by-Step Workflow
Phase 1: Data Collection
1. Define your task with clear success criteria
2. Collect 10K+ task instances with ground truth
3. Generate trajectories using your base model
4. Filter for both successful and failed attempts
5. Target: 15K+ trajectories (mix of outcomes)
Phase 2: Critic Training
1. Construct preference pairs:
- Same task, different trajectories
- Higher cumulative reward = chosen
2. Train critic with Bradley-Terry objective
3. Include training-time info (ground truth) as input
4. Validate on held-out preference pairs
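A quick way to run step 4 is to measure preference accuracy on held-out pairs, i.e., how often the critic scores the chosen trajectory above the rejected one. A sketch with a hypothetical `critic.score` helper:

```python
def preference_accuracy(critic, held_out_pairs) -> float:
    """Fraction of held-out pairs where the critic ranks 'chosen' higher.

    Each pair is (task, chosen_trajectory, rejected_trajectory); the critic
    also receives the task's ground truth as training-time information.
    """
    correct = 0
    for task, chosen, rejected in held_out_pairs:
        if critic.score(task, chosen) > critic.score(task, rejected):
            correct += 1
    return correct / max(len(held_out_pairs), 1)
```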
Phase 3: Actor Training
1. For each turn in each trajectory:
- Sample 16 candidate responses
- Score with trained critic
- Top 8 = chosen, bottom 8 = rejected
2. Apply DPO loss on turn-level preferences
3. Repeat for multiple epochs
Key Parameters
| Parameter | Value | Notes |
|---|---|---|
| Candidate samples per turn | 16 | For Best-of-N during training |
| Chosen/rejected split | 50/50 | Top half vs bottom half |
| Min training trajectories | 6,000 | Below this, use Multi-Turn DPO |
| Max turns per trajectory | 10 | ColBench default |
| Length normalization | Required | Divide advantage by token count |
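The table above translates directly into a training configuration. A sketch; the field names are illustrative, not the official repo's config schema:

```python
from dataclasses import dataclass

@dataclass
class SweetRLConfig:
    """Key hyperparameters from the paper / ColBench setup."""
    candidates_per_turn: int = 16      # Best-of-N sampling during training
    chosen_fraction: float = 0.5       # top half chosen, bottom half rejected
    min_trajectories: int = 6_000      # below this, prefer Multi-Turn DPO
    max_turns: int = 10                # ColBench default
    length_normalize: bool = True      # divide advantage by token count

config = SweetRLConfig()
```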
Adaptation to Your Domain
Customer Support Bot:
- Ground truth: Expert response transcripts
- Success metric: Resolution rate + customer satisfaction
- Turns: Typically 5-15
Code Assistant:
- Ground truth: Correct code implementation
- Success metric: Unit test pass rate
- Turns: Typically 3-10
Tutoring System:
- Ground truth: Ideal explanation path
- Success metric: Student quiz performance
- Turns: Typically 5-20
Limitations
- Data requirements: SWEET-RL needs more data than Multi-Turn DPO to train an effective critic. At 3K samples, it underperforms.
- Ground truth dependency: You need training-time information (reference solutions). For tasks without clear ground truth, SWEET-RL does not apply.
- Offline only: The paper tests offline RL (learning from collected data). Online learning with human feedback remains expensive.
- Domains tested: Only validated on artifact creation (code, webpages). Generalization to other domains (dialogue, planning) is untested.
- Compute cost: Training both critic and actor requires roughly 2x the compute of standard fine-tuning.
If you lack 6K+ trajectories, bootstrap your dataset: use GPT-4 to generate initial trajectories for your tasks, then label success/failure with your automated tests or human review. This gives you the data volume needed to train the SWEET-RL critic. As your own model improves, gradually replace GPT-4 trajectories with self-generated ones.
The Bottom Line
For ML engineers building multi-turn agents:
SWEET-RL is the current best method if you have:
- 6K+ training trajectories
- Ground truth artifacts for training
- Multi-turn tasks (3+ turns)
The 6% improvement over Multi-Turn DPO is significant, and matching GPT-4o with an 8B model means major cost savings at deployment.
For engineering managers:
Budget 2-3 weeks for integration. The official Facebook Research code provides a solid starting point. ROI calculation: if your 8B model can match GPT-4o quality, you save ~10x on inference costs at scale.
For researchers:
The asymmetric actor-critic design (critic sees privileged information, actor does not) is a generalizable pattern. The key insight is that training-time information should inform credit assignment, not just final outcomes.
Official resources:
- Paper: arxiv.org/abs/2503.15478
- Code: github.com/facebookresearch/sweet_rl
- Benchmark: ColBench (included in repo)
Cite this paper
Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li (2025). SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks. arXiv preprint arXiv:2503.15478.