arXiv 2025 · March 19, 2025

SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Yifei Zhou et al.

When your multi-turn agent fails after 10 conversation turns, which turn caused it? Current training methods cannot tell. They treat the entire conversation as pass/fail, wasting signal from partially-correct attempts. SWEET-RL fixes this with a clever trick: give the critic access to the solution during training so it can judge each step, while keeping the agent blind to it. The result? An 8B model that matches GPT-4o on collaborative coding tasks.

Categories: AI Agents, Reinforcement Learning, Large Language Models

Key Findings

  1. Solves the 'which turn broke it?' problem: assigns credit to individual steps in multi-turn conversations instead of treating the whole trajectory as pass/fail

  2. Small models match large ones: Llama-3.1-8B trained with SWEET-RL matches GPT-4o and exceeds O1-Mini on collaborative coding tasks

  3. 6% absolute improvement over Multi-Turn DPO, the previous best method for training multi-turn agents

  4. Asymmetric training trick: the critic sees the solution during training (to judge steps), but the agent never does (to stay realistic)

  5. Works with off-policy data: can train a 70B model using trajectories collected from an 8B model, still getting improvements

  6. Open-source benchmark (ColBench) and code from Meta AI enable direct reproduction and extension

TL;DR
  1. The Problem. Multi-turn agent training wastes signal: when a 10-turn conversation fails, current methods assign the same penalty to all turns, even if only turn 7 was wrong.

  2. The Solution. SWEET-RL trains a critic that sees the ground-truth solution during training, enabling it to assign step-level rewards. The agent never sees the solution, keeping deployment realistic.

  3. The Results. Llama-3.1-8B matches GPT-4o (40.4% success) and beats O1-Mini (30.3%) on collaborative coding. 6% absolute gain over previous best method. Official code from Meta AI.

Executive ROI: Run GPT-4o quality agents at 1/10th the cost

Before (GPT-4o API): 10M tokens/month × $0.01/token = $100K/month

After (8B model + SWEET-RL): 10M tokens/month × $0.001/token = $10K/month

Result: $90K saved monthly (90% reduction), same 40.4% task success rate.

Research Overview

If you have trained a multi-turn agent, you know this frustration: the agent has a productive 9-turn conversation, then says something wrong in turn 10, and the whole trajectory gets marked as failure. All 9 good turns? Penalized. The training signal is wasted.

This is the credit assignment problem for multi-turn agents. Current methods like DPO compare entire trajectories: this 10-turn conversation succeeded, that one failed. But they cannot pinpoint which specific turn made the difference. It is like grading a student's exam by looking only at the final answer, not the work.

What is DPO?

Direct Preference Optimization (DPO) is a popular method for fine-tuning LLMs on preference data. Instead of training a separate reward model, DPO directly optimizes the policy using pairs of "chosen" (better) and "rejected" (worse) responses. It is simpler than RLHF but struggles with multi-turn credit assignment.
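For reference, here is a minimal sketch of the standard (single-turn) DPO objective in PyTorch; the function name, tensor shapes, and the beta value are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over sequence-level log-probabilities.

    Each argument is a (batch,) tensor holding the total log-probability
    of the chosen/rejected response under the policy being trained or
    under the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to widen the margin between chosen and rejected.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```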

SWEET-RL (Step-WisE Evaluation from Training-time information) solves this with an asymmetric actor-critic design. The key insight: during training, you often have access to the "right answer" (reference code, target output, solution). Why not let the critic see it?

The critic sees both the conversation history AND the solution. This lets it judge whether each turn moves toward or away from the goal. The actor (the agent you deploy) never sees the solution, keeping everything realistic.

The result? Llama-3.1-8B trained with SWEET-RL matches GPT-4o on collaborative coding tasks (40.4% vs 40.4%) and beats O1-Mini (30.3%). A 6% absolute improvement over Multi-Turn DPO, the previous best method.

The Credit Assignment Problem

Consider this 5-turn coding collaboration where the agent makes good progress but introduces a bug in turn 4:

Before vs After: How Training Signal Changes

| Turn | Action | Traditional (Multi-Turn DPO) | SWEET-RL |
|------|--------|------------------------------|----------|
| 1 | Correct function signature | -1 (penalized) | +0.8 (rewarded) |
| 2 | Core logic implemented | -1 (penalized) | +0.9 (rewarded) |
| 3 | Edge cases handled | -1 (penalized) | +0.7 (rewarded) |
| 4 | Bug introduced | -1 (penalized) | -0.6 (penalized) |
| 5 | Failed to fix bug | -1 (penalized) | -0.8 (penalized) |
| Result | Tests fail | All turns punished equally | Good turns reinforced, bad turns corrected |

The Wasted Signal Problem

With trajectory-level feedback (Traditional), turns 1-3 are penalized despite being correct. The model learns "this whole approach was wrong" instead of "your bug was in step 4." You need exponentially more examples to distinguish good patterns from bad patterns when signal is diluted this way.

Why step-level feedback matters:

  • Faster learning: Every turn provides useful gradient, not just the final outcome
  • Better generalization: Good patterns are reinforced even in failed trajectories
  • Clearer debugging: You can trace exactly which steps the model struggles with

SWEET-RL Architecture

SWEET-RL operates in two stages:

Stage 1: Train the Critic

The critic learns to assign advantages (how good is this action?) to each turn. It trains on preference pairs from offline trajectories:

  1. Take two trajectories for the same task
  2. Label the one with higher cumulative reward as "chosen"
  3. Train the critic to prefer chosen trajectories using Bradley-Terry objective
What is the Bradley-Terry objective?

A probabilistic model for pairwise comparisons. Given two items, it predicts the probability that one is preferred over the other based on their latent scores. The loss encourages the model to assign higher scores to the "chosen" trajectory. It is the same objective used in RLHF reward model training.

The key: the critic sees training-time information (the reference solution) that the actor never sees.
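As a rough sketch, the critic's pairwise objective can be written as below; the scores stand for whatever scalar the critic assigns to a full trajectory after reading both the conversation and the reference solution, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores: torch.Tensor,
                       rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for the critic.

    chosen_scores / rejected_scores: (batch,) critic scores for trajectory
    pairs on the same task, where 'chosen' is the trajectory with the
    higher cumulative reward. The loss pushes the critic to score chosen
    trajectories above rejected ones.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```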

Figure: SWEET-RL asymmetric actor-critic architecture. The critic uses privileged information to assign step-level rewards. During training, conversation turns flow to both the Critic (which also receives the ground-truth solution) and the Actor (which sees only the history). The Critic produces step-level advantage scores that guide the Actor's learning. At deployment, only the Actor runs; the Critic is discarded.

Stage 2: Train the Actor

With the trained critic, optimize the actor policy:

  1. At each turn, sample 16 candidate responses
  2. Rank them by critic advantage scores
  3. Top 50% become "chosen", bottom 50% become "rejected"
  4. Apply DPO loss on these turn-level preferences

This converts the trajectory-level signal into step-level signal, enabling much more efficient learning.
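A sketch of how the turn-level preferences might be assembled; sample_fn and score_fn stand in for your own actor sampling and critic scoring code and are not part of the official repo.

```python
from typing import Callable, List, Tuple

def build_turn_preferences(
    history: str,
    sample_fn: Callable[[str, int], List[str]],  # actor: history -> N candidate responses
    score_fn: Callable[[str, str], float],       # critic: (history, response) -> advantage
    n_candidates: int = 16,
) -> List[Tuple[str, str]]:
    """Turn one conversation state into turn-level (chosen, rejected) pairs.

    Sample N candidates for the current turn, score each with the trained
    critic (which saw the solution during its own training), then treat
    the top half as chosen and the bottom half as rejected.
    """
    candidates = sample_fn(history, n_candidates)
    ranked = sorted(candidates, key=lambda r: score_fn(history, r), reverse=True)
    half = len(ranked) // 2
    # One simple pairing: zip the top half with the bottom half;
    # each (chosen, rejected) pair then feeds the DPO loss.
    return list(zip(ranked[:half], ranked[half:]))
```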

Critic Parameterization

The advantage function uses a clever parameterization:

A(observation, action, context) =
  (1/L) * sum over tokens t of [ log π_trained(a_t | ·) − log π_ref(a_t | ·) ]

Where L is the number of tokens in the action (for length normalization). This is the mean per-token log probability ratio between the trained and reference models.

Critical: Length Normalization

Without the 1/L normalization, agents degenerate to generating minimal responses (shorter = higher advantage). The paper shows success rate drops from 40.4% to 3.6% without normalization.
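In code, the advantage is just the mean per-token log-probability ratio over the action; a minimal sketch, assuming per-token log-probs for the action tokens are already available from the trained critic model and the frozen reference model:

```python
import torch

def step_advantage(trained_token_logps: torch.Tensor,
                   ref_token_logps: torch.Tensor) -> torch.Tensor:
    """Length-normalized advantage for one action (one turn).

    Both inputs have shape (num_action_tokens,) and hold per-token
    log-probabilities under the trained critic LLM and the frozen
    reference model. Taking the mean rather than the sum is the 1/L
    normalization that prevents 'shorter is always better' collapse.
    """
    return (trained_token_logps - ref_token_logps).mean()
```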

The Asymmetric Trick

The Teacher-Student Analogy

Think of SWEET-RL as a teacher (critic) with an answer key grading a student's (actor's) work. The teacher sees the correct solution and can tell the student "Your approach in step 3 is leading you astray" without revealing the answer. The student learns which steps work and which don't, without ever seeing the answer key. At test time, the student solves problems independently.

The core innovation is asymmetric information between actor and critic:

| Component | Sees History | Sees Solution | Purpose |
|-----------|--------------|---------------|---------|
| Actor | Yes | No | Generate responses |
| Critic | Yes | Yes | Judge quality |

Why this works:

  1. The critic can make informed judgments because it knows where the conversation should go
  2. The actor learns from these judgments without ever seeing the solution
  3. At deployment, the actor works without any privileged information

Training-Time Information by Industry

The "privileged information" the critic sees varies by domain. Here is what it looks like across industries:

| Industry | Training-Time Info (Critic Sees) | Success Metric | Typical Data Source |
|----------|----------------------------------|----------------|---------------------|
| Software Dev | Reference code, test cases | Unit test pass rate | GitHub repos, internal codebase |
| Customer Support | Expert resolution transcript | Resolution rate, CSAT | Ticket history, QA samples |
| Legal Tech | Verified contract clauses | Compliance score | Reviewed documents |
| Healthcare | Clinical guidelines, diagnosis | Accuracy vs specialist | Medical records (de-identified) |
| E-commerce | Purchase completion path | Conversion rate | Transaction logs |
| Education | Correct solution steps | Quiz performance | Tutor session recordings |

The key requirement: you must have verifiable ground truth during training. At deployment, the agent never sees this information.

ColBench Benchmark

The paper introduces ColBench, a benchmark for multi-turn collaborative agents:

Tasks

| Task | Description | Turns | Metric |
|------|-------------|-------|--------|
| Backend Programming | Write Python functions through conversation | 10 | Unit test pass rate |
| Frontend Design | Create HTML pages through conversation | 10 | CLIP similarity to reference |

Scale

  • Training: 10K+ backend tasks, 10K frontend tasks
  • Testing: 1K backend, 500 frontend (manually inspected)
  • Offline trajectories: 15K backend, 6K frontend

Why ColBench?

  1. Sufficient complexity: Tasks require multi-step reasoning, not single-turn solutions
  2. Minimal engineering overhead: LLM simulators with ground-truth artifacts, no complex environments
  3. Procedural generation: Prevents overfitting to specific examples

The benchmark is open-source and included in the official repository.

Benchmark Results

Figure: ColBench results for Llama-3.1-8B-Instruct (backend and frontend success, with GPT-4o as baseline). SWEET-RL achieves a 6% gain over Multi-Turn DPO, matching GPT-4o.

Main Results (Llama-3.1-8B-Instruct)

| Method | Backend Success | Frontend Win Rate |
|--------|-----------------|-------------------|
| Zero-Shot | 22.4% | 33.8% |
| Rejection Fine-Tuning | 28.2% | 38.6% |
| Multi-Turn DPO | 34.4% | 42.8% |
| SWEET-RL | 40.4% | 48.2% |

SWEET-RL achieves 6% absolute improvement over Multi-Turn DPO.

Comparison to Larger Models

| Model | Backend Success |
|-------|-----------------|
| O1-Mini | 30.3% |
| Llama-3.1-70B Zero-Shot | 35.0% |
| GPT-4o | 40.4% |
| Llama-8B + SWEET-RL | 40.4% |

An 8B model matches GPT-4o and beats O1-Mini. With SWEET-RL, the 8B model also outperforms the 70B model run zero-shot (40.4% vs 35.0%).

Off-Policy Learning (70B Model)

What is off-policy learning?

A training setup where data comes from a different policy (e.g., a smaller model) than the one being optimized. It lets you improve a target model using trajectories collected elsewhere—critical for production where you want to reuse existing logs.

Production Insight: Train Large Models on Small Model Data

SWEET-RL's off-policy capability is a game-changer for production: collect trajectories with your fast, cheap 8B model in production, then use that data to train your premium 70B model. You get the best of both worlds: cheap data collection at scale, plus a high-quality model for critical tasks. This is not possible with imitation learning approaches.

SWEET-RL works even when training a larger model on data from a smaller model:

| Method | Success Rate |
|--------|--------------|
| Zero-Shot | 35.0% |
| Rejection Fine-Tuning | 31.9% (worse!) |
| Multi-Turn DPO | 41.8% |
| SWEET-RL | 45.6% |

Rejection Fine-Tuning actually hurts performance because it forces the 70B model to imitate suboptimal 8B trajectories word-by-word. SWEET-RL avoids this by using step-level rewards instead of direct imitation.

Ablation Studies

What Makes SWEET-RL Work?

| Configuration | Success Rate |
|---------------|--------------|
| SWEET-RL (full) | 40.4% |
| Without training-time info | 31.2% |
| With regression head | 36.2% |
| Without normalization | 3.6% |

Three critical components:

  1. Training-time information: Without it, the critic cannot judge step quality. 9 percentage points lost.
  2. Log probability parameterization: Using a regression head instead loses 4 points. LLMs generalize better with their native output format.
  3. Length normalization: Without it, the model degenerates completely.

Credit Assignment Comparison

The paper compares different ways to assign step-level rewards:

| Method | Best-of-16 Success |
|--------|--------------------|
| SWEET-RL | Best scaling |
| LLM-as-Judge | Limited improvement |
| Value function head | Poor generalization |
| No training-time info | Significantly degraded |

LLM-as-Judge (using another LLM to rate responses) gets distracted by length and format rather than actual quality. Value function heads (standard in RL) generalize poorly to new tasks.

Data Scaling

SWEET-RL requires more initial data to train a reliable critic:

  • At 3K samples: Multi-Turn DPO outperforms SWEET-RL
  • At 6K+ samples: SWEET-RL catches up and surpasses
  • At 15K samples: SWEET-RL significantly ahead

If you have limited data, Multi-Turn DPO may be a better choice. With sufficient data (6K+), SWEET-RL is clearly superior.

Implementation Blueprint

When to Use SWEET-RL

Use SWEET-RL when you have:

  1. Multi-turn tasks (3+ turns of interaction)
  2. Ground-truth artifacts for training (reference code, target outputs, solution paths)
  3. Sufficient data (6K+ trajectory pairs, more is better)
  4. Clear success criteria (unit tests, similarity scores, verifiable outcomes)

Do not use SWEET-RL for:

  • Single-turn tasks (use regular DPO)
  • Tasks without ground truth (use RLHF or LLM-as-Judge)
  • Limited data scenarios (< 3K examples, use Multi-Turn DPO)

Data Qualification Checklist

Before starting SWEET-RL implementation, assess your readiness:

SWEET-RL Readiness Scorecard

Data Volume Threshold

  • Below 3K trajectories: use Multi-Turn DPO instead
  • 3K–6K trajectories: SWEET-RL may underperform DPO
  • 6K+ trajectories: SWEET-RL recommended (15K+ optimal)

Required Checkpoints

[✓] Answer Key: Ground-truth solution exists for every task (reference code, expert transcript, target output)

[✓] Multi-Turn: Tasks average 3+ conversation turns

[✓] Success Metric: Automated verification available (unit tests, similarity scores, pass/fail labels)

[✓] Outcome Mix: Dataset includes both successful (30-50%) and failed (50-70%) trajectories

[✓] Role Labels: Conversation turns clearly marked (user/assistant/system)

Score interpretation: If you cannot check all boxes, consider Multi-Turn DPO (needs only success labels) or standard fine-tuning. Missing the 6K+ volume threshold is the most common blocker.

Tech Stack

| Component | Recommended | Notes |
|-----------|-------------|-------|
| Base model | Llama-3.1-8B-Instruct | Paper's primary model |
| Critic model | Same architecture as actor | Or VLM for multimodal (Qwen2-VL-7B) |
| Training framework | Official repo | Includes ColBench |
| Compute | 8x A100 GPUs | Estimated for 8B model |

Step-by-Step Workflow

Phase 1: Data Collection

1. Define your task with clear success criteria
2. Collect 10K+ task instances with ground truth
3. Generate trajectories using your base model
4. Filter for both successful and failed attempts
5. Target: 15K+ trajectories (mix of outcomes)
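One possible record layout for the collected trajectories; the field names below are an assumption for illustration, not the official ColBench schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    role: str      # "user" (human simulator) or "assistant" (agent)
    content: str

@dataclass
class Trajectory:
    task_id: str
    ground_truth: str                  # reference code / target output (seen only by the critic)
    turns: List[Turn] = field(default_factory=list)
    reward: float = 0.0                # e.g. final unit-test pass rate
```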

Phase 2: Critic Training

1. Construct preference pairs:
   - Same task, different trajectories
   - Higher cumulative reward = chosen
2. Train critic with Bradley-Terry objective
3. Include training-time info (ground truth) as input
4. Validate on held-out preference pairs
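Constructing the critic's preference pairs from offline trajectories might look like this sketch (it reuses the hypothetical Trajectory record from the Phase 1 sketch):

```python
from itertools import combinations
from typing import Dict, List, Tuple

def build_trajectory_pairs(trajectories: List[Trajectory]) -> List[Tuple[Trajectory, Trajectory]]:
    """Pair trajectories for the same task so the one with the higher
    cumulative reward is 'chosen' and the other is 'rejected'."""
    by_task: Dict[str, List[Trajectory]] = {}
    for traj in trajectories:
        by_task.setdefault(traj.task_id, []).append(traj)

    pairs = []
    for same_task in by_task.values():
        for a, b in combinations(same_task, 2):
            if a.reward == b.reward:
                continue  # ties carry no preference signal
            pairs.append((a, b) if a.reward > b.reward else (b, a))
    return pairs
```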

Phase 3: Actor Training

1. For each turn in each trajectory:
   - Sample 16 candidate responses
   - Score with trained critic
   - Top 8 = chosen, bottom 8 = rejected
2. Apply DPO loss on turn-level preferences
3. Repeat for multiple epochs

Key Parameters

| Parameter | Value | Notes |
|-----------|-------|-------|
| Candidate samples per turn | 16 | For Best-of-N during training |
| Chosen/rejected split | 50/50 | Top half vs bottom half |
| Min training trajectories | 6,000 | Below this, use Multi-Turn DPO |
| Max turns per trajectory | 10 | ColBench default |
| Length normalization | Required | Divide advantage by token count |
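These defaults can be kept in a single config object; the dictionary below is an illustrative convenience, not an option file from the official repo.

```python
sweet_rl_defaults = {
    "candidates_per_turn": 16,   # Best-of-N sampling during actor training
    "chosen_fraction": 0.5,      # top half chosen, bottom half rejected
    "min_trajectories": 6_000,   # below this, prefer Multi-Turn DPO
    "max_turns": 10,             # ColBench default horizon
    "length_normalize": True,    # divide the advantage by action token count
}
```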

Adaptation to Your Domain

Customer Support Bot:

  • Ground truth: Expert response transcripts
  • Success metric: Resolution rate + customer satisfaction
  • Turns: Typically 5-15

Code Assistant:

  • Ground truth: Correct code implementation
  • Success metric: Unit test pass rate
  • Turns: Typically 3-10

Tutoring System:

  • Ground truth: Ideal explanation path
  • Success metric: Student quiz performance
  • Turns: Typically 5-20

Limitations

  1. Data requirements: SWEET-RL needs more data than Multi-Turn DPO to train an effective critic. At 3K samples, it underperforms.

  2. Ground truth dependency: You need training-time information (reference solutions). For tasks without clear ground truth, this is not applicable.

  3. Offline only: The paper tests offline RL (learning from collected data). Online learning with human feedback remains expensive.

  4. Domain tested: Only validated on artifact creation (code, webpages). Generalization to other domains (dialogue, planning) is untested.

  5. Compute cost: Training both critic and actor requires roughly 2x the compute of standard fine-tuning.

Cold Start Solution: Bootstrap with GPT-4

If you lack 6K+ trajectories, bootstrap your dataset: use GPT-4 to generate initial trajectories for your tasks, then label success/failure with your automated tests or human review. This gives you the data volume needed to train the SWEET-RL critic. As your own model improves, gradually replace GPT-4 trajectories with self-generated ones.

The Bottom Line

For ML engineers building multi-turn agents:

SWEET-RL is the current best method if you have:

  • 6K+ training trajectories
  • Ground truth artifacts for training
  • Multi-turn tasks (3+ turns)

The 6% improvement over Multi-Turn DPO is significant, and matching GPT-4o with an 8B model means major cost savings at deployment.

For engineering managers:

Budget 2-3 weeks for integration. The official Facebook Research code provides a solid starting point. ROI calculation: if your 8B model can match GPT-4o quality, inference costs at scale drop roughly 10x.

For researchers:

The asymmetric actor-critic design (critic sees privileged information, actor does not) is a generalizable pattern. The key insight is that training-time information should inform credit assignment, not just final outcomes.

Official resources:

Authors

Yifei Zhou (UC Berkeley), Song Jiang (Meta AI), Yuandong Tian (Meta AI), Jason Weston (Meta AI), Sergey Levine (UC Berkeley), Sainbayar Sukhbaatar (Meta AI), Xian Li (Meta AI)

Cite this paper

Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li (2025). SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks. arXiv 2025.
