arXiv · November 13, 2025
Applies to: API Cost Reduction, Model Compression, Edge Deployment, Cross-Architecture Transfer, Enterprise AI

Black-Box On-Policy Distillation: Learning from Closed-Source LLMs

Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei (Microsoft Research)

GAD (Generative Adversarial Distillation) solves a fundamental problem: how do you train a student model from a teacher you can't look inside? By treating distillation as an adversarial game—student as generator, discriminator as evolving judge—GAD enables effective learning from text outputs alone. The discriminator co-evolves with the student, providing stable feedback that prevents the reward hacking seen in fixed reward approaches.

Categories: Large Language Models, Knowledge Distillation, Machine Learning
Topics: Black-Box Distillation, Generative Adversarial Networks, Model Compression, On-Policy Learning, LLM Training

Key Findings

  1. 14B student model matches GPT-5-Chat teacher on LMSYS-Chat benchmark
  2. On-policy discriminator prevents reward hacking that plagues fixed reward models
  3. Works across incompatible architectures and tokenizers (e.g., GPT-5-Chat teacher to Llama students)
  4. 3B model trained with GAD matches 7B model trained with standard distillation
  5. Superior out-of-distribution generalization vs sequence-level distillation
  6. 30 hours on 16 H100 GPUs to distill a 14B model

TL;DR
  1. The Problem. You want GPT-5 quality but can’t afford API costs. Standard distillation requires access to teacher model internals — impossible with closed-source models

  2. The Fix. GAD treats distillation as an adversarial game. A discriminator learns to distinguish teacher from student outputs, providing a training signal that evolves with the student to prevent reward hacking

  3. The Result. A 14B student model matches its GPT-5 teacher on LMSYS-Chat. A 3B GAD model matches a 7B standard-distilled model — enabling 2x compression at equivalent performance

Research Overview

You want to build a model that performs like GPT-5, but you can’t afford the API costs at scale. You can’t see GPT-5’s internal weights or training data. All you have is its text outputs. How do you transfer that capability to your own model?

This is the black-box distillation problem, and GAD (Generative Adversarial Distillation) offers a compelling solution. By framing distillation as an adversarial game—similar to how GANs generate images—GAD enables effective knowledge transfer from closed-source models.

The Key Insight

Traditional distillation requires access to the teacher’s internal logits (probability distributions over tokens). Black-box distillation works from text outputs alone. GAD makes this work by training a discriminator to distinguish teacher vs student responses, then using that discriminator as a reward signal for the student.

Key Results at a Glance

Student Model | Before | Standard Distill | GAD  | Teacher
Qwen2.5-3B    | 45.8   | 47.5             | 48.9 | 51.7
Qwen2.5-7B    | 48.7   | 49.2             | 50.8 | 51.7
Qwen2.5-14B   | 50.0   | 50.6             | 52.1 | 51.7

The 14B student trained with GAD actually exceeds its GPT-5-Chat teacher on the LMSYS-Chat benchmark.

Note on “GPT-5-Chat”

The paper names its teacher “GPT-5-Chat” and treats it strictly as a black box: only text outputs are used, never logits, weights, or training data. Nothing in the method is specific to this teacher; the same principles apply to any closed-source frontier model, such as GPT-4o or Claude.

GAD vs Standard Distillation (LMSYS-Chat)

GPT-4o evaluation scores. Teacher (GPT-5-Chat) = 51.7

The Black-Box Problem

Why Can’t We Just Copy the Outputs?

The naive approach to black-box distillation is Sequence-Level Knowledge Distillation (SeqKD): collect teacher responses, fine-tune the student to reproduce them. This is essentially supervised learning on teacher outputs.
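
For concreteness, here is a minimal sketch of what SeqKD amounts to, assuming a Hugging Face-style causal LM whose forward pass accepts labels and returns a cross-entropy loss; the helper below is illustrative, not the paper’s exact training recipe.

```python
# Illustrative SeqKD step: the teacher is only ever queried for text, and the
# student is fine-tuned with ordinary next-token cross-entropy on that text.

def seqkd_step(student, tokenizer, prompt, teacher_response, optimizer):
    """One supervised step on a (prompt, teacher_response) pair."""
    ids = tokenizer(prompt + teacher_response, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

    labels = ids.clone()
    labels[:, :prompt_len] = -100     # mask the prompt; loss only on teacher tokens

    out = student(input_ids=ids, labels=labels)   # HF-style causal-LM forward
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```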

SeqKD works, but poorly. The problem is mode-covering behavior: the student tries to assign probability mass to all possible teacher outputs, even those it can’t actually generate well. This leads to:

  • Overfitting to surface patterns (N-grams, formatting)
  • Poor generalization to new prompts
  • Mediocre performance despite extensive training

Mode-Covering vs Mode-Seeking

Mode-covering (SeqKD): “Try to be somewhat good at everything the teacher does.” Spreads probability across all teacher behaviors, even unreachable ones.

Mode-seeking (GAD): “Be excellent at the subset of teacher behaviors you can actually achieve.” Concentrates on reachable, high-quality responses.

The Blurry Image Analogy: Think of SeqKD as averaging all possible answers, resulting in a “blurry” generic response. GAD forces the model to commit to one high-quality answer, producing a “sharp,” distinctive response—even if it ignores some other valid phrasings the teacher might have used.
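
One standard way to make this distinction precise (a textbook framing from the distillation literature, not an equation taken from the paper; conditioning on the prompt is suppressed): write p for the teacher’s distribution over responses y and q_θ for the student’s.

```latex
% SeqKD trains on samples from the teacher, so (up to the teacher's entropy) it
% minimizes the forward KL: a mode-covering objective that punishes the student
% wherever the teacher has probability mass the student cannot match.
\mathcal{L}_{\text{SeqKD}}(\theta) \;\approx\; D_{\mathrm{KL}}(p \,\Vert\, q_\theta)
    = \mathbb{E}_{y \sim p}\big[\log p(y) - \log q_\theta(y)\big]

% Mode-seeking objectives instead take expectations under the student's own
% samples; reverse KL is the canonical example, and GAD's adversarial reward
% plays an analogous role because the student is scored on what it generates.
\mathcal{L}_{\text{mode-seeking}}(\theta) \;\approx\; D_{\mathrm{KL}}(q_\theta \,\Vert\, p)
    = \mathbb{E}_{y \sim q_\theta}\big[\log q_\theta(y) - \log p(y)\big]
```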

The Missing Ingredient: Feedback on Your Own Outputs

SeqKD only learns from teacher outputs. The student never gets feedback on its own generations. This is like learning to paint by only looking at masterpieces, never seeing your own work critiqued.

GAD fixes this with on-policy learning: the student generates responses, and a discriminator evaluates them in real-time. The student learns not just “what good looks like” but “how to make my outputs better.”

How GAD Works

GAD frames distillation as a two-player game, similar to Generative Adversarial Networks (GANs) for image generation.

GAD: Generative Adversarial Distillation Framework

Student (generator) vs Discriminator in adversarial training

The Players

Generator (Student LLM): Produces responses to prompts. Its goal is to fool the discriminator into thinking its outputs came from the teacher.

Discriminator: Learns to distinguish teacher responses from student responses. It assigns higher scores to teacher outputs and lower scores to student outputs.

The Training Loop

Phase 1: Warmup (1 epoch)

Both components need a reasonable starting point before the adversarial game begins:

  • Student warmup: Fine-tunes on teacher responses (standard SeqKD), giving it a “best initial guess” at teacher-like behavior
  • Discriminator warmup: Learns to distinguish teacher responses from the student’s SFT outputs—essentially learning “what makes the teacher different from a naive copy”

This bootstrapping is critical. If the discriminator starts random, it provides no useful signal. If the student starts random, the discriminator trivially rejects everything. The warmup ensures both players enter the game at a competitive level.
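
A sketch of how the two warmup phases could be wired together. The structure is illustrative only: `query_teacher`, `finetune_on`, `sample_from`, and `bt_discriminator_loss` are hypothetical helpers (the last is sketched below under “The Discriminator as Reward Model”), and batching/scheduling details are omitted.

```python
# Phase 1 warmup, sketched with hypothetical helpers passed in as arguments:
#   query_teacher(prompt)        -> teacher response text (black-box API call)
#   finetune_on(student, data)   -> one SeqKD/SFT epoch on (prompt, text) pairs
#   sample_from(student, prompt) -> one student-generated response
#   bt_discriminator_loss(...)   -> pairwise loss sketched in the next section

def warmup(student, discriminator, prompts,
           query_teacher, finetune_on, sample_from,
           bt_discriminator_loss, disc_optimizer):
    # 1) Student warmup: one epoch of plain SeqKD on teacher text.
    teacher_data = [(p, query_teacher(p)) for p in prompts]
    finetune_on(student, teacher_data)

    # 2) Discriminator warmup: learn to prefer the teacher's response over
    #    the SFT student's own response to the same prompt.
    for prompt, teacher_text in teacher_data:
        student_text = sample_from(student, prompt)
        loss = bt_discriminator_loss(discriminator, prompt,
                                     teacher_text, student_text)
        loss.backward()
        disc_optimizer.step()
        disc_optimizer.zero_grad()
```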

Phase 2: Adversarial Training (2 epochs)

The game begins:

  1. Student generates N=8 responses per prompt
  2. Discriminator scores each response
  3. Student updates to maximize discriminator scores (using policy gradient)
  4. Discriminator updates to better distinguish teacher from improved student

Why This Works

The discriminator acts as an evolving reward model that tracks the student’s progress. As the student improves, the discriminator adapts, always providing meaningful feedback. This co-evolution prevents the reward hacking that plagues static reward models.
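
Putting the pieces together, here is a minimal sketch of one adversarial step. It is illustrative, not the paper’s implementation: a simple REINFORCE-style policy gradient stands in for whatever RL algorithm the authors actually use, and `generate_with_logprob`, `discriminator.score`, and `bt_discriminator_loss` are assumed helpers (the last two are sketched in the next section). The `update_discriminator` flag exists only to contrast with the off-policy baseline discussed below.

```python
import torch

def gad_adversarial_step(student, discriminator, prompt, teacher_text,
                         student_opt, disc_opt,
                         generate_with_logprob, bt_discriminator_loss,
                         num_samples=8, update_discriminator=True):
    """One GAD step on a single prompt (batching omitted for clarity)."""
    # 1) Student generates N on-policy samples; each comes back as
    #    (text, grad-tracked sum of token log-probs), e.g. by re-scoring.
    samples = [generate_with_logprob(student, prompt) for _ in range(num_samples)]
    texts, logprobs = zip(*samples)

    # 2) Discriminator scores each sample: "how teacher-like is this?"
    with torch.no_grad():
        rewards = torch.stack([discriminator.score(prompt, t) for t in texts])

    # 3) Student update: policy gradient toward higher discriminator scores,
    #    with mean-centered rewards as a crude variance-reduction baseline.
    advantages = rewards - rewards.mean()
    policy_loss = -(advantages * torch.stack(logprobs)).mean()
    policy_loss.backward()
    student_opt.step()
    student_opt.zero_grad()

    # 4) Discriminator update on the *current* student's outputs (on-policy).
    #    Skipping this step (update_discriminator=False) reproduces the frozen,
    #    off-policy reward model that ends up being reward-hacked.
    if update_discriminator:
        disc_loss = bt_discriminator_loss(discriminator, prompt,
                                          teacher_text, texts[0])
        disc_loss.backward()
        disc_opt.step()
        disc_opt.zero_grad()
```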

The Discriminator as Reward Model

The discriminator is initialized from the student’s weights with an added prediction head. It takes a prompt-response pair and outputs a scalar score.

Not Your Typical GAN Discriminator

Unlike a traditional GAN discriminator that outputs binary “real or fake,” the GAD discriminator assigns a continuous scalar score representing “how teacher-like is this response?” This dense reward signal provides rich gradient information—the student learns not just “wrong” but “how wrong and in what direction to improve.”

The training objective uses Bradley-Terry preference modeling: given a teacher response and student response, maximize the probability that the teacher scores higher. This relative comparison is more robust than absolute scoring.
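
A sketch of how such a discriminator and its Bradley-Terry objective might look. Assumptions: the backbone is a Hugging Face-style causal LM initialized from the student, and the scalar head, last-token pooling, and helper names are illustrative choices, not details confirmed by the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Student backbone plus a scalar head: maps (prompt, response) text to a score."""
    def __init__(self, backbone, hidden_size, tokenizer):
        super().__init__()
        self.backbone = backbone                 # initialized from student weights
        self.tokenizer = tokenizer
        self.value_head = nn.Linear(hidden_size, 1)

    def score(self, prompt, response):
        ids = self.tokenizer(prompt + response, return_tensors="pt").input_ids
        out = self.backbone(input_ids=ids, output_hidden_states=True)
        hidden = out.hidden_states[-1]           # (1, seq_len, hidden_size)
        return self.value_head(hidden[:, -1]).squeeze()   # score from last token

def bt_discriminator_loss(disc, prompt, teacher_text, student_text):
    """Bradley-Terry objective: the teacher response should out-score the student's."""
    margin = disc.score(prompt, teacher_text) - disc.score(prompt, student_text)
    return -F.logsigmoid(margin)
```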

On-Policy vs Off-Policy

The “on-policy” aspect of GAD is crucial. Let’s see why.

On-Policy vs Off-Policy: Avoiding Reward Hacking

A frozen (off-policy) discriminator gets exploited; an on-policy discriminator co-evolves with the student

Off-Policy Approach (Baseline)

  1. Train discriminator on initial student outputs
  2. Freeze discriminator
  3. Use frozen discriminator as reward model
  4. Train student via reinforcement learning

Problem: The student quickly learns to exploit the frozen discriminator. After ~300 steps, it starts generating excessively long responses (up to 1,300 tokens) that score well on the static reward model but are actually worse. This is reward hacking.

On-Policy Approach (GAD)

  1. Train discriminator and student simultaneously
  2. Discriminator continuously updates on current student outputs
  3. No opportunity for reward hacking—discriminator adapts to any exploit

Result: Stable training for thousands of steps without reward hacking. The student improves genuinely rather than gaming a fixed metric.

The Stability Principle

In traditional RL with fixed rewards, the reward function defines a static landscape the model can exploit. In GAD, the “reward landscape” (discriminator) evolves with the student. Exploits get patched as soon as the discriminator notices them. This creates a moving target that drives genuine improvement.

Benchmark Results

Primary Evaluation: LMSYS-Chat

LMSYS-Chat contains real user conversations with LLMs, providing a realistic test of general chat ability.

Model        | Before | SeqKD | GAD  | Δ vs SeqKD
Qwen2.5-3B   | 45.8   | 47.5  | 48.9 | +1.4
Qwen2.5-7B   | 48.7   | 49.2  | 50.8 | +1.6
Qwen2.5-14B  | 50.0   | 50.6  | 52.1 | +1.5
Llama-3.2-3B | 44.0   | 47.6  | 48.1 | +0.5
Llama-3.1-8B | 46.9   | 49.7  | 50.3 | +0.6

GAD consistently outperforms SeqKD across all model sizes and architectures.

Out-of-Distribution Generalization

The real test: how do models perform on datasets they weren’t trained on?

Dataset       | SeqKD | GAD  | Δ
Dolly         | 49.1  | 50.2 | +1.1
Self-Instruct | 48.7  | 49.8 | +1.1
Vicuna        | 50.3  | 51.2 | +0.9

SeqKD shows “marginal or even negative improvements” on out-of-distribution data. GAD maintains robust gains, demonstrating that it learns transferable capabilities rather than dataset-specific patterns.

Human Evaluation

GPT-4 scores can be gamed. Do humans agree?

GAD achieved win rates exceeding 50% and loss rates below 30% versus SeqKD across all tested models in pairwise human evaluation. The improvements are perceptible to human judges, not just automatic metrics.

Compression Efficiency

Key finding: Qwen2.5-3B trained with GAD matches Qwen2.5-7B trained with SeqKD.

This means GAD enables ~2x model compression at equivalent performance. For deployment, this translates to:

  • 2x faster inference
  • 2x less memory
  • Significant cost savings at scale

Cross-Architecture Transfer

One practical challenge: your teacher (GPT-5) and student (Llama) might use incompatible tokenizers. White-box distillation methods that compare token distributions simply don’t work.

GAD handles this naturally because it only uses text outputs. The discriminator compares complete responses, not token-level probabilities.

Results: Distilling GPT-5 → Llama

Model        | SeqKD | GAD
Llama-3.2-3B | 47.6  | 48.1
Llama-3.1-8B | 49.7  | 50.3

GAD successfully transfers knowledge across architecture and tokenizer boundaries (GPT-5-Chat teacher to Llama students), enabling practical deployment scenarios where your preferred deployment architecture differs from available teacher models.

Business Implications

This paper has significant ramifications across the AI industry. Here’s what different stakeholders can expect:

For Developers

Reduced API Dependency: Production systems can transition from expensive API calls to self-hosted models. A team paying $50K/month in API costs could instead invest in one-time training (~$10-15K at current H100 rates) plus much cheaper inference.
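
As a back-of-the-envelope illustration of that trade-off (a sketch using the article’s own rough figures; the self-hosting cost below is an assumed placeholder, not a measured number):

```python
# Rough break-even sketch. The first two numbers come from the example above;
# the self-hosting figure is an ASSUMPTION for GPU hosting + ops.
api_cost_per_month = 50_000   # current API spend (illustrative, from the text)
one_time_training = 15_000    # upper end of the ~$10-15K distillation estimate
selfhost_per_month = 8_000    # assumed cost of serving the distilled model yourself

monthly_savings = api_cost_per_month - selfhost_per_month
breakeven_months = one_time_training / monthly_savings
print(f"Break-even after ~{breakeven_months:.1f} months")  # ~0.4 months here
```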

Cross-Architecture Freedom: Your deployment target doesn’t need to match the teacher. Distill from GPT-5 to Llama for better open-source tooling, or to specialized architectures optimized for your hardware.

Specialization Path: Train a general-purpose student, then fine-tune for your domain. The distilled model serves as a strong foundation that’s cheaper to customize than starting from scratch.

For Business Owners

Predictable Costs: Per-token API pricing creates unpredictable expenses that scale with success. Self-hosted models have fixed infrastructure costs regardless of usage volume.

Data Privacy by Design: Customer prompts never leave your infrastructure. This matters for healthcare, finance, legal, and any regulated industry where data residency is a concern.

Competitive Moat: If your competitor depends on the same API you do, you have no differentiation. A distilled model trained on your use cases becomes a defensible asset.

For Enterprise AI Teams

Vendor Independence: Reduce lock-in to any single AI provider. If OpenAI changes pricing, policies, or service quality, you’re not held hostage.

Compliance Readiness: Many enterprises struggle with AI adoption due to compliance requirements around data handling. On-premises deployment eliminates many regulatory concerns.

Customization Control: Unlike API-based models, you can fine-tune your distilled model for specific workflows, add domain knowledge, or adjust behavior without waiting for provider updates.

For AI Practitioners

Democratization Signal: Frontier capabilities are no longer gated by capital-intensive pre-training. A research team with 16 H100s (accessible at many universities) can create competitive models.

New Research Direction: The on-policy discriminator approach opens questions about co-evolutionary training dynamics, transferable learning signals, and architectural invariance in distillation.

For Model Providers (OpenAI, Anthropic, Google)

Moat Erosion: If customers can distill your model’s capabilities into their own, what’s your long-term value proposition? This paper signals that API access may be sufficient for capability transfer.

Potential Responses: Providers may need to evolve toward:

  • Superior reasoning capabilities that resist distillation
  • Agentic features (tool use, memory) harder to replicate
  • Faster iteration cycles that outpace distillation efforts
  • Service differentiation (reliability, support, integrations)

Limitations

Computational Cost

GAD requires significantly more compute than standard fine-tuning:

  • 30 hours on 16 H100 GPUs for a 14B model
  • N=8 responses generated per prompt (vs 1 for SeqKD)
  • Discriminator training overhead

For smaller organizations, this cost may be prohibitive. However, context matters: pre-training a 14B model from scratch would take months on thousands of GPUs. GAD offers frontier-class capability at a fraction of that cost—expensive for fine-tuning, but cheap for obtaining a GPT-5-tier model.

Warmup Sensitivity

Performance depends heavily on proper warmup:

  • Without generator warmup: -1.1 points on LMSYS
  • Without discriminator warmup: -2.3 points on out-of-distribution tests

The warmup stage requires careful tuning that may not transfer across model families.

Teacher Quality Bound

GAD primarily aims to match the teacher’s output distribution. Because the discriminator rewards “teacher-like” quality, the student might occasionally produce better individual answers than the teacher’s average—but it’s fundamentally bounded by what the teacher demonstrates. If your teacher has systematic biases or errors, the student inherits them. The discriminator learns “what the teacher does,” not “what is objectively correct.”

When GAD Isn’t the Right Choice

If you have access to the teacher’s logits (white-box access), traditional distillation methods are more efficient. GAD’s value is specifically for closed-source teachers where black-box access is all you have.

Response Length Dynamics

GAD maintains the student’s original response length distribution rather than matching the teacher’s. For applications where teacher-length responses are specifically desired, this behavior may require additional tuning.

Conclusion

GAD demonstrates that effective distillation from closed-source models is possible. By framing the problem as an adversarial game with an evolving discriminator, it overcomes the reward hacking that plagues simpler approaches.

Key Takeaways:

  1. Black-box distillation works: You don’t need teacher logits to transfer knowledge effectively
  2. On-policy is crucial: Co-evolving discriminator prevents reward hacking
  3. Compression bonus: ~2x effective model compression vs standard distillation
  4. Cross-architecture: Works across incompatible tokenizer families
  5. Real improvements: Human evaluators confirm the gains aren’t metric artifacts

For organizations dependent on proprietary API costs, GAD offers a practical path to building competitive internal models. While the training cost is significant, the long-term operational savings and deployment flexibility may justify the investment.


Original paper: arXiv · PDF · HTML

Cite this paper

Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei (2025). Black-Box On-Policy Distillation: Learning from Closed-Source LLMs. arXiv 2025.