arXiv · November 13, 2025
Applies to: API Cost Reduction, Model Compression, Edge Deployment, Cross-Architecture Transfer, Enterprise AI

Black-Box On-Policy Distillation: Learning from Closed-Source LLMs

Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei (Microsoft Research)

GAD (Generative Adversarial Distillation) solves a fundamental problem: how do you train a student model from a teacher you can't look inside? By treating distillation as an adversarial game—student as generator, discriminator as evolving judge—GAD enables effective learning from text outputs alone. The discriminator co-evolves with the student, providing stable feedback that prevents the reward hacking seen in fixed reward approaches.

Categories: Large Language Models, Knowledge Distillation, Machine Learning
Topics: Black-Box Distillation, Generative Adversarial Networks, Model Compression, On-Policy Learning, LLM Training

Key Findings

  1. 14B student model matches GPT-5-Chat teacher on LMSYS-Chat benchmark
  2. On-policy discriminator prevents reward hacking that plagues fixed reward models
  3. Works across incompatible architectures and tokenizers (e.g., GPT-5-Chat teacher to Llama students)
  4. 3B model trained with GAD matches 7B model trained with standard distillation
  5. Superior out-of-distribution generalization vs sequence-level distillation
  6. 30 hours on 16 H100 GPUs to distill a 14B model

TL;DR
  1. The Problem. You want GPT-5 quality but can’t afford API costs. Standard distillation requires access to teacher model internals — impossible with closed-source models

  2. The Fix. GAD treats distillation as an adversarial game. A discriminator learns to distinguish teacher from student outputs, providing a training signal that evolves with the student to prevent reward hacking

  3. The Result. A 14B student model matches its GPT-5 teacher on LMSYS-Chat. A 3B GAD model matches a 7B standard-distilled model — enabling 2x compression at equivalent performance

Research Overview

You want to build a model that performs like GPT-5, but you can’t afford the API costs at scale. You can’t see GPT-5’s internal weights or training data. All you have is its text outputs. How do you transfer that capability to your own model?

This is the black-box distillation problem, and GAD (Generative Adversarial Distillation) offers a compelling solution. By framing distillation as an adversarial game—similar to how GANs generate images—GAD enables effective knowledge transfer from closed-source models.

The Key Insight

Traditional distillation requires access to the teacher’s internal logits (probability distributions over tokens). Black-box distillation works from text outputs alone. GAD makes this work by training a discriminator to distinguish teacher vs student responses, then using that discriminator as a reward signal for the student.

Key Results at a Glance

Student Model | Before | Standard Distill | GAD  | Teacher
Qwen2.5-3B    | 45.8   | 47.5             | 48.9 | 51.7
Qwen2.5-7B    | 48.7   | 49.2             | 50.8 | 51.7
Qwen2.5-14B   | 50.0   | 50.6             | 52.1 | 51.7

The 14B student trained with GAD actually exceeds its GPT-5-Chat teacher on the LMSYS-Chat benchmark.

Note on “GPT-5-Chat”

The paper names its teacher “GPT-5-Chat” and treats it strictly as a black box: only text outputs are used, never logits, weights, or training data. Nothing in the method is specific to this teacher; the same principles apply to any closed-source frontier model, such as GPT-4o or Claude.

GAD vs Standard Distillation (LMSYS-Chat)

GPT-4o evaluation scores. Teacher (GPT-5-Chat) = 51.7

The Black-Box Problem

Why Can’t We Just Copy the Outputs?

The naive approach to black-box distillation is Sequence-Level Knowledge Distillation (SeqKD): collect teacher responses, fine-tune the student to reproduce them. This is essentially supervised learning on teacher outputs.
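
For concreteness, here is a minimal sketch of what SeqKD amounts to, assuming a Hugging Face-style causal LM whose forward pass accepts labels and returns a cross-entropy loss; the helper below is illustrative, not the paper’s exact training recipe.

```python
# Illustrative SeqKD step: the teacher is only ever queried for text, and the
# student is fine-tuned with ordinary next-token cross-entropy on that text.

def seqkd_step(student, tokenizer, prompt, teacher_response, optimizer):
    """One supervised step on a (prompt, teacher_response) pair."""
    ids = tokenizer(prompt + teacher_response, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

    labels = ids.clone()
    labels[:, :prompt_len] = -100     # mask the prompt; loss only on teacher tokens

    out = student(input_ids=ids, labels=labels)   # HF-style causal-LM forward
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```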

SeqKD works, but poorly. The problem is mode-covering behavior: the student tries to assign probability mass to all possible teacher outputs, even those it can’t actually generate well. This leads to:

  • Overfitting to surface patterns (N-grams, formatting)
  • Poor generalization to new prompts
  • Mediocre performance despite extensive training

Mode-Covering vs Mode-Seeking

Mode-covering (SeqKD): “Try to be somewhat good at everything the teacher does.” Spreads probability across all teacher behaviors, even unreachable ones.

Mode-seeking (GAD): “Be excellent at the subset of teacher behaviors you can actually achieve.” Concentrates on reachable, high-quality responses.

The Blurry Image Analogy: Think of SeqKD as averaging all possible answers, resulting in a “blurry” generic response. GAD forces the model to commit to one high-quality answer, producing a “sharp,” distinctive response—even if it ignores some other valid phrasings the teacher might have used.
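
One standard way to make this distinction precise (a textbook framing from the distillation literature, not an equation taken from the paper; conditioning on the prompt is suppressed): write p for the teacher’s distribution over responses y and q_θ for the student’s.

```latex
% SeqKD trains on samples from the teacher, so (up to the teacher's entropy) it
% minimizes the forward KL: a mode-covering objective that punishes the student
% wherever the teacher has probability mass the student cannot match.
\mathcal{L}_{\text{SeqKD}}(\theta) \;\approx\; D_{\mathrm{KL}}(p \,\Vert\, q_\theta)
    = \mathbb{E}_{y \sim p}\big[\log p(y) - \log q_\theta(y)\big]

% Mode-seeking objectives instead take expectations under the student's own
% samples; reverse KL is the canonical example, and GAD's adversarial reward
% plays an analogous role because the student is scored on what it generates.
\mathcal{L}_{\text{mode-seeking}}(\theta) \;\approx\; D_{\mathrm{KL}}(q_\theta \,\Vert\, p)
    = \mathbb{E}_{y \sim q_\theta}\big[\log q_\theta(y) - \log p(y)\big]
```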

The Missing Ingredient: Feedback on Your Own Outputs

SeqKD only learns from teacher outputs. The student never gets feedback on its own generations. This is like learning to paint by only looking at masterpieces, never seeing your own work critiqued.

GAD fixes this with on-policy learning: the student generates responses, and a discriminator evaluates them in real-time. The student learns not just “what good looks like” but “how to make my outputs better.”

How GAD Works

GAD frames distillation as a two-player game, similar to Generative Adversarial Networks (GANs) for image generation.

GAD: Generative Adversarial Distillation Framework

Student (generator) vs Discriminator in adversarial training

The Players

Generator (Student LLM): Produces responses to prompts. Its goal is to fool the discriminator into thinking its outputs came from the teacher.

Discriminator: Learns to distinguish teacher responses from student responses. It assigns higher scores to teacher outputs and lower scores to student outputs.

The Training Loop

Phase 1: Warmup (1 epoch)

Both components need a reasonable starting point before the adversarial game begins:

  • Student warmup: Fine-tunes on teacher responses (standard SeqKD), giving it a “best initial guess” at teacher-like behavior
  • Discriminator warmup: Learns to distinguish teacher responses from the student’s SFT outputs—essentially learning “what makes the teacher different from a naive copy”

This bootstrapping is critical. If the discriminator starts random, it provides no useful signal. If the student starts random, the discriminator trivially rejects everything. The warmup ensures both players enter the game at a competitive level.
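
A sketch of how the two warmup phases could be wired together. The structure is illustrative only: `query_teacher`, `finetune_on`, `sample_from`, and `bt_discriminator_loss` are hypothetical helpers (the last is sketched below under “The Discriminator as Reward Model”), and batching/scheduling details are omitted.

```python
# Phase 1 warmup, sketched with hypothetical helpers passed in as arguments:
#   query_teacher(prompt)        -> teacher response text (black-box API call)
#   finetune_on(student, data)   -> one SeqKD/SFT epoch on (prompt, text) pairs
#   sample_from(student, prompt) -> one student-generated response
#   bt_discriminator_loss(...)   -> pairwise loss sketched in the next section

def warmup(student, discriminator, prompts,
           query_teacher, finetune_on, sample_from,
           bt_discriminator_loss, disc_optimizer):
    # 1) Student warmup: one epoch of plain SeqKD on teacher text.
    teacher_data = [(p, query_teacher(p)) for p in prompts]
    finetune_on(student, teacher_data)

    # 2) Discriminator warmup: learn to prefer the teacher's response over
    #    the SFT student's own response to the same prompt.
    for prompt, teacher_text in teacher_data:
        student_text = sample_from(student, prompt)
        loss = bt_discriminator_loss(discriminator, prompt,
                                     teacher_text, student_text)
        loss.backward()
        disc_optimizer.step()
        disc_optimizer.zero_grad()
```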

Phase 2: Adversarial Training (2 epochs)

The game begins:

  1. Student generates N=8 responses per prompt
  2. Discriminator scores each response
  3. Student updates to maximize discriminator scores (using policy gradient)
  4. Discriminator updates to better distinguish teacher from improved student

Why This Works

The discriminator acts as an evolving reward model that tracks the student’s progress. As the student improves, the discriminator adapts, always providing meaningful feedback. This co-evolution prevents the reward hacking that plagues static reward models.
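
Putting the pieces together, here is a minimal sketch of one adversarial step. It is illustrative, not the paper’s implementation: a simple REINFORCE-style policy gradient stands in for whatever RL algorithm the authors actually use, and `generate_with_logprob`, `discriminator.score`, and `bt_discriminator_loss` are assumed helpers (the last two are sketched in the next section). The `update_discriminator` flag exists only to contrast with the off-policy baseline discussed below.

```python
import torch

def gad_adversarial_step(student, discriminator, prompt, teacher_text,
                         student_opt, disc_opt,
                         generate_with_logprob, bt_discriminator_loss,
                         num_samples=8, update_discriminator=True):
    """One GAD step on a single prompt (batching omitted for clarity)."""
    # 1) Student generates N on-policy samples; each comes back as
    #    (text, grad-tracked sum of token log-probs), e.g. by re-scoring.
    samples = [generate_with_logprob(student, prompt) for _ in range(num_samples)]
    texts, logprobs = zip(*samples)

    # 2) Discriminator scores each sample: "how teacher-like is this?"
    with torch.no_grad():
        rewards = torch.stack([discriminator.score(prompt, t) for t in texts])

    # 3) Student update: policy gradient toward higher discriminator scores,
    #    with mean-centered rewards as a crude variance-reduction baseline.
    advantages = rewards - rewards.mean()
    policy_loss = -(advantages * torch.stack(logprobs)).mean()
    policy_loss.backward()
    student_opt.step()
    student_opt.zero_grad()

    # 4) Discriminator update on the *current* student's outputs (on-policy).
    #    Skipping this step (update_discriminator=False) reproduces the frozen,
    #    off-policy reward model that ends up being reward-hacked.
    if update_discriminator:
        disc_loss = bt_discriminator_loss(discriminator, prompt,
                                          teacher_text, texts[0])
        disc_loss.backward()
        disc_opt.step()
        disc_opt.zero_grad()
```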

The Discriminator as Reward Model

The discriminator is initialized from the student’s weights with an added prediction head. It takes a prompt-response pair and outputs a scalar score.

Not Your Typical GAN Discriminator

Unlike a traditional GAN discriminator that outputs binary “real or fake,” the GAD discriminator assigns a continuous scalar score representing “how teacher-like is this response?” This dense reward signal provides rich gradient information—the student learns not just “wrong” but “how wrong and in what direction to improve.”

The training objective uses Bradley-Terry preference modeling: given a teacher response and student response, maximize the probability that the teacher scores higher. This relative comparison is more robust than absolute scoring.
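
A sketch of how such a discriminator and its Bradley-Terry objective might look. Assumptions: the backbone is a Hugging Face-style causal LM initialized from the student, and the scalar head, last-token pooling, and helper names are illustrative choices, not details confirmed by the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Student backbone plus a scalar head: maps (prompt, response) text to a score."""
    def __init__(self, backbone, hidden_size, tokenizer):
        super().__init__()
        self.backbone = backbone                 # initialized from student weights
        self.tokenizer = tokenizer
        self.value_head = nn.Linear(hidden_size, 1)

    def score(self, prompt, response):
        ids = self.tokenizer(prompt + response, return_tensors="pt").input_ids
        out = self.backbone(input_ids=ids, output_hidden_states=True)
        hidden = out.hidden_states[-1]           # (1, seq_len, hidden_size)
        return self.value_head(hidden[:, -1]).squeeze()   # score from last token

def bt_discriminator_loss(disc, prompt, teacher_text, student_text):
    """Bradley-Terry objective: the teacher response should out-score the student's."""
    margin = disc.score(prompt, teacher_text) - disc.score(prompt, student_text)
    return -F.logsigmoid(margin)
```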

On-Policy vs Off-Policy

The “on-policy” aspect of GAD is crucial. Let’s see why.

On-Policy vs Off-Policy: Avoiding Reward Hacking

A frozen (off-policy) discriminator gets exploited; an on-policy discriminator co-evolves with the student

Off-Policy Approach (Baseline)

  1. Train discriminator on initial student outputs
  2. Freeze discriminator
  3. Use frozen discriminator as reward model
  4. Train student via reinforcement learning

Problem: The student quickly learns to exploit the frozen discriminator. After ~300 steps, it starts generating excessively long responses (up to 1,300 tokens) that score well on the static reward model but are actually worse. This is reward hacking.

On-Policy Approach (GAD)

  1. Train discriminator and student simultaneously
  2. Discriminator continuously updates on current student outputs
  3. No opportunity for reward hacking—discriminator adapts to any exploit

Result: Stable training for thousands of steps without reward hacking. The student improves genuinely rather than gaming a fixed metric.

The Stability Principle

In traditional RL with fixed rewards, the reward function defines a static landscape the model can exploit. In GAD, the “reward landscape” (discriminator) evolves with the student. Exploits get patched as soon as the discriminator notices them. This creates a moving target that drives genuine improvement.

Benchmark Results

Primary Evaluation: LMSYS-Chat

LMSYS-Chat contains real user conversations with LLMs, providing a realistic test of general chat ability.

Model        | Before | SeqKD | GAD  | Δ vs SeqKD
Qwen2.5-3B   | 45.8   | 47.5  | 48.9 | +1.4
Qwen2.5-7B   | 48.7   | 49.2  | 50.8 | +1.6
Qwen2.5-14B  | 50.0   | 50.6  | 52.1 | +1.5
Llama-3.2-3B | 44.0   | 47.6  | 48.1 | +0.5
Llama-3.1-8B | 46.9   | 49.7  | 50.3 | +0.6

GAD consistently outperforms SeqKD across all model sizes and architectures.

Out-of-Distribution Generalization

The real test: how do models perform on datasets they weren’t trained on?

Dataset       | SeqKD | GAD  | Δ
Dolly         | 49.1  | 50.2 | +1.1
Self-Instruct | 48.7  | 49.8 | +1.1
Vicuna        | 50.3  | 51.2 | +0.9

SeqKD shows “marginal or even negative improvements” on out-of-distribution data. GAD maintains robust gains, demonstrating that it learns transferable capabilities rather than dataset-specific patterns.

Human Evaluation

GPT-4 scores can be gamed. Do humans agree?

GAD achieved win rates exceeding 50% and loss rates below 30% versus SeqKD across all tested models in pairwise human evaluation. The improvements are perceptible to human judges, not just automatic metrics.

Compression Efficiency

Key finding: Qwen2.5-3B trained with GAD matches Qwen2.5-7B trained with SeqKD.

This means GAD enables ~2x model compression at equivalent performance. For deployment, this translates to:

  • 2x faster inference
  • 2x less memory
  • Significant cost savings at scale

Cross-Architecture Transfer

One practical challenge: your teacher (GPT-5) and student (Llama) might use incompatible tokenizers. White-box distillation methods that compare token distributions simply don’t work.

GAD handles this naturally because it only uses text outputs. The discriminator compares complete responses, not token-level probabilities.

Results: Distilling GPT-5 → Llama

Model        | SeqKD | GAD
Llama-3.2-3B | 47.6  | 48.1
Llama-3.1-8B | 49.7  | 50.3

GAD successfully transfers knowledge across architecture and tokenizer boundaries (GPT-5-Chat teacher to Llama students), enabling practical deployment scenarios where your preferred deployment architecture differs from available teacher models.

Business Implications

This paper has significant ramifications across the AI industry. Here’s what different stakeholders can expect:

For Developers

Reduced API Dependency: Production systems can transition from expensive API calls to self-hosted models. A team paying $50K/month in API costs could instead invest in one-time training (~$10-15K at current H100 rates) plus much cheaper inference.
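
As a back-of-the-envelope illustration of that trade-off (a sketch using the article’s own rough figures; the self-hosting cost below is an assumed placeholder, not a measured number):

```python
# Rough break-even sketch. The first two numbers come from the example above;
# the self-hosting figure is an ASSUMPTION for GPU hosting + ops.
api_cost_per_month = 50_000   # current API spend (illustrative, from the text)
one_time_training = 15_000    # upper end of the ~$10-15K distillation estimate
selfhost_per_month = 8_000    # assumed cost of serving the distilled model yourself

monthly_savings = api_cost_per_month - selfhost_per_month
breakeven_months = one_time_training / monthly_savings
print(f"Break-even after ~{breakeven_months:.1f} months")  # ~0.4 months here
```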

Cross-Architecture Freedom: Your deployment target doesn’t need to match the teacher. Distill from GPT-5 to Llama for better open-source tooling, or to specialized architectures optimized for your hardware.

Specialization Path: Train a general-purpose student, then fine-tune for your domain. The distilled model serves as a strong foundation that’s cheaper to customize than starting from scratch.

For Business Owners

Predictable Costs: Per-token API pricing creates unpredictable expenses that scale with success. Self-hosted models have fixed infrastructure costs regardless of usage volume.

Data Privacy by Design: Customer prompts never leave your infrastructure. This matters for healthcare, finance, legal, and any regulated industry where data residency is a concern.

Competitive Moat: If your competitor depends on the same API you do, you have no differentiation. A distilled model trained on your use cases becomes a defensible asset.

For Enterprise AI Teams

Vendor Independence: Reduce lock-in to any single AI provider. If OpenAI changes pricing, policies, or service quality, you’re not held hostage.

Compliance Readiness: Many enterprises struggle with AI adoption due to compliance requirements around data handling. On-premises deployment eliminates many regulatory concerns.

Customization Control: Unlike API-based models, you can fine-tune your distilled model for specific workflows, add domain knowledge, or adjust behavior without waiting for provider updates.

For AI Practitioners

Democratization Signal: Frontier capabilities are no longer gated by capital-intensive pre-training. A research team with 16 H100s (accessible at many universities) can create competitive models.

New Research Direction: The on-policy discriminator approach opens questions about co-evolutionary training dynamics, transferable learning signals, and architectural invariance in distillation.

For Model Providers (OpenAI, Anthropic, Google)

Moat Erosion: If customers can distill your model’s capabilities into their own, what’s your long-term value proposition? This paper signals that API access may be sufficient for capability transfer.

Potential Responses: Providers may need to evolve toward:

  • Superior reasoning capabilities that resist distillation
  • Agentic features (tool use, memory) harder to replicate
  • Faster iteration cycles that outpace distillation efforts
  • Service differentiation (reliability, support, integrations)

Limitations

Computational Cost

GAD requires significantly more compute than standard fine-tuning:

  • 30 hours on 16 H100 GPUs for a 14B model
  • N=8 responses generated per prompt (vs 1 for SeqKD)
  • Discriminator training overhead

For smaller organizations, this cost may be prohibitive. However, context matters: pre-training a 14B model from scratch would take months on thousands of GPUs. GAD offers frontier-class capability at a fraction of that cost—expensive for fine-tuning, but cheap for obtaining a GPT-5-tier model.

Warmup Sensitivity

Performance depends heavily on proper warmup:

  • Without generator warmup: -1.1 points on LMSYS
  • Without discriminator warmup: -2.3 points on out-of-distribution tests

The warmup stage requires careful tuning that may not transfer across model families.

Teacher Quality Bound

GAD primarily aims to match the teacher’s output distribution. Because the discriminator rewards “teacher-like” quality, the student might occasionally produce better individual answers than the teacher’s average—but it’s fundamentally bounded by what the teacher demonstrates. If your teacher has systematic biases or errors, the student inherits them. The discriminator learns “what the teacher does,” not “what is objectively correct.”

When GAD Isn’t the Right Choice

If you have access to the teacher’s logits (white-box access), traditional distillation methods are more efficient. GAD’s value is specifically for closed-source teachers where black-box access is all you have.

Response Length Dynamics

GAD maintains the student’s original response length distribution rather than matching the teacher’s. For applications where teacher-length responses are specifically desired, this behavior may require additional tuning.

Conclusion

GAD demonstrates that effective distillation from closed-source models is possible. By framing the problem as an adversarial game with an evolving discriminator, it overcomes the reward hacking that plagues simpler approaches.

Key Takeaways:

  1. Black-box distillation works: You don’t need teacher logits to transfer knowledge effectively
  2. On-policy is crucial: Co-evolving discriminator prevents reward hacking
  3. Compression bonus: ~2x effective model compression vs standard distillation
  4. Cross-architecture: Works across incompatible tokenizer families
  5. Real improvements: Human evaluators confirm the gains aren’t metric artifacts

For organizations dependent on proprietary API costs, GAD offers a practical path to building competitive internal models. While the training cost is significant, the long-term operational savings and deployment flexibility may justify the investment.


Original paper: arXiv · PDF · HTML

Cite this paper

Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei (2025). Black-Box On-Policy Distillation: Learning from Closed-Source LLMs. arXiv 2025.