arXiv · January 22, 2025
Applies to: Reasoning Applications · Cost-Efficient AI Deployment · Open-Source AI Development · Model Distillation · Enterprise AI

DeepSeek-R1 & V3: The Open-Source Reasoning Revolution

DeepSeek-AI

DeepSeek released two papers that disrupted assumptions about AI development costs. DeepSeek-V3 introduced a 671B-parameter MoE model trained for under $6M using FP8 precision and Multi-head Latent Attention. DeepSeek-R1 then demonstrated that reinforcement learning with simple outcome-based rewards (and, in the R1-Zero variant, no supervised fine-tuning at all) can develop sophisticated reasoning capabilities matching OpenAI o1. The open-source release of both models, including distilled variants from 1.5B to 70B parameters, democratized access to frontier reasoning capabilities.

Categories: Large Language Models · Reasoning Models · Machine Learning
Topics: Reinforcement Learning · Mixture of Experts · Chain-of-Thought · Model Efficiency · Open Source AI

Key Findings

  1. DeepSeek-R1 matches OpenAI o1-1217 on reasoning benchmarks, with reasoning developed primarily through RL
  2. V3 trained on 2,048 H800 GPUs in ~2 months for $5.5M (vs $100M+ for comparable models)
  3. First successful FP8 training at 671B parameter scale
  4. R1-Zero develops reasoning behaviors without any supervised fine-tuning
  5. Distilled 32B model outperforms o1-mini on AIME (72.6% vs 63.6%)
  6. Open weights under MIT license enable commercial use and further distillation


Research Overview

Between late December 2024 and January 2025, the Chinese AI lab DeepSeek released two papers that sent shockwaves through the AI industry. The message was clear: frontier AI capabilities don’t require frontier budgets.

DeepSeek-V3 demonstrated that a 671B parameter model could be trained for approximately $5.5 million—roughly 1/20th of what comparable models were rumored to cost. DeepSeek-R1 then showed that pure reinforcement learning, without expensive supervised fine-tuning on reasoning traces, could match OpenAI’s o1 on mathematical and coding benchmarks.

The Industry Impact

Downloads of DeepSeek models increased nearly 1000% following the R1 release. The papers challenged the assumption that only well-funded Western labs could produce frontier AI, and demonstrated that algorithmic innovation could substitute for raw compute spending.

Most significantly, DeepSeek open-sourced everything: model weights, training recipes, and distilled variants. This wasn’t just a research contribution—it was a democratization of reasoning AI.

Understanding the Model Family: R1-Zero vs R1 vs Distilled

R1-Zero: An experimental model trained with pure RL only (no supervised fine-tuning). Demonstrates emergent reasoning but has quality issues—not for production use.

R1: The production model. Uses “Cold Start” SFT (on curated R1-Zero outputs) followed by large-scale RL training. This is the 671B model that matches o1.

Distilled Models (1.5B–70B): Created via supervised fine-tuning on R1’s reasoning traces—NOT trained with RL. They mimic R1’s reasoning patterns but don’t learn through reinforcement. Many developers mistakenly assume the 32B model uses RL; it doesn’t.

Key Results at a Glance

Model | Parameters | AIME 2024 | MATH-500 | GPQA Diamond | Cost
OpenAI o1-1217 | Unknown | 79.2% | 96.4% | 75.7% | API only
DeepSeek-R1 | 671B (37B active) | 79.8% | 97.3% | 71.5% | Open weights
DeepSeek-R1-Distill-32B | 32B | 72.6% | 94.3% | 62.1% | Open weights
OpenAI o1-mini | Unknown | 63.6% | 90.0% | 60.0% | API only

DeepSeek-R1 matches or exceeds o1 on mathematical reasoning while being completely open-source.

Why DeepSeek Matters

The Cost Revolution

Training large language models was assumed to require hundreds of millions of dollars. DeepSeek-V3 challenged this:

Aspect | Industry Assumption | DeepSeek-V3 Reality
Training cost | $100M+ | ~$5.5M
GPU hours | 10M+ H100 hours | 2.788M H800 hours
Training stability | Requires many restarts | Zero rollbacks
Precision | BF16/FP32 | FP8 (first at scale)

How Did They Do It?

Three innovations drove the cost reduction: (1) FP8 mixed-precision training that doubles effective compute, (2) Multi-head Latent Attention that reduces memory requirements, and (3) an auxiliary-loss-free load balancing strategy for MoE that improves training stability. None of these are “secret sauce”—they’re engineering excellence applied systematically.

The Open-Source Commitment

Unlike proprietary models that offer only API access, DeepSeek released:

  • Full model weights for V3 and R1 (671B parameters)
  • Distilled models from 1.5B to 70B parameters
  • MIT license permitting commercial use and further distillation
  • Detailed technical reports explaining every design decision

This openness enables researchers and companies to build on DeepSeek’s work rather than treating frontier AI as a black box.

DeepSeek-V3: The Foundation

DeepSeek-V3 is the base model that powers R1’s reasoning capabilities. Understanding its architecture explains how DeepSeek achieved such efficiency.

DeepSeek-V3 Architecture: MoE + MLA

671B total parameters, 37B activated per token

Mixture of Experts (MoE)

V3 uses a sparse MoE architecture with 671B total parameters but only 37B activated per token.

What is Mixture of Experts?

Instead of running every input through all parameters, MoE models route each token to a subset of specialized “expert” networks. V3 has 256 routed experts plus 1 shared expert, activating only 8 routed experts per token. This means 95% of the model’s parameters are dormant for any given token, dramatically reducing compute costs while maintaining capacity.
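
To make the routing idea concrete, here is a toy sparse-MoE forward pass in PyTorch. It mirrors the numbers described here (256 routed experts plus a shared expert, 8 active per token), but the gating function, expert architecture, and dimensions are illustrative assumptions rather than V3’s implementation.

```python
import torch
import torch.nn as nn

d_model, n_routed, top_k = 1024, 256, 8

experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_routed)])
shared_expert = nn.Linear(d_model, d_model)
router = nn.Linear(d_model, n_routed, bias=False)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: [tokens, d_model]. Each token is processed by only top_k routed experts."""
    scores = torch.softmax(router(x), dim=-1)          # [tokens, n_routed]
    weight, idx = scores.topk(top_k, dim=-1)           # [tokens, top_k]
    outputs = []
    for t in range(x.shape[0]):                        # naive per-token loop for clarity
        routed = sum(w * experts[int(e)](x[t]) for w, e in zip(weight[t], idx[t]))
        outputs.append(shared_expert(x[t]) + routed)
    return torch.stack(outputs)

y = moe_forward(torch.randn(4, d_model))               # 4 tokens in, 4 token vectors out
```

Real MoE kernels batch tokens by expert instead of looping, but the sparsity pattern is the same: most experts never see a given token.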

V3’s MoE Design:

  • 256 routed experts + 1 shared expert
  • 8 experts activated per token
  • Auxiliary-loss-free load balancing
  • Multi-token prediction objective during training

Why “Auxiliary-Loss-Free” Matters

Standard MoE models suffer from “expert collapse”—the router learns to send all tokens to a few experts, wasting the others. The typical fix is adding an auxiliary loss term that penalizes unbalanced expert usage. But this penalty competes with the main training objective, degrading model quality.

DeepSeek’s approach: instead of penalizing through loss, they dynamically adjust expert biases during training. If an expert is underused, its bias is increased to attract more tokens. This achieves balanced routing without any auxiliary loss term, keeping the model focused solely on its primary objective.
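
Below is a minimal sketch of how such bias-adjusted routing might look, under the assumption that the bias only influences which experts are selected while the gating weights still come from the unbiased scores; the update rate, score function, and shapes are illustrative, not DeepSeek’s exact recipe.

```python
import torch

num_experts, top_k, update_rate = 256, 8, 1e-3
expert_bias = torch.zeros(num_experts)      # adjusted online each step, not learned by gradient

def route(affinity: torch.Tensor):
    """affinity: [tokens, num_experts] router scores for a batch of tokens."""
    biased = affinity + expert_bias                              # bias decides *which* experts win...
    top_idx = biased.topk(top_k, dim=-1).indices                 # [tokens, top_k]
    gate = torch.softmax(affinity.gather(-1, top_idx), dim=-1)   # ...weights use the unbiased scores
    return top_idx, gate

def update_bias(top_idx: torch.Tensor):
    """After a training step, nudge under-loaded experts up and over-loaded experts down."""
    load = torch.bincount(top_idx.flatten(), minlength=num_experts).float()
    expert_bias.add_(update_rate * torch.sign(load.mean() - load))
```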

Multi-head Latent Attention (MLA)

Standard attention requires storing key-value pairs for every token in context, consuming massive memory for long sequences. MLA compresses these before caching:

  1. Compress key and value tensors into a lower-dimensional latent space
  2. Store the compressed representations in KV cache
  3. Decompress back to full dimension during attention computation

This trades one extra matrix multiplication for substantial memory savings, enabling V3’s 128K context window without prohibitive memory requirements.
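
The sketch below shows the compress-then-decompress pattern for the KV cache with made-up dimensions, a single projection per step, and none of the rotary-embedding handling the real MLA design includes; it is meant only to illustrate why the cache shrinks.

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

down_kv = nn.Linear(d_model, d_latent, bias=False)         # compress before caching
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # decompress keys at attention time
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # decompress values at attention time

def cache_new_token(hidden: torch.Tensor, kv_cache: list) -> list:
    """hidden: [batch, d_model] for the newest token. Only the small latent is stored."""
    kv_cache.append(down_kv(hidden))                        # each entry: [batch, d_latent]
    return kv_cache

def expand_cache(kv_cache: list):
    """Rebuild full-width K and V from the cached latents when attention is computed."""
    latent = torch.stack(kv_cache, dim=1)                   # [batch, seq_len, d_latent]
    k = up_k(latent).view(*latent.shape[:2], n_heads, d_head)
    v = up_v(latent).view(*latent.shape[:2], n_heads, d_head)
    return k, v
```

Per cached token, the store holds d_latent numbers instead of 2 × n_heads × d_head, which is where the memory savings come from.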

FP8 Training at Scale

DeepSeek-V3 was the first model to successfully train at 671B parameters using FP8 (8-bit floating point) precision:

Why FP8 Matters

NVIDIA’s Tensor Cores deliver exactly 2x the FLOPS for FP8 compared to FP16. FP8 also halves memory requirements and reduces communication bandwidth in distributed training. The challenge is maintaining training stability—small precision errors can compound catastrophically at scale.

Block-wise Quantization: Instead of a single per-tensor scale, DeepSeek quantizes weights in 128×128 blocks and activations in 1×128 tiles, each with its own scaling factor. This lets outlier activations (which are common in LLMs) be absorbed locally without corrupting the precision of the entire tensor, preserving accuracy while capturing FP8’s efficiency gains.
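
As a rough illustration of the bookkeeping, the sketch below applies per-block scaling to a 2-D weight using 128×128 blocks and the E4M3 dynamic range, then immediately dequantizes. Real FP8 training depends on hardware kernels and DeepSeek’s full recipe (including the 1×128 activation tiles and high-precision accumulation); this only simulates the scaling step.

```python
import torch

FP8_E4M3_MAX = 448.0    # largest finite magnitude representable in FP8 E4M3
BLOCK = 128

def fake_quantize_blockwise(w: torch.Tensor):
    """w: [rows, cols], both divisible by BLOCK (kept simple for the sketch)."""
    rows, cols = w.shape
    blocks = w.reshape(rows // BLOCK, BLOCK, cols // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax                               # one scale per 128x128 block
    q = (blocks * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)   # values now fit the FP8 range
    return (q / scale).reshape(rows, cols), scale             # dequantized view + per-block scales

w = torch.randn(512, 1024)
w_q, scales = fake_quantize_blockwise(w)
print((w - w_q).abs().max())   # ~0 here, since the actual cast/round to FP8 is omitted
```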

Precision Strategy:

  • FP8 for most matrix multiplications
  • BF16/FP32 retained for: embeddings, output head, MoE gating, normalization, attention operators
  • Fine-grained scaling: 1×128 tiles for activations, 128×128 blocks for weights
  • Relative loss error vs BF16 baseline: less than 0.25%

Training Efficiency

Metric | DeepSeek-V3
Pre-training tokens | 14.8 trillion
GPU hours | 2.788M H800 hours
Training time | ~2 months on 2,048 H800s
Training restarts | Zero
Estimated cost | ~$5.5M

The zero-restart training run is remarkable. Most large model training requires multiple restarts due to instabilities, loss spikes, or hardware failures.

DeepSeek-R1: Pure RL Reasoning

While V3 provides the efficient foundation, R1 demonstrates the breakthrough in reasoning. The key insight: you don’t need supervised fine-tuning on reasoning traces to develop reasoning capabilities.

R1-Zero: The RL-Only Experiment

DeepSeek first trained R1-Zero using pure reinforcement learning from the V3 base model—no supervised fine-tuning whatsoever.

Why This Matters

OpenAI’s o1 is widely believed to have been trained on proprietary reasoning traces, step-by-step solutions that demonstrate “how to think.” Creating such data is expensive and requires expert annotation. R1-Zero shows that RL alone, with only outcome-based rewards (correct/incorrect), can develop sophisticated reasoning behaviors emergently.

R1-Zero Emergent Behaviors:

  • Extended thinking chains (solving problems step-by-step)
  • Self-verification (checking intermediate results)
  • Reflection (reconsidering approaches when stuck)
  • Exploration of multiple solution paths

R1-Zero Limitations:

  • Language mixing (switching languages mid-response)
  • Readability issues (repetitive or poorly formatted outputs)
  • Inconsistent reasoning quality

These limitations motivated the full R1 training pipeline.

DeepSeek-R1: 4-Stage Training Pipeline

From base model to frontier reasoning via pure reinforcement learning

The Training Recipe

DeepSeek-R1 uses a 4-stage training pipeline that builds on the R1-Zero insights while addressing its limitations:

Stage 1: Cold-Start SFT

Before RL training, R1 receives supervised fine-tuning on a small set of high-quality reasoning examples. These include:

  • Synthetic reasoning traces generated by R1-Zero (filtered for quality)
  • Human-annotated solutions for challenging problems
  • Examples demonstrating proper formatting (<think> and <answer> tags)

This “cold start” gives the model a reasonable starting point for RL, avoiding the chaotic early training of R1-Zero.

Stage 2: Large-Scale RL

The core training phase uses Group Relative Policy Optimization (GRPO) on verifiable problems:

What is GRPO?

GRPO is a variant of PPO (Proximal Policy Optimization) that estimates advantages from Monte Carlo rollouts rather than a learned value function: for each prompt, a group of responses is sampled, and each response is scored relative to the mean and spread of its group. Dropping the separate critic/value network simplifies training infrastructure while maintaining RL effectiveness.
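
As a sketch, the group-relative advantage amounts to normalizing each response’s reward against the statistics of its own group; the PPO-style clipped policy update that consumes these advantages is omitted here.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: [group_size] scalar rewards for responses sampled from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 sampled solutions to the same problem, two of them judged correct.
rewards = torch.tensor([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))   # correct answers get positive advantage
```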

Reward Structure:

  • Accuracy rewards: Binary signal for correct/incorrect answers (for math, code with test cases)
  • Format rewards: Compliance with <think>...</think> and <answer>...</answer> structure
  • Language consistency: Penalties for switching languages mid-response

RL training continues until convergence on the verifiable problem set.
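
Below is a hedged sketch of what an outcome-plus-format reward in this spirit could look like. The tag pattern, exact answer comparison, and reward weights are illustrative assumptions, not DeepSeek’s reward implementation (which, for code, would run test cases rather than string matching).

```python
import re

TAGS = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def reward(response: str, ground_truth: str) -> float:
    match = TAGS.search(response)
    if match is None:
        return 0.0                                   # no format reward without the tags
    answer = match.group(2).strip()
    accuracy = 1.0 if answer == ground_truth.strip() else 0.0
    return 0.1 + accuracy                            # small format bonus plus binary accuracy
```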

Stage 3: Rejection Sampling

After RL, DeepSeek generates many candidate responses and filters for quality:

  • 75% reasoning-heavy problems (math, code, logic)
  • 25% general queries (writing, knowledge, conversation)

This creates a high-quality dataset for the final training stage.

Stage 4: Final RL with Preference Tuning

The last stage combines:

  • Continued RL on verifiable problems
  • Preference-based training (reward model) for non-verifiable outputs
  • Alignment with human preferences for helpfulness and safety

Why RL Beats SFT for Reasoning

A key finding from DeepSeek’s research: RL generalizes better than SFT for reasoning tasks.

Training Method | In-Distribution | Out-of-Distribution
Supervised Fine-Tuning | High accuracy | Poor generalization
Reinforcement Learning | High accuracy | Strong generalization

SFT essentially memorizes reasoning patterns from training data. RL, by optimizing for outcomes (correct answers), develops more robust reasoning strategies that transfer to novel problems.

Benchmark Results

DeepSeek-R1 vs OpenAI o1: Benchmark Comparison

Performance on reasoning benchmarks (% accuracy, Codeforces normalized to 100)

Mathematical Reasoning

Benchmark | DeepSeek-R1 | OpenAI o1-1217 | o1-mini
AIME 2024 | 79.8% | 79.2% | 63.6%
MATH-500 | 97.3% | 96.4% | 90.0%
CNMO 2024 | 78.8% | 84.6% | -

DeepSeek-R1 matches or exceeds o1 on most mathematical benchmarks, with slight advantages on AIME and MATH-500.

Coding

Benchmark | DeepSeek-R1 | OpenAI o1-1217
Codeforces Rating | 2,029 | 2,061
LiveCodeBench | 65.9% | -
SWE-bench Verified | 49.2% | 48.9%

Competitive coding performance, with R1 slightly behind on Codeforces but ahead on SWE-bench.

General Reasoning

Benchmark | DeepSeek-R1 | OpenAI o1-1217
GPQA Diamond | 71.5% | 75.7%
MMLU | 90.8% | 91.8%

o1 maintains an edge on general knowledge benchmarks, suggesting DeepSeek’s training emphasized mathematical and coding reasoning.

Distilled Models

DeepSeek released six distilled models that bring R1’s reasoning to smaller, more deployable architectures:

Model | Base | Parameters | AIME 2024 | MATH-500
R1-Distill-Qwen-1.5B | Qwen 2.5 | 1.5B | 28.9% | 83.9%
R1-Distill-Qwen-7B | Qwen 2.5 | 7B | 55.5% | 92.8%
R1-Distill-Qwen-14B | Qwen 2.5 | 14B | 69.7% | 93.9%
R1-Distill-Qwen-32B | Qwen 2.5 | 32B | 72.6% | 94.3%
R1-Distill-Llama-8B | Llama 3.1 | 8B | 50.4% | 89.1%
R1-Distill-Llama-70B | Llama 3.3 | 70B | 70.0% | 94.5%

The 32B Sweet Spot

R1-Distill-Qwen-32B is particularly notable: it outperforms o1-mini (72.6% vs 63.6% on AIME) while being fully open-source and runnable on a single high-end GPU (or on consumer hardware with quantization). This model represents the best reasoning-per-dollar available in January 2025.

Distillation Methodology

The distilled models were created by fine-tuning existing open-source models (Qwen, Llama) on R1’s reasoning outputs. This transfers R1’s reasoning patterns to smaller architectures without requiring the full RL training pipeline.
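
As an illustration, the sketch below converts hypothetical R1-generated records into chat-style SFT examples for a smaller model; the file names and field names ("question", "reasoning", "answer") are assumptions, not DeepSeek’s actual pipeline.

```python
import json

def to_sft_example(record: dict) -> dict:
    """Wrap an R1 trace in the <think>/<answer> format the student model should imitate."""
    target = (
        f"<think>\n{record['reasoning']}\n</think>\n"
        f"<answer>\n{record['answer']}\n</answer>"
    )
    return {"messages": [
        {"role": "user", "content": record["question"]},
        {"role": "assistant", "content": target},
    ]}

with open("r1_traces.jsonl") as src, open("sft_data.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(to_sft_example(json.loads(line))) + "\n")
```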

License Implications:

  • DeepSeek-R1 and V3 are MIT licensed—one of the most permissive licenses available
  • Distilled Qwen models inherit Apache 2.0 license
  • Distilled Llama models inherit respective Llama licenses

MIT License: Why This Matters for Distillation

The MIT license explicitly permits using R1’s outputs (reasoning traces) to train your own models—commercially, without attribution requirements beyond the license file. This is remarkably permissive compared to “open-weight” models with restrictive licenses that prohibit commercial distillation.

Practically: you can legally fine-tune a small model on R1’s reasoning traces and deploy it in your product. Many “open” models don’t allow this. DeepSeek does.

Business Implications

DeepSeek’s release has significant ramifications across the AI industry.

For AI Startups

Reasoning Capabilities Democratized: You no longer need OpenAI API access (and associated costs) for frontier reasoning. Deploy R1-Distill-32B on your own infrastructure for predictable costs.
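
For example (one common deployment pattern, not something DeepSeek prescribes), you might serve a distilled checkpoint behind an OpenAI-compatible endpoint such as vLLM and call it like any hosted model; the URL, API key, and sampling settings below are placeholders.

```python
from openai import OpenAI

# Assumes something like: vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}],
    temperature=0.6,
)
print(resp.choices[0].message.content)   # typically includes <think> reasoning before the answer
```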

Competitive Positioning: Startups can now offer reasoning-powered features without margin compression from API costs. The playing field with well-funded competitors has leveled.

Build vs Buy Calculus: With open weights, you can fine-tune for your specific domain. A legal tech startup can specialize R1-Distill for contract analysis without starting from scratch.

For Enterprise AI Teams

Cost Predictability: API-based reasoning models have unpredictable costs that scale with usage. Self-hosted R1 variants offer fixed infrastructure costs regardless of query volume.

Data Privacy: Reasoning over sensitive documents (legal, medical, financial) without sending data to external APIs. On-premise deployment eliminates data residency concerns.

Model Customization: Fine-tune on proprietary data to improve domain-specific reasoning. The open weights make this possible in ways that API-only models don’t allow.

For AI Researchers

Reproducible Baselines: The detailed technical reports and open weights enable genuine scientific comparison. Claims about training methodologies can be verified.

RL Research Acceleration: The demonstration that pure RL develops reasoning opens new research directions. The GRPO methodology and reward structures are documented for replication.

Efficient Training Recipes: FP8 training, MLA, and auxiliary-loss-free MoE are now proven techniques that others can adopt.

For Incumbent AI Labs

Moat Erosion: The cost and capability gap between frontier labs and the rest of the industry narrowed significantly. Proprietary advantages in training efficiency are now known.

Open-Source Competition: Models that were API-only now face open-source alternatives at comparable quality. Pricing power diminishes when alternatives exist.

Strategic Response Required: Labs may need to compete on factors other than raw capability: reliability, safety guarantees, enterprise support, ecosystem integration.

For Hardware Vendors

FP8 Validation: DeepSeek’s success with FP8 training validates NVIDIA’s investment in low-precision compute. Demand for FP8-capable hardware may accelerate.

Memory Bandwidth Over Compute: MLA’s KV cache compression shifts the inference bottleneck. Standard transformers are often memory-bound during inference—waiting on memory bandwidth to fetch KV cache entries. MLA reduces KV cache size by 5-10×, meaning hardware with high memory bandwidth (like NVIDIA’s H100/H200) becomes even more valuable for long-context inference. Edge deployment hardware should prioritize memory bandwidth, not just FLOPS.

Efficiency Over Scale: The cost reduction demonstrates that algorithmic innovation can substitute for hardware scaling. This may affect demand projections for massive GPU clusters.

Limitations

Reasoning Scope

DeepSeek-R1 excels at verifiable problems (math, code) but shows less improvement on tasks requiring:

  • Nuanced judgment without clear right answers
  • Long-horizon planning with uncertain outcomes
  • Creative tasks where “correctness” is subjective

Language Mixing

Despite mitigation efforts, R1 occasionally switches languages mid-response, particularly when the prompt language differs from the dominant language in training data.

Context Length Trade-offs

While V3 supports 128K context, reasoning quality degrades on very long problems. The RL training focused on problems that fit in shorter contexts.

Small Model Limitations

The distilled models, while impressive, don’t fully capture R1’s capabilities. The paper notes that “smaller models struggle with RL-only reasoning training,” suggesting a minimum capability threshold for effective reasoning development.

Safety Considerations

As an open-weights model, R1 can be fine-tuned to remove safety guardrails. Organizations deploying R1 variants must implement their own safety measures rather than relying on provider-side controls.

Conclusion

DeepSeek-R1 and V3 represent a watershed moment in AI development. The combination of frontier capabilities, radical cost efficiency, and open-source release challenges assumptions about who can build advanced AI and at what cost.

Key Takeaways:

  1. Pure RL works: Reasoning capabilities can emerge from reinforcement learning without expensive supervised fine-tuning on reasoning traces

  2. Efficiency beats scale: Algorithmic innovation (FP8, MLA, auxiliary-loss-free MoE) can achieve comparable results at 1/20th the cost

  3. Open-source is competitive: The best open reasoning model now matches the best proprietary one

  4. Distillation democratizes: The 32B distilled model brings frontier reasoning to consumer hardware

  5. The moat is narrowing: Proprietary advantages in AI capability are increasingly difficult to maintain

For practitioners, the immediate implication is clear: frontier reasoning is now accessible without frontier budgets. The question is no longer whether you can afford reasoning AI, but how you’ll apply it.


Original paper: arXiv | PDF | HTML

DeepSeek-V3: arXiv:2412.19437

Code & Weights: GitHub | Hugging Face

Cite this paper

DeepSeek-AI (2025). DeepSeek-R1 & V3: The Open-Source Reasoning Revolution. arXiv 2025.