Research Overview
In late December 2024 and January 2025, a Chinese AI lab called DeepSeek released two papers that sent shockwaves through the AI industry. The message was clear: frontier AI capabilities don’t require frontier budgets.
DeepSeek-V3 demonstrated that a 671B parameter model could be trained for approximately $5.5 million—roughly 1/20th of what comparable models were rumored to cost. DeepSeek-R1 then showed that pure reinforcement learning, without expensive supervised fine-tuning on reasoning traces, could match OpenAI’s o1 on mathematical and coding benchmarks.
Downloads of DeepSeek models increased nearly 1000% following the R1 release. The papers challenged the assumption that only well-funded Western labs could produce frontier AI, and demonstrated that algorithmic innovation could substitute for raw compute spending.
Most significantly, DeepSeek open-sourced everything: model weights, training recipes, and distilled variants. This wasn’t just a research contribution—it was a democratization of reasoning AI.
The release spans three model tiers:
- R1-Zero: An experimental model trained with pure RL only (no supervised fine-tuning). It demonstrates emergent reasoning but has quality issues—not for production use.
- R1: The production model. It uses “Cold Start” SFT (on curated R1-Zero outputs) followed by large-scale RL training. This is the 671B model that matches o1.
- Distilled models (1.5B–70B): Created via supervised fine-tuning on R1’s reasoning traces—NOT trained with RL. They mimic R1’s reasoning patterns but don’t learn through reinforcement. Many developers mistakenly assume the 32B model uses RL; it doesn’t.
Key Results at a Glance
| Model | Parameters | AIME 2024 | MATH-500 | GPQA Diamond | Cost |
|---|---|---|---|---|---|
| OpenAI o1-1217 | Unknown | 79.2% | 96.4% | 75.7% | API only |
| DeepSeek-R1 | 671B (37B active) | 79.8% | 97.3% | 71.5% | Open weights |
| DeepSeek-R1-Distill-32B | 32B | 72.6% | 94.3% | 62.1% | Open weights |
| OpenAI o1-mini | Unknown | 63.6% | 90.0% | 60.0% | API only |
DeepSeek-R1 matches or exceeds o1 on mathematical reasoning while being completely open-source.
Why DeepSeek Matters
The Cost Revolution
Training large language models was assumed to require hundreds of millions of dollars. DeepSeek-V3 challenged this:
| Aspect | Industry Assumption | DeepSeek-V3 Reality |
|---|---|---|
| Training cost | $100M+ | ~$5.5M |
| GPU hours | 10M+ H100 hours | 2.788M H800 hours |
| Training stability | Requires many restarts | Zero rollbacks |
| Precision | BF16/FP32 | FP8 (first at scale) |
Three innovations drove the cost reduction: (1) FP8 mixed-precision training that doubles effective compute, (2) Multi-head Latent Attention that reduces memory requirements, and (3) an auxiliary-loss-free load balancing strategy for MoE that improves training stability. None of these are “secret sauce”—they’re engineering excellence applied systematically.
The Open-Source Commitment
Unlike proprietary models that offer only API access, DeepSeek released:
- Full model weights for V3 and R1 (671B parameters)
- Distilled models from 1.5B to 70B parameters
- MIT license permitting commercial use and further distillation
- Detailed technical reports explaining every design decision
This openness enables researchers and companies to build on DeepSeek’s work rather than treating frontier AI as a black box.
DeepSeek-V3: The Foundation
DeepSeek-V3 is the base model that powers R1’s reasoning capabilities. Understanding its architecture explains how DeepSeek achieved such efficiency.
[Figure: DeepSeek-V3 architecture, MoE + MLA; 671B total parameters, 37B activated per token]
Mixture of Experts (MoE)
V3 uses a sparse MoE architecture with 671B total parameters but only 37B activated per token.
Instead of running every input through all parameters, MoE models route each token to a subset of specialized “expert” networks. V3 has 256 routed experts plus 1 shared expert, activating only 8 routed experts per token. This means 95% of the model’s parameters are dormant for any given token, dramatically reducing compute costs while maintaining capacity.
V3’s MoE Design:
- 256 routed experts + 1 shared expert
- 8 experts activated per token
- Auxiliary-loss-free load balancing
- Multi-token prediction objective during training
Standard MoE models suffer from “expert collapse”—the router learns to send all tokens to a few experts, wasting the others. The typical fix is adding an auxiliary loss term that penalizes unbalanced expert usage. But this penalty competes with the main training objective, degrading model quality.
DeepSeek’s approach: instead of penalizing through loss, they dynamically adjust expert biases during training. If an expert is underused, its bias is increased to attract more tokens. This achieves balanced routing without any auxiliary loss term, keeping the model focused solely on its primary objective.
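A minimal sketch of how bias-adjusted top-k routing can work (the expert counts mirror V3, but the update rule, BIAS_LR, and tensor shapes are illustrative assumptions, not DeepSeek’s implementation):

```python
import numpy as np

NUM_EXPERTS = 256   # routed experts; V3 also has 1 shared expert that is always active
TOP_K = 8           # routed experts activated per token
BIAS_LR = 1e-3      # hypothetical bias update rate (not from the paper)

rng = np.random.default_rng(0)
router_bias = np.zeros(NUM_EXPERTS)   # adjusted online, never touched by gradients

def route(affinity: np.ndarray) -> np.ndarray:
    """Select top-k experts per token; the bias influences selection only,
    not the weights used to combine expert outputs."""
    topk = np.argsort(-(affinity + router_bias), axis=-1)[:, :TOP_K]
    return topk                        # shape: (num_tokens, TOP_K)

def update_bias(topk: np.ndarray) -> None:
    """Auxiliary-loss-free balancing: nudge underused experts' bias up and
    overused experts' bias down, instead of adding a penalty to the loss."""
    load = np.bincount(topk.ravel(), minlength=NUM_EXPERTS)
    router_bias[:] += BIAS_LR * np.sign(load.mean() - load)

# One toy routing step for a batch of 32 tokens.
affinity = rng.normal(size=(32, NUM_EXPERTS))   # router affinity scores per token
update_bias(route(affinity))
```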
Multi-head Latent Attention (MLA)
Standard attention requires storing key-value pairs for every token in context, consuming massive memory for long sequences. MLA compresses these before caching:
- Compress key and value tensors into a lower-dimensional latent space
- Store the compressed representations in KV cache
- Decompress back to full dimension during attention computation
This trades one extra matrix multiplication for substantial memory savings, enabling V3’s 128K context window without prohibitive memory requirements.
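A rough numpy illustration of the compress-then-decompress idea (dimensions are made up for readability; real MLA also handles rotary position embeddings and query compression, which are omitted here):

```python
import numpy as np

D_MODEL, D_LATENT, N_HEADS, D_HEAD = 1024, 128, 8, 128   # toy dimensions

rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.02, size=(D_MODEL, D_LATENT))           # compression
W_up_k = rng.normal(scale=0.02, size=(D_LATENT, N_HEADS * D_HEAD))  # key expansion
W_up_v = rng.normal(scale=0.02, size=(D_LATENT, N_HEADS * D_HEAD))  # value expansion

def compress_for_cache(hidden: np.ndarray) -> np.ndarray:
    # Only this small latent vector is stored in the KV cache.
    return hidden @ W_down                     # (seq, D_LATENT)

def expand_for_attention(latent_cache: np.ndarray):
    # Keys/values are reconstructed on the fly when attention is computed.
    return latent_cache @ W_up_k, latent_cache @ W_up_v

hidden_states = rng.normal(size=(4096, D_MODEL))        # a 4K-token context
latent = compress_for_cache(hidden_states)
keys, values = expand_for_attention(latent)

# A standard cache stores keys + values for every head; MLA stores one latent.
print((2 * N_HEADS * D_HEAD) / D_LATENT)                # 16x smaller in this toy setup
```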
FP8 Training at Scale
DeepSeek-V3 was the first model to successfully train at 671B parameters using FP8 (8-bit floating point) precision:
NVIDIA’s Tensor Cores deliver exactly 2x the FLOPS for FP8 compared to FP16. FP8 also halves memory requirements and reduces communication bandwidth in distributed training. The challenge is maintaining training stability—small precision errors can compound catastrophically at scale.
Block-wise Quantization: DeepSeek quantizes in fine-grained groups (128×128 blocks for weights, 1×128 tiles for activations) rather than per-tensor. Each group gets its own scaling factor, so outlier activations (which are common in LLMs) are handled without corrupting the entire tensor’s precision. This fine-grained approach maintains accuracy while capturing FP8’s efficiency gains.
Precision Strategy:
- FP8 for most matrix multiplications
- BF16/FP32 retained for: embeddings, output head, MoE gating, normalization, attention operators
- Fine-grained scaling: 1×128 tiles for activations, 128×128 blocks for weights (rather than per-row, per-column, or per-tensor)
- Relative loss error vs BF16 baseline: less than 0.25%
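Numpy has no FP8 type, so the sketch below only simulates the per-block scaling step described above; the actual cast to FP8 and the matching dequantization are omitted, and the block size matches the weight grouping:

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # approximate max magnitude representable in FP8 E4M3
BLOCK = 128            # block size used for weights, per the description above

def quantize_blockwise(weights: np.ndarray):
    """Simulate 128x128 block-wise scaling: each block gets its own scale,
    so a single outlier only degrades its own block, not the whole tensor."""
    h, w = weights.shape
    scales = np.empty((h // BLOCK, w // BLOCK))
    scaled = np.empty_like(weights)
    for i in range(0, h, BLOCK):
        for j in range(0, w, BLOCK):
            block = weights[i:i + BLOCK, j:j + BLOCK]
            scale = np.abs(block).max() / FP8_E4M3_MAX + 1e-12
            scales[i // BLOCK, j // BLOCK] = scale
            # On real hardware this result would be cast to FP8 here.
            scaled[i:i + BLOCK, j:j + BLOCK] = block / scale
    return scaled, scales

w = np.random.default_rng(0).normal(size=(512, 512))
w[0, 0] = 50.0                          # one outlier value
q, s = quantize_blockwise(w)
print(np.abs(q).max() <= FP8_E4M3_MAX)  # True: every block fits the FP8 range
```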
Training Efficiency
| Metric | DeepSeek-V3 |
|---|---|
| Pre-training tokens | 14.8 trillion |
| GPU hours | 2.788M H800 hours |
| Training time | ~2 months on 2,048 H800s |
| Training restarts | Zero |
| Estimated cost | ~$5.5M |
The zero-restart training run is remarkable. Most large model training requires multiple restarts due to instabilities, loss spikes, or hardware failures.
DeepSeek-R1: Pure RL Reasoning
While V3 provides the efficient foundation, R1 demonstrates the breakthrough in reasoning. The key insight: you don’t need supervised fine-tuning on reasoning traces to develop reasoning capabilities.
R1-Zero: The RL-Only Experiment
DeepSeek first trained R1-Zero using pure reinforcement learning from the V3 base model—no supervised fine-tuning whatsoever.
OpenAI’s o1 is widely believed to have been trained on proprietary reasoning traces—step-by-step solutions that demonstrate “how to think.” Creating such data is expensive and requires expert annotation. R1-Zero shows that RL alone, with only outcome-based rewards (correct/incorrect), can develop sophisticated reasoning behaviors emergently.
R1-Zero Emergent Behaviors:
- Extended thinking chains (solving problems step-by-step)
- Self-verification (checking intermediate results)
- Reflection (reconsidering approaches when stuck)
- Exploration of multiple solution paths
R1-Zero Limitations:
- Language mixing (switching languages mid-response)
- Readability issues (repetitive or poorly formatted outputs)
- Inconsistent reasoning quality
These limitations motivated the full R1 training pipeline.
[Figure: DeepSeek-R1 4-stage training pipeline, from base model to frontier reasoning via pure reinforcement learning]
The Training Recipe
DeepSeek-R1 uses a 4-stage training pipeline that builds on the R1-Zero insights while addressing its limitations:
Stage 1: Cold-Start SFT
Before RL training, R1 receives supervised fine-tuning on a small set of high-quality reasoning examples. These include:
- Synthetic reasoning traces generated by R1-Zero (filtered for quality)
- Human-annotated solutions for challenging problems
- Examples demonstrating proper formatting (<think> and <answer> tags)
This “cold start” gives the model a reasonable starting point for RL, avoiding the chaotic early training of R1-Zero.
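For concreteness, a single cold-start example in that format might look like this (hypothetical content; only the tag convention comes from the paper):

```python
# Hypothetical cold-start SFT example; only the <think>/<answer> tag
# convention comes from the paper, the content is illustrative.
COLD_START_EXAMPLE = {
    "prompt": "What is 17 * 24?",
    "response": (
        "<think>\n"
        "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.\n"
        "Check: 408 / 24 = 17, so the result is consistent.\n"
        "</think>\n"
        "<answer>408</answer>"
    ),
}
```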
Stage 2: Large-Scale RL
The core training phase uses Group Relative Policy Optimization (GRPO) on verifiable problems:
GRPO is a variant of PPO (Proximal Policy Optimization) that calculates advantages using Monte Carlo estimates from a group of sampled responses. Instead of training a separate critic/value network, GRPO evaluates each response relative to its peers in the same batch. This simplifies training infrastructure while maintaining RL effectiveness.
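A minimal sketch of the group-relative advantage computation (group size and rewards are illustrative; the PPO-style clipped policy update and KL penalty that sit on top are omitted):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """GRPO advantage: score each sampled response relative to the other
    responses drawn for the same prompt, replacing a learned value network."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

# e.g. 8 responses sampled for one math prompt, rewarded 1.0 when the final
# answer is correct and 0.0 otherwise
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(grpo_advantages(rewards))   # correct responses get positive advantage
```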
Reward Structure:
- Accuracy rewards: Binary signal for correct/incorrect answers (for math, code with test cases)
- Format rewards: Compliance with the <think>...</think> and <answer>...</answer> structure
- Language consistency: Penalties for switching languages mid-response
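A hedged sketch of what a rule-based reward along these lines could look like (the regex, the exact-match accuracy check, and the 0.1 format weight are placeholder assumptions; the language-consistency penalty is omitted):

```python
import re

# Require the <think>...</think><answer>...</answer> structure and capture the answer.
PATTERN = re.compile(r"<think>.*</think>\s*<answer>(.*)</answer>", re.DOTALL)

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Combine a format check with a verifiable accuracy check."""
    match = PATTERN.search(response)
    format_ok = 1.0 if match else 0.0
    answer = match.group(1).strip() if match else ""
    accuracy = 1.0 if answer == reference_answer else 0.0
    return accuracy + 0.1 * format_ok        # placeholder weighting

print(rule_based_reward("<think>17 * 24 = 408</think><answer>408</answer>", "408"))  # 1.1
```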
RL training continues until convergence on the verifiable problem set.
Stage 3: Rejection Sampling
After RL, DeepSeek generates many candidate responses and filters for quality:
- 75% reasoning-heavy problems (math, code, logic)
- 25% general queries (writing, knowledge, conversation)
This creates a high-quality dataset for the final training stage.
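In outline, the filtering step could look like this (generate, passes_filter, and n_candidates are hypothetical stand-ins; the 75/25 mix above would be applied when assembling the prompt set):

```python
import random

def rejection_sample(prompts, generate, passes_filter, n_candidates=16):
    """For each prompt, sample several responses from the RL-trained model
    and keep only those that pass the quality/correctness filter."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        good = [c for c in candidates if passes_filter(prompt, c)]
        if good:
            kept.append({"prompt": prompt, "response": random.choice(good)})
    return kept
```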
Stage 4: Final RL with Preference Tuning
The last stage combines:
- Continued RL on verifiable problems
- Preference-based training (reward model) for non-verifiable outputs
- Alignment with human preferences for helpfulness and safety
Why RL Beats SFT for Reasoning
A key finding from DeepSeek’s research: RL generalizes better than SFT for reasoning tasks.
| Training Method | In-Distribution | Out-of-Distribution |
|---|---|---|
| Supervised Fine-Tuning | High accuracy | Poor generalization |
| Reinforcement Learning | High accuracy | Strong generalization |
SFT essentially memorizes reasoning patterns from training data. RL, by optimizing for outcomes (correct answers), develops more robust reasoning strategies that transfer to novel problems.
Benchmark Results
[Figure: DeepSeek-R1 vs OpenAI o1 benchmark comparison; % accuracy, Codeforces normalized to 100]
Mathematical Reasoning
| Benchmark | DeepSeek-R1 | OpenAI o1-1217 | o1-mini |
|---|---|---|---|
| AIME 2024 | 79.8% | 79.2% | 63.6% |
| MATH-500 | 97.3% | 96.4% | 90.0% |
| CNMO 2024 | 78.8% | 84.6% | - |
DeepSeek-R1 matches or exceeds o1 on most mathematical benchmarks, with slight advantages on AIME and MATH-500.
Coding
| Benchmark | DeepSeek-R1 | OpenAI o1-1217 |
|---|---|---|
| Codeforces Rating | 2,029 | 2,061 |
| LiveCodeBench | 65.9% | - |
| SWE-bench Verified | 49.2% | 48.9% |
Competitive coding performance, with R1 slightly behind on Codeforces but ahead on SWE-bench.
General Reasoning
| Benchmark | DeepSeek-R1 | OpenAI o1-1217 |
|---|---|---|
| GPQA Diamond | 71.5% | 75.7% |
| MMLU | 90.8% | 91.8% |
OpenAI’s o1 maintains an edge on general knowledge benchmarks, suggesting DeepSeek’s training emphasized mathematical and coding reasoning.
Distilled Models
DeepSeek released six distilled models that bring R1’s reasoning to smaller, more deployable architectures:
| Model | Base | Parameters | AIME 2024 | MATH-500 |
|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen 2.5 | 1.5B | 28.9% | 83.9% |
| R1-Distill-Qwen-7B | Qwen 2.5 | 7B | 55.5% | 92.8% |
| R1-Distill-Qwen-14B | Qwen 2.5 | 14B | 69.7% | 93.9% |
| R1-Distill-Qwen-32B | Qwen 2.5 | 32B | 72.6% | 94.3% |
| R1-Distill-Llama-8B | Llama 3.1 | 8B | 50.4% | 89.1% |
| R1-Distill-Llama-70B | Llama 3.3 | 70B | 70.0% | 94.5% |
R1-Distill-Qwen-32B is particularly notable: it outperforms o1-mini (72.6% vs 63.6% on AIME) while being fully open-source and runnable on consumer hardware. This model represents the best reasoning-per-dollar available in January 2025.
Distillation Methodology
The distilled models were created by fine-tuning existing open-source models (Qwen, Llama) on R1’s reasoning outputs. This transfers R1’s reasoning patterns to smaller architectures without requiring the full RL training pipeline.
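A sketch of how R1 reasoning traces might be packaged into an SFT dataset for a student model (the chat-message schema and field names are assumptions; any fine-tuning framework that accepts prompt/response pairs would work):

```python
import json

def distillation_record(question: str, r1_trace: str, r1_answer: str) -> dict:
    """One SFT example: the student learns to reproduce the teacher's full
    reasoning trace, not just its final answer."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant",
             "content": f"<think>\n{r1_trace}\n</think>\n<answer>{r1_answer}</answer>"},
        ]
    }

record = distillation_record(
    "Solve x^2 - 5x + 6 = 0.",
    "Factor: (x - 2)(x - 3) = 0, so x = 2 or x = 3.",
    "x = 2 or x = 3",
)
print(json.dumps(record, indent=2))
```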
License Implications:
- DeepSeek-R1 and V3 are MIT licensed—one of the most permissive licenses available
- Distilled Qwen models inherit Apache 2.0 license
- Distilled Llama models inherit respective Llama licenses
The MIT license explicitly permits using R1’s outputs (reasoning traces) to train your own models—commercially, without attribution requirements beyond the license file. This is remarkably permissive compared to “open-weight” models with restrictive licenses that prohibit commercial distillation.
Practically: you can legally fine-tune a small model on R1’s reasoning traces and deploy it in your product. Many “open” models don’t allow this. DeepSeek does.
Business Implications
DeepSeek’s release has significant ramifications across the AI industry.
For AI Startups
Reasoning Capabilities Democratized: You no longer need OpenAI API access (and associated costs) for frontier reasoning. Deploy R1-Distill-32B on your own infrastructure for predictable costs.
Competitive Positioning: Startups can now offer reasoning-powered features without margin compression from API costs. The playing field with well-funded competitors has leveled.
Build vs Buy Calculus: With open weights, you can fine-tune for your specific domain. A legal tech startup can specialize R1-Distill for contract analysis without starting from scratch.
For Enterprise AI Teams
Cost Predictability: API-based reasoning models have unpredictable costs that scale with usage. Self-hosted R1 variants offer fixed infrastructure costs regardless of query volume.
Data Privacy: Reasoning over sensitive documents (legal, medical, financial) without sending data to external APIs. On-premise deployment eliminates data residency concerns.
Model Customization: Fine-tune on proprietary data to improve domain-specific reasoning. The open weights make this possible in ways that API-only models don’t allow.
For AI Researchers
Reproducible Baselines: The detailed technical reports and open weights enable genuine scientific comparison. Claims about training methodologies can be verified.
RL Research Acceleration: The demonstration that pure RL develops reasoning opens new research directions. The GRPO methodology and reward structures are documented for replication.
Efficient Training Recipes: FP8 training, MLA, and auxiliary-loss-free MoE are now proven techniques that others can adopt.
For Incumbent AI Labs
Moat Erosion: The cost and capability gap between frontier labs and the rest of the industry narrowed significantly. Proprietary advantages in training efficiency are now known.
Open-Source Competition: Models that were API-only now face open-source alternatives at comparable quality. Pricing power diminishes when alternatives exist.
Strategic Response Required: Labs may need to compete on factors other than raw capability: reliability, safety guarantees, enterprise support, ecosystem integration.
For Hardware Vendors
FP8 Validation: DeepSeek’s success with FP8 training validates NVIDIA’s investment in low-precision compute. Demand for FP8-capable hardware may accelerate.
Memory Bandwidth Over Compute: MLA’s KV cache compression shifts the inference bottleneck. Standard transformers are often memory-bound during inference—waiting on memory bandwidth to fetch KV cache entries. MLA reduces KV cache size by 5-10×, meaning hardware with high memory bandwidth (like NVIDIA’s H100/H200) becomes even more valuable for long-context inference. Edge deployment hardware should prioritize memory bandwidth, not just FLOPS.
Efficiency Over Scale: The cost reduction demonstrates that algorithmic innovation can substitute for hardware scaling. This may affect demand projections for massive GPU clusters.
Limitations
Reasoning Scope
DeepSeek-R1 excels at verifiable problems (math, code) but shows less improvement on tasks requiring:
- Nuanced judgment without clear right answers
- Long-horizon planning with uncertain outcomes
- Creative tasks where “correctness” is subjective
Language Mixing
Despite mitigation efforts, R1 occasionally switches languages mid-response, particularly when the prompt language differs from the dominant language in training data.
Context Length Trade-offs
While V3 supports 128K context, reasoning quality degrades on very long problems. The RL training focused on problems that fit in shorter contexts.
Small Model Limitations
The distilled models, while impressive, don’t fully capture R1’s capabilities. The paper notes that “smaller models struggle with RL-only reasoning training,” suggesting a minimum capability threshold for effective reasoning development.
Safety Considerations
As an open-weights model, R1 can be fine-tuned to remove safety guardrails. Organizations deploying R1 variants must implement their own safety measures rather than relying on provider-side controls.
Conclusion
DeepSeek-R1 and V3 represent a watershed moment in AI development. The combination of frontier capabilities, radical cost efficiency, and open-source release challenges assumptions about who can build advanced AI and at what cost.
Key Takeaways:
- Pure RL works: Reasoning capabilities can emerge from reinforcement learning without expensive supervised fine-tuning on reasoning traces
- Efficiency beats scale: Algorithmic innovation (FP8, MLA, auxiliary-loss-free MoE) can achieve comparable results at 1/20th the cost
- Open-source is competitive: The best open reasoning model now matches the best proprietary one
- Distillation democratizes: The 32B distilled model brings frontier reasoning to consumer hardware
- The moat is narrowing: Proprietary advantages in AI capability are increasingly difficult to maintain
For practitioners, the immediate implication is clear: frontier reasoning is now accessible without frontier budgets. The question is no longer whether you can afford reasoning AI, but how you’ll apply it.
Original paper (DeepSeek-R1): arXiv
DeepSeek-V3: arXiv:2412.19437
Code & Weights: GitHub | Hugging Face