- Simplicity Wins: Kimi k1.5 matches OpenAI o1 on math (77.5 AIME) and coding (94th percentile Codeforces) without MCTS, value functions, or process reward models — just policy optimization with length penalties
- 128K Context Matters: Extended context lets models "think longer" on hard problems. The short-CoT variant achieves 60.8 AIME with only ~3,000 tokens — 550% better than GPT-4o
- The Implication: Frontier reasoning doesn't require frontier complexity. Teams can achieve o1-level performance with simpler, more reproducible training methods
Research Overview
OpenAI's o1 model demonstrated that letting language models "think longer" through chain-of-thought reasoning dramatically improves performance on hard problems. But o1 remained closed, leaving researchers to speculate about what techniques made it work. Many assumed complex methods like Monte Carlo tree search (MCTS), learned value functions, or process reward models were essential.
Kimi k1.5 challenges this assumption. Moonshot AI's team achieved o1-level performance using a surprisingly simple approach: standard policy optimization with careful length management. No tree search. No value networks. No step-by-step reward models.
Complex RL techniques like MCTS and value functions were designed for games with clear state spaces (chess, Go). Language reasoning is different—the "state" is a partial text response, and the action space is essentially infinite. Kimi k1.5 shows that simpler methods work better when adapted properly for language.
Key Results at a Glance
| Benchmark | Kimi k1.5 | OpenAI o1 | What It Tests |
|---|---|---|---|
| AIME 2024 | 77.5 | 74.4 | Competition math (hardest) |
| MATH-500 | 96.2 | 94.8 | Mathematical reasoning |
| Codeforces | 94th %ile | 94th %ile | Competitive programming |
| MathVista | 74.9 | 71.0 | Visual math reasoning |
| LiveCodeBench | 62.5 | 67.2 | Real-world coding |
Figure: Long-CoT benchmark comparison, Kimi k1.5 vs OpenAI o1. Both models achieve similar performance, but k1.5 uses simpler training methods.
Why This Matters
The Reasoning Revolution
Traditional language models predict the next token based on patterns in training data. They're essentially very sophisticated pattern matchers. But hard problems—competition math, complex code, scientific reasoning—require more than pattern matching. They require thinking.
Instead of jumping directly to an answer, chain-of-thought prompts models to "show their work"—writing out intermediate reasoning steps. This simple technique dramatically improves accuracy on complex problems because each step can be checked and corrected before proceeding.
The challenge is training models to produce good chains of thought. Reinforcement learning offers a path: reward correct final answers, let the model learn what reasoning patterns lead to success. But RL for language is notoriously unstable and sample-inefficient.
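In verifiable domains, "reward correct final answers" can be as literal as the sketch below: extract a final answer, compare it to ground truth, and return a scalar reward. This illustrates the general recipe rather than Kimi's actual verifier (the paper relies on test execution for code and a trained reward model for harder math), and the `Answer:` convention is an assumption.

```python
import re

def outcome_reward(response: str, ground_truth: str) -> float:
    """Score only the final answer, not the intermediate reasoning steps.

    Illustrative sketch: assumes the model ends its chain of thought with
    a line of the form "Answer: <value>".
    """
    match = re.search(r"Answer:\s*(.+)", response)
    if match is None:
        return -1.0                                   # no parseable final answer
    predicted = match.group(1).strip()
    return 1.0 if predicted == ground_truth else -1.0
```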
The Complexity Trap
After o1's release, many teams rushed to replicate it using increasingly complex techniques:
- Monte Carlo Tree Search: Explore multiple reasoning paths, backtrack when stuck
- Value Functions: Learn to predict success probability at each step
- Process Reward Models: Score intermediate reasoning steps, not just final answers
- Beam Search with Verification: Generate multiple candidates, filter with trained verifiers
Kimi k1.5 shows these additions may be unnecessary overhead. The team achieved comparable results with just:
- Policy optimization (standard RL)
- Length penalties (prevent overthinking)
- Long context training (let the model think deeply when needed)
The Simplicity Principle
Policy Optimization Without Extras
The Kimi k1.5 training framework uses online mirror descent with KL regularization—a standard RL algorithm. For each problem, the model generates multiple response attempts. Correct answers receive positive reward; incorrect answers receive negative reward.
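A minimal PyTorch sketch of this kind of update is shown below. It is not the paper's exact mirror-descent surrogate, but it keeps the same ingredients: a handful of sampled responses per problem, outcome rewards with a mean-reward baseline in place of a learned value function, and a KL penalty that keeps the updated policy close to the sampling policy. The function name, tensor layout, and `tau` coefficient are illustrative.

```python
import torch

def k15_style_loss(logp_new: torch.Tensor,   # [k] log-prob of each sampled response under the current policy
                   logp_old: torch.Tensor,   # [k] log-prob under the policy that generated the samples
                   rewards: torch.Tensor,    # [k] outcome rewards, e.g. +1 correct / -1 incorrect
                   tau: float = 0.1) -> torch.Tensor:
    """Policy-gradient surrogate with a KL proximity penalty and no value network."""
    advantage = rewards - rewards.mean()                # baseline = mean reward over the k samples
    pg = -(advantage.detach() * logp_new).mean()        # reinforce correct responses, push away from incorrect ones
    log_ratio = logp_new - logp_old.detach()
    kl = (log_ratio.exp() - 1.0 - log_ratio).mean()     # nonnegative estimator of KL(pi_old || pi_new)
    return pg + tau * kl
```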
Figure: The simplicity approach: what k1.5 uses vs. what it removes, achieving o1-level performance with fewer components.
Value functions predict "how likely am I to succeed from this point?" In games like chess, this is useful because you can evaluate positions. In language, the "position" is a partial response, and evaluating it requires essentially solving the whole problem. The Kimi team argues that exploring more responses is more valuable than trying to predict their success.
The key innovations are in the details:
Negative Gradients Matter: Unlike simpler methods that only learn from successes (rejection sampling), k1.5 explicitly penalizes incorrect responses. This "push away from wrong answers" signal accelerates learning.
Curriculum Sampling: Start training on a mix of easy and hard problems. As training progresses, shift focus to harder problems where the model still makes mistakes. This prevents wasting compute on already-solved easy cases.
Length Penalty Warmup: Gradually increase the penalty for overly long responses. If the penalty is too high from the start, the model never learns to reason deeply; if it stays too low, responses ramble.
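The length term has roughly the shape sketched below: within the batch of samples for one problem, shorter responses earn a bonus and longer ones a penalty, while incorrect responses can only be penalized, never rewarded, for brevity. During training this term is blended in gradually (the warmup above); the constants here are illustrative.

```python
def length_rewards(lengths: list[int], correct: list[bool]) -> list[float]:
    """Per-sample length shaping added on top of the outcome reward."""
    lo, hi = min(lengths), max(lengths)
    span = max(hi - lo, 1)                          # guard against all-equal lengths
    shaped = []
    for n, ok in zip(lengths, correct):
        lam = 0.5 - (n - lo) / span                 # +0.5 for the shortest response, -0.5 for the longest
        shaped.append(lam if ok else min(0.0, lam)) # wrong answers never gain a brevity bonus
    return shaped
```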
What They Removed
The paper explicitly notes what they did not use:
| Technique | Status | Reasoning |
|---|---|---|
| Monte Carlo Tree Search | Removed | "Not suitable for our context" |
| Value Functions | Removed | Exploration more valuable than prediction |
| Process Reward Models | Removed | End-to-end learning sufficient |
| Beam Search | Removed | Simple sampling works |
MCTS requires expensive computation during inference—every user query triggers a search through multiple reasoning paths. Kimi's approach moves this cost to training time: invest heavily in learning a better policy upfront, so the model can respond directly without searching. This is a massive difference for deployment: one API call vs. potentially hundreds of internal searches per query.
This simplification isn't just philosophical—it has practical benefits. Fewer components means faster iteration, easier debugging, and more stable training.
Long Context Reasoning
The 128K Token Advantage
Kimi k1.5 scales context to 128,000 tokens—roughly 200 pages of text. This matters because harder problems require longer reasoning chains.
Complex math problems might require dozens of intermediate steps. A model limited to short responses must compress or skip steps, losing accuracy. With 128K tokens, k1.5 can write out every step, check its work, try alternative approaches, and still have room for the final answer.
The paper shows a strong correlation: problems requiring longer reasoning chains benefit most from long context. On AIME (the hardest math benchmark), the average successful response is significantly longer than on easier problems.
Partial Rollouts: Making Long Context Practical
Training with 128K token responses is computationally expensive. A naive approach would be prohibitively slow. The Kimi team developed "partial rollouts" to solve this:
- Set a fixed token budget per training step (e.g., 8K tokens)
- If a response exceeds this budget, pause and save progress
- Resume from the saved checkpoint in the next iteration
- Previous segments are cached and reused, not recomputed
Think of it like CPU time-slicing in operating systems. Just as an OS switches between tasks so no single process blocks the CPU, partial rollouts let GPUs work on many chains of thought in parallel. A 50,000-token response doesn't block a GPU for minutes—it gets "time-sliced" across multiple training steps while other responses complete. No memory overflows, no idle workers.
The system also includes repeat detection: if the model starts generating repetitive content (a common failure mode in long generation), it terminates early and applies a penalty. This prevents wasted compute on degenerate responses.
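A schematic of the scheduling logic is below. The `generate` hook is a hypothetical stand-in for the inference engine: it continues a cached response by at most a budgeted number of tokens and reports whether the response finished.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rollout:
    prompt: str
    tokens: list[str] = field(default_factory=list)   # cached segments, reused rather than recomputed
    done: bool = False

def partial_rollout_step(active: list[Rollout],
                         generate: Callable[[str, list[str], int], tuple[list[str], bool]],
                         budget: int = 8_000,
                         repeat_window: int = 50) -> None:
    """Advance every unfinished rollout by at most `budget` tokens this iteration."""
    for r in active:
        if r.done:
            continue
        new_tokens, finished = generate(r.prompt, r.tokens, budget)
        r.tokens.extend(new_tokens)
        r.done = finished
        # Repeat detection: cut the rollout short (a penalty is applied elsewhere)
        # if the tail has degenerated into the same token repeated over and over.
        tail = r.tokens[-repeat_window:]
        if len(tail) == repeat_window and len(set(tail)) == 1:
            r.done = True
```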
Long2Short: Policy Distillation for Efficiency
One concern with long-chain-of-thought models: they're expensive at inference time. A response that takes 10,000 tokens costs 10x more than a 1,000 token response. For many applications, this overhead is unacceptable.
Kimi k1.5 introduces "Long2Short" methods—a form of policy distillation that transfers the reasoning capabilities of a slow, powerful "teacher" model into a fast, efficient "student" model.
In traditional knowledge distillation, a small model learns to mimic a large model's outputs. Policy distillation extends this to RL: the student learns not just what answers to give, but what reasoning strategies lead to correct answers. The long-CoT model's "policy" (its learned behavior) gets compressed into the short-CoT model.
Figure: Short-CoT performance, k1.5-short vs. competitors. AIME scores shown (higher = better): efficient reasoning without long chain-of-thought.
Four Complementary Approaches
1. Model Merging: The simplest approach is to average the weights of the long-CoT model with a standard short-response model. No additional training required. This works surprisingly well as a baseline (approaches 1 and 2 are sketched in code after this list).
2. Shortest Rejection Sampling: Generate 8 responses for each problem and keep only the shortest correct response for supervised fine-tuning. This teaches the model that concise correct answers are preferred over verbose ones.
3. Direct Preference Optimization (DPO): Train the model to prefer shorter responses using pairs:
- Positive: shortest correct solution
- Negative: longer responses (both incorrect AND correct-but-verbose)
This explicitly teaches the model that length itself is undesirable.
4. Long2Short RL: A dedicated RL phase with aggressive length penalties and a reduced maximum response length. The model learns to compress its reasoning while maintaining accuracy.
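The first two approaches are simple enough to sketch directly. The snippets below assume two checkpoints with identical state-dict keys for merging and a list of (response, is_correct) samples for rejection sampling; the 50/50 merge ratio and function names are illustrative.

```python
import torch

def merge_models(long_sd: dict[str, torch.Tensor],
                 short_sd: dict[str, torch.Tensor],
                 alpha: float = 0.5) -> dict[str, torch.Tensor]:
    """Approach 1 (model merging): element-wise weight averaging, no training required."""
    return {k: alpha * long_sd[k] + (1.0 - alpha) * short_sd[k] for k in long_sd}

def shortest_correct(samples: list[tuple[str, bool]]) -> str | None:
    """Approach 2 (shortest rejection sampling): of ~8 samples per problem,
    keep only the shortest correct response as a fine-tuning target."""
    correct = [text for text, ok in samples if ok]
    return min(correct, key=len) if correct else None
```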
Efficiency Results
The short-CoT variant (k1.5-short) achieves remarkable efficiency:
Figure: Token efficiency, score vs. tokens used. k1.5-short achieves 6.5x better scores than GPT-4o with similar token counts.
| Model | AIME Score | Avg Tokens | Tokens per Point |
|---|---|---|---|
| k1.5 (long) | 77.5 | ~32,000 | ~413 |
| k1.5 (short) | 60.8 | 3,272 | ~54 |
| GPT-4o | 9.3 | ~2,000 | ~215 |
| Claude 3.5 | 16.0 | ~2,500 | ~156 |
The short model uses 10x fewer tokens than the long model while retaining 78% of its performance. Compared to GPT-4o, it achieves 6.5x higher scores with similar token counts.
For deployment, you can choose your tradeoff. Need maximum accuracy on the hardest problems? Use the long model. Need cost efficiency for high-volume applications? The short model outperforms competitors at a fraction of the compute.
Multi-Modal Capabilities
Kimi k1.5 isn't just a text model—it processes images too. The multi-modal training enables visual reasoning tasks like chart interpretation, geometry problems, and scientific diagrams.
Vision RL Training
The team curated three types of visual training data:
Real-World Data: Science questions with diagrams, location inference from photos, chart and graph analysis. These teach practical visual understanding.
Synthetic Visual Reasoning: Procedurally generated images testing spatial relationships, geometric properties, and visual logic. Perfect labels enable precise training signal.
Text-Rendered Data: Textual content converted to images. This maintains reasoning capabilities when information is presented visually (common in documents and slides).
Visual Benchmark Results
| Benchmark | Kimi k1.5 | OpenAI o1 | Description |
|---|---|---|---|
| MathVista | 74.9 | 71.0 | Math with diagrams |
| MMMU | 70.0 | - | Multi-modal understanding |
The model processes approximately 1 million text-vision examples during supervised fine-tuning, covering chart interpretation, OCR, image-grounded conversations, visual coding, and visual reasoning.
Benchmark Results
Long-CoT Performance
On benchmarks requiring deep reasoning, k1.5 matches or exceeds o1:
The American Invitational Mathematics Examination (AIME) is one of the hardest high-school mathematics competitions in the United States and the qualifying round for the USA Mathematical Olympiad. A benchmark score of 77.5 corresponds to solving roughly three-quarters of the 15 problems, which would place the model in the top tier of human competitors; typical human qualifiers solve only around 20-30% of the problems.
The Codeforces result is particularly notable: 94th percentile means the model outperforms 94% of human competitive programmers on timed algorithm challenges.
Short-CoT Dominance
The real breakthrough is in efficient reasoning. The short-CoT model dramatically outperforms existing models:
| Model | AIME | MATH-500 | LiveCodeBench |
|---|---|---|---|
| k1.5-short | 60.8 | 94.6 | 47.3 |
| Claude 3.5 Sonnet | 16.0 | 78.3 | 36.3 |
| GPT-4o | 9.3 | 74.6 | 34.2 |
| DeepSeek V3 | 39.2 | 90.2 | 37.6 |
The "+550%" improvement claim comes from comparing k1.5-short's AIME score (60.8) to GPT-4o's (9.3): a 6.5x improvement.
Infrastructure Innovations
Hybrid Training-Inference Framework
Reinforcement learning requires both weight updates (training) and response generation (inference). Traditional setups switch between these modes, wasting time on transitions. Kimi's hybrid framework runs both simultaneously:
- Megatron handles training in dedicated containers
- vLLM handles inference in separate containers
- A checkpoint-engine coordinates weight synchronization
Transitions between training and inference take less than one minute (down from potentially hours). This enables rapid iteration: try a training change, generate samples, evaluate, repeat. Fast cycles accelerate research progress.
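At a high level the loop alternates between updating weights, broadcasting them to the inference replicas, and generating the next batch of rollouts. The sketch below uses hypothetical callables for the three components, since the paper describes the Megatron/vLLM/checkpoint-engine split architecturally rather than at the API level.

```python
from typing import Any, Callable, Iterable

def rl_iteration(train_steps: Callable[[int], dict],               # hypothetical: run optimizer steps, return updated weights
                 push_weights: Callable[[dict], None],              # hypothetical: checkpoint-engine broadcast to inference replicas
                 sample_rollouts: Callable[[int], Iterable[Any]],   # hypothetical: generation on the inference side
                 n_train_steps: int = 1,
                 n_rollouts: int = 1024) -> list[Any]:
    """One iteration of the alternating train / sync / generate loop."""
    weights = train_steps(n_train_steps)       # update the policy on the latest batch of rollouts
    push_weights(weights)                      # fast weight sync keeps the transition under a minute
    return list(sample_rollouts(n_rollouts))   # fresh responses for the next policy update
```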
Code Sandbox Optimization
To train via RL on coding tasks, the model must generate and execute millions of code snippets to get reward signals. Each generated program needs to run in a sandbox to verify correctness (and prevent malicious code from damaging the training infrastructure). This creates a tight feedback loop: generate code → execute → get reward → update policy → repeat.
Standard Docker was too slow for this loop. Container startup overhead became a bottleneck when you need to spin up hundreds of sandboxes per second.
The team optimized container startup:
- crun runtime: 0.04s startup vs Docker's 0.12s (3x faster)
- Pre-created cgroups: Eliminates high-concurrency bottlenecks
- Overlay filesystem with tmpfs: High-speed, fixed-size storage
Result: 120 containers/second vs 27 with Docker (4.4x improvement).
Chain-of-Thought Reward Model
For problems without automatic verification (open-ended math, reasoning), the team trained a reward model that reasons through its assessment. This "thinking" reward model achieved 98.5% accuracy vs 84.4% for traditional reward models—a crucial improvement for stable RL training.
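A plausible interface for such a reward model is sketched below. The prompt template and the `VERDICT:` convention are assumptions; the paper describes the idea (reason through a critique first, then judge) but not the exact prompt.

```python
def build_cot_rm_prompt(problem: str, reference: str, solution: str) -> str:
    """Prompt a 'thinking' reward model to critique before it judges."""
    return (
        "You are grading a submitted solution.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Submitted solution:\n{solution}\n\n"
        "Reason step by step about whether the submission is correct, then end "
        "with a single line: VERDICT: CORRECT or VERDICT: INCORRECT."
    )

def verdict_to_reward(rm_output: str) -> float:
    """Map the reward model's final verdict to an RL training reward."""
    return 1.0 if "VERDICT: CORRECT" in rm_output else -1.0
```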
Business Implications
This paper has significant ramifications for the AI industry. Here's what different stakeholders can expect:
For AI Labs and Researchers
Lower Barrier to Frontier Reasoning: The simplified RL framework means research teams can achieve o1-level reasoning without implementing complex MCTS, value functions, or process reward models. A well-funded university lab can now pursue reasoning research that was previously thought to require OpenAI-scale resources.
Faster Iteration Cycles: Fewer components mean faster debugging and experimentation. Teams can test hypotheses about reasoning without untangling interactions between multiple subsystems.
Reproducible Baselines: The detailed training recipes provide a concrete starting point for future research, accelerating the field's progress on reasoning AI.
For Enterprise AI Teams
Efficient Deployment Options: The Long2Short distillation techniques enable cost-efficient reasoning models. A short-CoT model that scores 60.8 on AIME while averaging ~3,000 tokens per response means enterprise applications can get strong reasoning at predictable, manageable inference costs.
Compute-Quality Tradeoffs: Organizations can choose their operating point. Use long-CoT for high-stakes decisions where accuracy matters most; use short-CoT for high-volume applications where cost efficiency is critical.
Infrastructure Simplification: The hybrid training-inference framework and container optimizations provide practical blueprints for teams building their own reasoning model training infrastructure.
For Model Providers
Competitive Pressure: Moonshot AI matching o1 with simpler techniques signals that reasoning capabilities may commoditize faster than expected. Providers relying on complex, hard-to-replicate architectures may find their moat eroding.
New Efficiency Benchmarks: The short-CoT results set a new bar for inference efficiency. Models that can't match similar quality-per-token ratios may struggle to justify their costs.
For Developers Building Applications
Reasoning Without Vendor Lock-in: As more teams replicate these techniques, developers gain options. Reasoning-capable models from multiple providers reduce dependency on any single API.
Predictable Costs: Short-CoT models with bounded token usage make budgeting for AI features more predictable than open-ended long reasoning chains.
For the Industry
Democratization of Reasoning AI: The core message is that frontier reasoning doesn't require frontier complexity. This lowers barriers for smaller players and accelerates the transition from "reasoning as a luxury" to "reasoning as infrastructure."
Limitations
The paper acknowledges several open challenges:
Verification Remains Hard
Automatically verifying answers is straightforward for code (run tests) and simple math (check the number). But for complex proofs, open-ended reasoning, or nuanced arguments, verification is unsolved. The team notes "developing more advanced verification models remains an open direction."
Reward Hacking Risks
Some problems have easily-guessable answers even if the reasoning is wrong. A model might learn to exploit these patterns rather than actually solve problems. The team addresses this through careful prompt curation, but it's not fully solved.
If 10% of math problems have answers that are "probably 0 or 1," a model could learn to guess these correctly without understanding the problem. The paper excludes such "easy-to-hack" prompts, but this requires manual curation that may not scale.
Overthinking
Long-context models sometimes generate excessively long responses even when unnecessary. The length penalty helps but doesn't eliminate this. Responses that could be 1,000 tokens might balloon to 10,000, wasting compute without improving accuracy.
Training Efficiency
While simpler than alternatives, RL training for reasoning is still expensive. The paper doesn't disclose total compute costs, but 128K context training at scale requires substantial GPU clusters.
Conclusion
Kimi k1.5 demonstrates that the path to reasoning AI may be simpler than assumed. By removing complex components and focusing on core principles—policy optimization, long context, and careful length management—Moonshot AI matched the performance of more elaborate systems.
Key Takeaways:
- Simplicity works: No MCTS, no value functions, no process reward models—just clean policy optimization
- Long context enables reasoning: 128K tokens let models think deeply when problems require it
- Efficiency is achievable: Long2Short methods deliver strong performance with 10x fewer tokens
- Multi-modal extends naturally: The same framework handles vision-language reasoning
- Infrastructure matters: Fast containers and hybrid frameworks enable practical RL training
For practitioners, k1.5 offers a blueprint: you don't need every technique in the literature. Start simple, scale context, optimize for efficiency. While the model weights remain closed, the paper provides detailed training recipes that teams with sufficient compute can replicate.
Original paper: arXiv ・ PDF ・ HTML
Cite this paper
Kimi Team (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint.