- Simplicity Wins: Kimi k1.5 matches OpenAI o1 on math (77.5 AIME) and coding (94th percentile Codeforces) without MCTS, value functions, or process reward models — just policy optimization with length penalties
- 128K Context Matters: Extended context lets models "think longer" on hard problems. The short-CoT variant achieves 60.8 AIME with only ~3,000 tokens — 550% better than GPT-4o
- The Implication: Frontier reasoning doesn't require frontier complexity. Teams can achieve o1-level performance with simpler, more reproducible training methods
Research Overview
OpenAI's o1 model demonstrated that letting language models "think longer" through chain-of-thought reasoning dramatically improves performance on hard problems. But o1 remained closed, leaving researchers to speculate about what techniques made it work. Many assumed complex methods like Monte Carlo tree search (MCTS), learned value functions, or process reward models were essential.
Kimi k1.5 challenges this assumption. Moonshot AI's team achieved o1-level performance using a surprisingly simple approach: standard policy optimization with careful length management. No tree search. No value networks. No step-by-step reward models.
Complex RL techniques like MCTS and value functions were designed for games with clear state spaces (chess, Go). Language reasoning is different—the "state" is a partial text response, and the action space is essentially infinite. Kimi k1.5 shows that simpler methods work better when adapted properly for language.
Key Results at a Glance
| Benchmark | Kimi k1.5 | OpenAI o1 | What It Tests |
|---|---|---|---|
| AIME 2024 | 77.5 | 74.4 | Competition math (hardest) |
| MATH-500 | 96.2 | 94.8 | Mathematical reasoning |
| Codeforces | 94th %ile | 94th %ile | Competitive programming |
| MathVista | 74.9 | 71.0 | Visual math reasoning |
| LiveCodeBench | 62.5 | 67.2 | Real-world coding |
Figure: Long-CoT benchmark comparison, Kimi k1.5 vs OpenAI o1. Both models achieve similar performance, but k1.5 uses simpler training methods.
Why This Matters
The Reasoning Revolution
Traditional language models predict the next token based on patterns in training data. They're essentially very sophisticated pattern matchers. But hard problems—competition math, complex code, scientific reasoning—require more than pattern matching. They require thinking.
Instead of jumping directly to an answer, chain-of-thought prompts models to "show their work"—writing out intermediate reasoning steps. This simple technique dramatically improves accuracy on complex problems because each step can be checked and corrected before proceeding.
The challenge is training models to produce good chains of thought. Reinforcement learning offers a path: reward correct final answers, let the model learn what reasoning patterns lead to success. But RL for language is notoriously unstable and sample-inefficient.
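In verifiable domains, "reward correct final answers" can be as literal as the sketch below: extract a final answer, compare it to ground truth, and return a scalar reward. This illustrates the general recipe rather than Kimi's actual verifier (the paper relies on test execution for code and a trained reward model for harder math), and the `Answer:` convention is an assumption.

```python
import re

def outcome_reward(response: str, ground_truth: str) -> float:
    """Score only the final answer, not the intermediate reasoning steps.

    Illustrative sketch: assumes the model ends its chain of thought with
    a line of the form "Answer: <value>".
    """
    match = re.search(r"Answer:\s*(.+)", response)
    if match is None:
        return -1.0                                   # no parseable final answer
    predicted = match.group(1).strip()
    return 1.0 if predicted == ground_truth else -1.0
```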
The Complexity Trap
After o1's release, many teams rushed to replicate it using increasingly complex techniques:
- Monte Carlo Tree Search: Explore multiple reasoning paths, backtrack when stuck
- Value Functions: Learn to predict success probability at each step
- Process Reward Models: Score intermediate reasoning steps, not just final answers
- Beam Search with Verification: Generate multiple candidates, filter with trained verifiers
Kimi k1.5 shows these additions may be unnecessary overhead. The team achieved comparable results with just:
- Policy optimization (standard RL)
- Length penalties (prevent overthinking)
- Long context training (let the model think deeply when needed)
The Simplicity Principle
Policy Optimization Without Extras
The Kimi k1.5 training framework uses online mirror descent with KL regularization—a standard RL algorithm. For each problem, the model generates multiple response attempts. Correct answers receive positive reward; incorrect answers receive negative reward.
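A minimal PyTorch sketch of this kind of update is shown below. It is not the paper's exact mirror-descent surrogate, but it keeps the same ingredients: a handful of sampled responses per problem, outcome rewards with a mean-reward baseline in place of a learned value function, and a KL penalty that keeps the updated policy close to the sampling policy. The function name, tensor layout, and `tau` coefficient are illustrative.

```python
import torch

def k15_style_loss(logp_new: torch.Tensor,   # [k] log-prob of each sampled response under the current policy
                   logp_old: torch.Tensor,   # [k] log-prob under the policy that generated the samples
                   rewards: torch.Tensor,    # [k] outcome rewards, e.g. +1 correct / -1 incorrect
                   tau: float = 0.1) -> torch.Tensor:
    """Policy-gradient surrogate with a KL proximity penalty and no value network."""
    advantage = rewards - rewards.mean()                # baseline = mean reward over the k samples
    pg = -(advantage.detach() * logp_new).mean()        # reinforce correct responses, push away from incorrect ones
    log_ratio = logp_new - logp_old.detach()
    kl = (log_ratio.exp() - 1.0 - log_ratio).mean()     # nonnegative estimator of KL(pi_old || pi_new)
    return pg + tau * kl
```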
Figure: The simplicity approach: what k1.5 uses vs. what it removes, achieving o1-level performance with fewer components.
Value functions predict "how likely am I to succeed from this point?" In games like chess, this is useful because you can evaluate positions. In language, the "position" is a partial response, and evaluating it requires essentially solving the whole problem. The Kimi team argues that exploring more responses is more valuable than trying to predict their success.
The key innovations are in the details:
Negative Gradients Matter: Unlike simpler methods that only learn from successes (rejection sampling), k1.5 explicitly penalizes incorrect responses. This "push away from wrong answers" signal accelerates learning.
Curriculum Sampling: Start training on a mix of easy and hard problems. As training progresses, shift focus to harder problems where the model still makes mistakes. This prevents wasting compute on already-solved easy cases.
Length Penalty Warmup: Gradually increase the penalty for overly long responses. If the penalty is too high from the start, the model never learns to reason deeply; if it stays too low, responses ramble.
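The length term has roughly the shape sketched below: within the batch of samples for one problem, shorter responses earn a bonus and longer ones a penalty, while incorrect responses can only be penalized, never rewarded, for brevity. During training this term is blended in gradually (the warmup above); the constants here are illustrative.

```python
def length_rewards(lengths: list[int], correct: list[bool]) -> list[float]:
    """Per-sample length shaping added on top of the outcome reward."""
    lo, hi = min(lengths), max(lengths)
    span = max(hi - lo, 1)                          # guard against all-equal lengths
    shaped = []
    for n, ok in zip(lengths, correct):
        lam = 0.5 - (n - lo) / span                 # +0.5 for the shortest response, -0.5 for the longest
        shaped.append(lam if ok else min(0.0, lam)) # wrong answers never gain a brevity bonus
    return shaped
```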
What They Removed
The paper explicitly notes what they did not use:
| Technique | Status | Reasoning |
|---|---|---|
| Monte Carlo Tree Search | Removed | "Not suitable for our context" |
| Value Functions | Removed | Exploration more valuable than prediction |
| Process Reward Models | Removed | End-to-end learning sufficient |
| Beam Search | Removed | Simple sampling works |
MCTS requires expensive computation during inference—every user query triggers a search through multiple reasoning paths. Kimi's approach moves this cost to training time: invest heavily in learning a better policy upfront, so the model can respond directly without searching. This is a massive difference for deployment: one API call vs. potentially hundreds of internal searches per query.
This simplification isn't just philosophical—it has practical benefits. Fewer components means faster iteration, easier debugging, and more stable training.
Long Context Reasoning
The 128K Token Advantage
Kimi k1.5 scales context to 128,000 tokens—roughly 200 pages of text. This matters because harder problems require longer reasoning chains.
Complex math problems might require dozens of intermediate steps. A model limited to short responses must compress or skip steps, losing accuracy. With 128K tokens, k1.5 can write out every step, check its work, try alternative approaches, and still have room for the final answer.
The paper shows a strong correlation: problems requiring longer reasoning chains benefit most from long context. On AIME (the hardest math benchmark), the average successful response is significantly longer than on easier problems.
Partial Rollouts: Making Long Context Practical
Training with 128K token responses is computationally expensive. A naive approach would be prohibitively slow. The Kimi team developed "partial rollouts" to solve this:
- Set a fixed token budget per training step (e.g., 8K tokens)
- If a response exceeds this budget, pause and save progress
- Resume from the saved checkpoint in the next iteration
- Previous segments are cached and reused, not recomputed
Think of it like CPU time-slicing in operating systems. Just as an OS switches between tasks so no single process blocks the CPU, partial rollouts let GPUs work on many chains of thought in parallel. A 50,000-token response doesn't block a GPU for minutes—it gets "time-sliced" across multiple training steps while other responses complete. No memory overflows, no idle workers.
The system also includes repeat detection: if the model starts generating repetitive content (a common failure mode in long generation), it terminates early and applies a penalty. This prevents wasted compute on degenerate responses.
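A schematic of the scheduling logic is below. The `generate` hook is a hypothetical stand-in for the inference engine: it continues a cached response by at most a budgeted number of tokens and reports whether the response finished.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rollout:
    prompt: str
    tokens: list[str] = field(default_factory=list)   # cached segments, reused rather than recomputed
    done: bool = False

def partial_rollout_step(active: list[Rollout],
                         generate: Callable[[str, list[str], int], tuple[list[str], bool]],
                         budget: int = 8_000,
                         repeat_window: int = 50) -> None:
    """Advance every unfinished rollout by at most `budget` tokens this iteration."""
    for r in active:
        if r.done:
            continue
        new_tokens, finished = generate(r.prompt, r.tokens, budget)
        r.tokens.extend(new_tokens)
        r.done = finished
        # Repeat detection: cut the rollout short (a penalty is applied elsewhere)
        # if the tail has degenerated into the same token repeated over and over.
        tail = r.tokens[-repeat_window:]
        if len(tail) == repeat_window and len(set(tail)) == 1:
            r.done = True
```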
Long2Short: Policy Distillation for Efficiency
One concern with long-chain-of-thought models: they're expensive at inference time. A response that takes 10,000 tokens costs 10x more than a 1,000 token response. For many applications, this overhead is unacceptable.
Kimi k1.5 introduces "Long2Short" methods—a form of policy distillation that transfers the reasoning capabilities of a slow, powerful "teacher" model into a fast, efficient "student" model.
In traditional knowledge distillation, a small model learns to mimic a large model's outputs. Policy distillation extends this to RL: the student learns not just what answers to give, but what reasoning strategies lead to correct answers. The long-CoT model's "policy" (its learned behavior) gets compressed into the short-CoT model.
Figure: Short-CoT performance, k1.5-short vs. competitors. AIME scores shown (higher = better): efficient reasoning without long chain-of-thought.
Four Complementary Approaches
1. Model Merging: The simplest approach is to average the weights of the long-CoT model with a standard short-response model. No additional training required. This works surprisingly well as a baseline (approaches 1 and 2 are sketched in code after this list).
2. Shortest Rejection Sampling: Generate 8 responses for each problem and keep only the shortest correct response for supervised fine-tuning. This teaches the model that concise correct answers are preferred over verbose ones.
3. Direct Preference Optimization (DPO): Train the model to prefer shorter responses using pairs:
- Positive: shortest correct solution
- Negative: longer responses (both incorrect AND correct-but-verbose)
This explicitly teaches the model that length itself is undesirable.
4. Long2Short RL: A dedicated RL phase with aggressive length penalties and a reduced maximum response length. The model learns to compress its reasoning while maintaining accuracy.
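The first two approaches are simple enough to sketch directly. The snippets below assume two checkpoints with identical state-dict keys for merging and a list of (response, is_correct) samples for rejection sampling; the 50/50 merge ratio and function names are illustrative.

```python
import torch

def merge_models(long_sd: dict[str, torch.Tensor],
                 short_sd: dict[str, torch.Tensor],
                 alpha: float = 0.5) -> dict[str, torch.Tensor]:
    """Approach 1 (model merging): element-wise weight averaging, no training required."""
    return {k: alpha * long_sd[k] + (1.0 - alpha) * short_sd[k] for k in long_sd}

def shortest_correct(samples: list[tuple[str, bool]]) -> str | None:
    """Approach 2 (shortest rejection sampling): of ~8 samples per problem,
    keep only the shortest correct response as a fine-tuning target."""
    correct = [text for text, ok in samples if ok]
    return min(correct, key=len) if correct else None
```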
Efficiency Results
The short-CoT variant (k1.5-short) achieves remarkable efficiency:
Figure: Token efficiency, score vs. tokens used. k1.5-short achieves 6.5x better scores than GPT-4o with similar token counts.
| Model | AIME Score | Avg Tokens | Tokens per Point |
|---|---|---|---|
| k1.5 (long) | 77.5 | ~32,000 | ~413 |
| k1.5 (short) | 60.8 | 3,272 | ~54 |
| GPT-4o | 9.3 | ~2,000 | ~215 |
| Claude 3.5 | 16.0 | ~2,500 | ~156 |
The short model uses 10x fewer tokens than the long model while retaining 78% of its performance. Compared to GPT-4o, it achieves 6.5x higher scores with similar token counts.
For deployment, you can choose your tradeoff. Need maximum accuracy on the hardest problems? Use the long model. Need cost efficiency for high-volume applications? The short model outperforms competitors at a fraction of the compute.
Multi-Modal Capabilities
Kimi k1.5 isn't just a text model—it processes images too. The multi-modal training enables visual reasoning tasks like chart interpretation, geometry problems, and scientific diagrams.
Vision RL Training
The team curated three types of visual training data:
Real-World Data: Science questions with diagrams, location inference from photos, chart and graph analysis. These teach practical visual understanding.
Synthetic Visual Reasoning: Procedurally generated images testing spatial relationships, geometric properties, and visual logic. Perfect labels enable precise training signal.
Text-Rendered Data: Textual content converted to images. This maintains reasoning capabilities when information is presented visually (common in documents and slides).
Visual Benchmark Results
| Benchmark | Kimi k1.5 | OpenAI o1 | Description |
|---|---|---|---|
| MathVista | 74.9 | 71.0 | Math with diagrams |
| MMMU | 70.0 | - | Multi-modal understanding |
The model processes approximately 1 million text-vision examples during supervised fine-tuning, covering chart interpretation, OCR, image-grounded conversations, visual coding, and visual reasoning.
Benchmark Results
Long-CoT Performance
On benchmarks requiring deep reasoning, k1.5 matches or exceeds o1:
The American Invitational Mathematics Examination (AIME) is one of the hardest high-school mathematics competitions in the United States and the qualifying round for the USA Mathematical Olympiad. A benchmark score of 77.5 corresponds to solving roughly three-quarters of the 15 problems, which would place the model in the top tier of human competitors; typical human qualifiers solve only around 20-30% of the problems.
The Codeforces result is particularly notable: 94th percentile means the model outperforms 94% of human competitive programmers on timed algorithm challenges.
Short-CoT Dominance
The real breakthrough is in efficient reasoning. The short-CoT model dramatically outperforms existing models:
| Model | AIME | MATH-500 | LiveCodeBench |
|---|---|---|---|
| k1.5-short | 60.8 | 94.6 | 47.3 |
| Claude 3.5 Sonnet | 16.0 | 78.3 | 36.3 |
| GPT-4o | 9.3 | 74.6 | 34.2 |
| DeepSeek V3 | 39.2 | 90.2 | 37.6 |
The "+550%" improvement claim comes from comparing k1.5-short's AIME score (60.8) to GPT-4o's (9.3): a 6.5x improvement.
Infrastructure Innovations
Hybrid Training-Inference Framework
Reinforcement learning requires both weight updates (training) and response generation (inference). Traditional setups switch between these modes, wasting time on transitions. Kimi's hybrid framework runs both simultaneously:
- Megatron handles training in dedicated containers
- vLLM handles inference in separate containers
- A checkpoint-engine coordinates weight synchronization
Transitions between training and inference take less than one minute (down from potentially hours). This enables rapid iteration: try a training change, generate samples, evaluate, repeat. Fast cycles accelerate research progress.
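At a high level the loop alternates between updating weights, broadcasting them to the inference replicas, and generating the next batch of rollouts. The sketch below uses hypothetical callables for the three components, since the paper describes the Megatron/vLLM/checkpoint-engine split architecturally rather than at the API level.

```python
from typing import Any, Callable, Iterable

def rl_iteration(train_steps: Callable[[int], dict],               # hypothetical: run optimizer steps, return updated weights
                 push_weights: Callable[[dict], None],              # hypothetical: checkpoint-engine broadcast to inference replicas
                 sample_rollouts: Callable[[int], Iterable[Any]],   # hypothetical: generation on the inference side
                 n_train_steps: int = 1,
                 n_rollouts: int = 1024) -> list[Any]:
    """One iteration of the alternating train / sync / generate loop."""
    weights = train_steps(n_train_steps)       # update the policy on the latest batch of rollouts
    push_weights(weights)                      # fast weight sync keeps the transition under a minute
    return list(sample_rollouts(n_rollouts))   # fresh responses for the next policy update
```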
Code Sandbox Optimization
To train via RL on coding tasks, the model must generate and execute millions of code snippets to get reward signals. Each generated program needs to run in a sandbox to verify correctness (and prevent malicious code from damaging the training infrastructure). This creates a tight feedback loop: generate code → execute → get reward → update policy → repeat.
Standard Docker was too slow for this loop. Container startup overhead became a bottleneck when you need to spin up hundreds of sandboxes per second.
The team optimized container startup:
- crun runtime: 0.04s startup vs Docker's 0.12s (3x faster)
- Pre-created cgroups: Eliminates high-concurrency bottlenecks
- Overlay filesystem with tmpfs: High-speed, fixed-size storage
Result: 120 containers/second vs 27 with Docker (4.4x improvement).
Chain-of-Thought Reward Model
For problems without automatic verification (open-ended math, reasoning), the team trained a reward model that reasons through its assessment. This "thinking" reward model achieved 98.5% accuracy vs 84.4% for traditional reward models—a crucial improvement for stable RL training.
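A plausible interface for such a reward model is sketched below. The prompt template and the `VERDICT:` convention are assumptions; the paper describes the idea (reason through a critique first, then judge) but not the exact prompt.

```python
def build_cot_rm_prompt(problem: str, reference: str, solution: str) -> str:
    """Prompt a 'thinking' reward model to critique before it judges."""
    return (
        "You are grading a submitted solution.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Submitted solution:\n{solution}\n\n"
        "Reason step by step about whether the submission is correct, then end "
        "with a single line: VERDICT: CORRECT or VERDICT: INCORRECT."
    )

def verdict_to_reward(rm_output: str) -> float:
    """Map the reward model's final verdict to an RL training reward."""
    return 1.0 if "VERDICT: CORRECT" in rm_output else -1.0
```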
Business Implications
This paper has significant ramifications for the AI industry. Here's what different stakeholders can expect:
For AI Labs and Researchers
Lower Barrier to Frontier Reasoning: The simplified RL framework means research teams can achieve o1-level reasoning without implementing complex MCTS, value functions, or process reward models. A well-funded university lab can now pursue reasoning research that was previously thought to require OpenAI-scale resources.
Faster Iteration Cycles: Fewer components mean faster debugging and experimentation. Teams can test hypotheses about reasoning without untangling interactions between multiple subsystems.
Reproducible Baselines: The detailed training recipes provide a concrete starting point for future research, accelerating the field's progress on reasoning AI.
For Enterprise AI Teams
Efficient Deployment Options: The Long2Short distillation techniques enable cost-efficient reasoning models. A short-CoT model that scores 60.8 on AIME while averaging ~3,000 tokens per response means enterprise applications can get strong reasoning at predictable, manageable inference costs.
Compute-Quality Tradeoffs: Organizations can choose their operating point. Use long-CoT for high-stakes decisions where accuracy matters most; use short-CoT for high-volume applications where cost efficiency is critical.
Infrastructure Simplification: The hybrid training-inference framework and container optimizations provide practical blueprints for teams building their own reasoning model training infrastructure.
For Model Providers
Competitive Pressure: Moonshot AI matching o1 with simpler techniques signals that reasoning capabilities may commoditize faster than expected. Providers relying on complex, hard-to-replicate architectures may find their moat eroding.
New Efficiency Benchmarks: The short-CoT results set a new bar for inference efficiency. Models that can't match similar quality-per-token ratios may struggle to justify their costs.
For Developers Building Applications
Reasoning Without Vendor Lock-in: As more teams replicate these techniques, developers gain options. Reasoning-capable models from multiple providers reduce dependency on any single API.
Predictable Costs: Short-CoT models with bounded token usage make budgeting for AI features more predictable than open-ended long reasoning chains.
For the Industry
Democratization of Reasoning AI: The core message is that frontier reasoning doesn't require frontier complexity. This lowers barriers for smaller players and accelerates the transition from "reasoning as a luxury" to "reasoning as infrastructure."
Limitations
The paper acknowledges several open challenges:
Verification Remains Hard
Automatically verifying answers is straightforward for code (run tests) and simple math (check the number). But for complex proofs, open-ended reasoning, or nuanced arguments, verification is unsolved. The team notes "developing more advanced verification models remains an open direction."
Reward Hacking Risks
Some problems have easily-guessable answers even if the reasoning is wrong. A model might learn to exploit these patterns rather than actually solve problems. The team addresses this through careful prompt curation, but it's not fully solved.
If 10% of math problems have answers that are "probably 0 or 1," a model could learn to guess these correctly without understanding the problem. The paper excludes such "easy-to-hack" prompts, but this requires manual curation that may not scale.
Overthinking
Long-context models sometimes generate excessively long responses even when unnecessary. The length penalty helps but doesn't eliminate this. Responses that could be 1,000 tokens might balloon to 10,000, wasting compute without improving accuracy.
Training Efficiency
While simpler than alternatives, RL training for reasoning is still expensive. The paper doesn't disclose total compute costs, but 128K context training at scale requires substantial GPU clusters.
Conclusion
Kimi k1.5 demonstrates that the path to reasoning AI may be simpler than assumed. By removing complex components and focusing on core principles—policy optimization, long context, and careful length management—Moonshot AI matched the performance of more elaborate systems.
Key Takeaways:
- Simplicity works: No MCTS, no value functions, no process reward models—just clean policy optimization
- Long context enables reasoning: 128K tokens let models think deeply when problems require it
- Efficiency is achievable: Long2Short methods deliver strong performance with 10x fewer tokens
- Multi-modal extends naturally: The same framework handles vision-language reasoning
- Infrastructure matters: Fast containers and hybrid frameworks enable practical RL training
For practitioners, k1.5 offers a blueprint: you don't need every technique in the literature. Start simple, scale context, optimize for efficiency. While the model weights remain closed, the paper provides detailed training recipes that teams with sufficient compute can replicate.
Original paper: arXiv ・ PDF ・ HTML
Cite this paper
Kimi Team (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint.