From GitHub
MiMo-7B, a 7B-parameter model series, excels at reasoning thanks to optimized pre-training and post-training, matching OpenAI o1-mini on math and code benchmarks. It uses rule-based reinforcement learning rewards and multi-token prediction (MTP) for enhanced performance.
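As a concrete illustration of the multi-token prediction (MTP) idea mentioned above, here is a toy sketch of an auxiliary head that predicts the token two positions ahead alongside the usual next-token head. The module layout and the loss weighting are assumptions for illustration, not MiMo-7B's actual MTP design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTPHead(nn.Module):
    """Toy multi-token prediction (MTP) head.

    Alongside the standard next-token head, a lightweight extra head predicts
    the token two positions ahead from the same hidden states, giving a denser
    training signal (and a basis for speculative decoding at inference).
    Illustrative sketch only, not MiMo-7B's actual MTP module.
    """

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.next_token = nn.Linear(hidden_dim, vocab_size)  # standard LM head
        self.plus_two = nn.Linear(hidden_dim, vocab_size)    # predicts token at t+2

    def forward(self, hidden: torch.Tensor, targets: torch.Tensor, mtp_weight: float = 0.3):
        # hidden: (batch, seq, hidden_dim); targets: (batch, seq) token ids.
        # The 0.3 auxiliary-loss weight is an arbitrary illustrative value.
        logits_1 = self.next_token(hidden[:, :-1])            # predicts targets[:, 1:]
        logits_2 = self.plus_two(hidden[:, :-2])              # predicts targets[:, 2:]
        loss_1 = F.cross_entropy(logits_1.flatten(0, 1), targets[:, 1:].flatten())
        loss_2 = F.cross_entropy(logits_2.flatten(0, 1), targets[:, 2:].flatten())
        return loss_1 + mtp_weight * loss_2                   # next-token loss + auxiliary MTP loss
```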
Here are a few insights from the MiMo-7B paper that could be compelling for a podcast audience:
Challenging the "Bigger is Always Better" Paradigm: MiMo-7B demonstrates that a smaller, carefully trained model can outperform much larger models (e.g., 32B) on reasoning tasks. This challenges the assumption that sheer model size is the primary driver of performance, suggesting that data quality and training strategy matter as much as, if not more than, parameter count.
Born for Reasoning: Pre-training Tailored to Reasoning: MiMo-7B emphasizes pre-training strategies specifically designed for reasoning. By optimizing the data preprocessing pipeline and generating synthetic reasoning data, the model is "born for reasoning," giving it a strong foundation before post-training. This highlights the significance of data curation and targeted pre-training for building specific capabilities.
RL with Rule-Based Rewards: Avoiding Reward Hacking: The paper's use of rule-based accuracy rewards in reinforcement learning, rather than a learned reward model, is a notable approach. Because the reward is a verifiable check on the final answer, it leaves little surface for reward hacking, and MiMo-7B's reasoning capabilities come out more reliable and robust. This is particularly relevant given growing concerns about reward gaming in LLMs.
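To make the rule-based reward idea concrete, here is a minimal sketch of a verifiable accuracy reward for math problems. The \boxed{} answer convention, the extract_final_answer helper, and the matching rules are illustrative assumptions, not the paper's exact implementation.

```python
import re
from fractions import Fraction


def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a model completion.

    Assumes the model is prompted to put its final answer in \\boxed{},
    a common convention in math RL setups; not necessarily MiMo's format.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def rule_based_math_reward(completion: str, reference_answer: str) -> float:
    """Binary accuracy reward: 1.0 iff the extracted answer matches the reference.

    Because the check is a deterministic rule rather than a learned reward model,
    there is no neural judge for the policy to game; the only way to score is to
    produce the correct final answer.
    """
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    try:
        # Compare numerically when both sides parse as exact fractions/decimals.
        return 1.0 if Fraction(predicted) == Fraction(reference_answer) else 0.0
    except ValueError:
        # Fall back to a normalized string comparison for symbolic answers.
        return 1.0 if predicted.replace(" ", "") == reference_answer.replace(" ", "") else 0.0
```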
Data Re-sampling Strategy for Easy Problems: The easy-problem re-sampling strategy, which improves rollout sampling efficiency and stabilizes policy updates in the later phases of RL training, is an exciting implementation detail.
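A rough sketch of what such an easy-problem re-sampling scheme could look like is below. The pass-rate threshold, the 10% mixing probability, and the EasyPoolSampler structure are illustrative assumptions rather than the paper's exact recipe.

```python
import random
from dataclasses import dataclass, field


@dataclass
class EasyPoolSampler:
    """Sketch of easy-problem re-sampling for RL rollouts.

    Problems whose rollouts are (nearly) all correct are moved into an "easy
    pool"; later in training, a small fraction of each batch is drawn from that
    pool so rollout sampling stays efficient and policy updates stay stable.
    Threshold and mixing probability are illustrative, not the paper's values.
    """

    easy_threshold: float = 0.9      # pass rate above which a problem counts as "easy"
    easy_sample_prob: float = 0.1    # chance of drawing from the easy pool per sample
    hard_pool: list = field(default_factory=list)
    easy_pool: list = field(default_factory=list)

    def update(self, problem, pass_rate: float) -> None:
        """Route a problem to the easy or hard pool based on its rollout pass rate."""
        pool = self.easy_pool if pass_rate >= self.easy_threshold else self.hard_pool
        pool.append(problem)

    def sample_batch(self, batch_size: int) -> list:
        """Draw mostly hard problems, mixing in an occasional easy one."""
        batch = []
        for _ in range(batch_size):
            use_easy = bool(self.easy_pool) and random.random() < self.easy_sample_prob
            pool = self.easy_pool if use_easy else (self.hard_pool or self.easy_pool)
            batch.append(random.choice(pool))
        return batch
```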
Open Source Availability for Community Benefit: The release of the MiMo-7B series, including base, SFT, and RL models, as open-source resources is a significant contribution to the LLM community. This allows researchers and developers to build upon and extend MiMo-7B, fostering further innovation in reasoning models.