- tags
- Foundation models, Reinforcement learning, Transformers, Scaling laws
- source
- (MiniMax 2025)
Summary
MiniMax-M1 is the first open-weight, large-scale reasoning model built on a hybrid attention architecture that combines softmax attention with lightning attention (an I/O-aware implementation of linear attention). The model uses a Mixture-of-Experts (MoE) design with 456 billion total parameters (45.9B activated per token) and natively supports a 1 million token context length — 8x that of DeepSeek R1. The hybrid design interleaves softmax attention blocks with linear attention (TransNormer-style) blocks in a 1:7 ratio, enabling near-linear scaling of inference FLOPs with generation length.
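The 1:7 interleaving can be sketched as a simple layout rule. This is a minimal illustration of the ratio described above; the exact block ordering is an assumption, not taken from the released model config:

```python
def block_types(num_blocks: int, softmax_every: int = 8) -> list[str]:
    """Illustrative hybrid layout: one softmax attention block per
    `softmax_every` blocks, the rest linear (lightning) attention,
    giving the 1:7 softmax-to-linear ratio described in the summary."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "linear"
        for i in range(num_blocks)
    ]

layout = block_types(16)
print(layout.count("softmax"), layout.count("linear"))  # 2 14
```

Because only every eighth block pays the quadratic softmax cost, total attention FLOPs grow close to linearly in sequence length.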
The model is trained through continual pretraining on 7.5T tokens, supervised fine-tuning to instill chain-of-thought reasoning patterns, and large-scale reinforcement learning. A key contribution is CISPO (Clipped IS-weight Policy Optimization), a novel RL algorithm that clips importance sampling weights rather than token-level policy updates as in PPO/GRPO. This preserves gradient contributions from rare but important reasoning tokens (e.g., “However”, “Wait”, “Recheck”) that PPO-style clipping would otherwise drop, a loss the authors found especially problematic in hybrid-attention architectures. CISPO achieves a 2x speedup over DAPO on AIME 2024.
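The difference between PPO-style token clipping and CISPO's IS-weight clipping shows up in the per-token gradient coefficient, i.e., the factor multiplying grad(log pi) for each token. A hedged sketch (the function names and the symmetric clip bounds are illustrative, not the paper's exact hyperparameters):

```python
def ppo_grad_coeff(ratio: float, adv: float, eps: float = 0.2) -> float:
    """Coefficient on grad(log pi) under PPO/GRPO token clipping.
    When the ratio is clipped on the beneficial side, the clipped term
    is constant in the parameters, so the token's gradient is exactly
    zero and the update is dropped."""
    clipped_out = (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps)
    return 0.0 if clipped_out else ratio * adv

def cispo_grad_coeff(ratio: float, adv: float,
                     eps_low: float = 0.2, eps_high: float = 0.2) -> float:
    """CISPO clips the importance-sampling weight itself (treated as a
    stop-gradient constant), so the coefficient is bounded but never
    zeroed out for a nonzero advantage: every token keeps contributing."""
    return max(1 - eps_low, min(1 + eps_high, ratio)) * adv
```

A rare “fork” token whose ratio overshoots (say ratio = 1.5 with positive advantage) gets coefficient 0 under PPO clipping but a bounded nonzero coefficient under CISPO, which is exactly the failure mode the summary describes.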
The full RL training run completed in 3 weeks on 512 H800 GPUs at a cost of roughly $534K. Two model variants are released, with 40K and 80K thinking budgets. On benchmarks, MiniMax-M1-80k is comparable to DeepSeek-R1 and Qwen3-235B, with particular strengths in software engineering (56% on SWE-bench Verified), long-context understanding (73.4% on OpenAI-MRCR at 128k), and agentic tool use.
Key Ideas
- Hybrid attention architecture mixing softmax and lightning (linear) attention in 1:7 ratio enables efficient test-time compute scaling — at 100K generation length, M1 uses only 25% of DeepSeek R1’s FLOPs
- CISPO: clips importance sampling weights instead of token updates, preserving gradient contributions from all tokens including rare reasoning-critical tokens
- Precision mismatch fix: FP32 for the LM output head resolves training/inference probability divergence specific to hybrid architectures
- Diverse RL training data: math, logic (SynLogic), competitive programming, sandbox-based software engineering, and general tasks with generative reward models
- Curriculum strategy: start with rule-verifiable reasoning tasks, gradually mix in general domain tasks with model-based rewards
- Length scaling strategy: staged window expansion from 40K to 80K output tokens with empirical stability indicators
- Early truncation via repetition detection: halt generation when 3,000 consecutive tokens each have probability > 0.99
- Online monitoring and recalibration of generative reward models to prevent length bias exploitation
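The repetition-based early-truncation rule above is simple enough to sketch directly. A minimal illustration: the `window` and `threshold` defaults match the values quoted in the note, but the function itself is hypothetical, not the authors' implementation:

```python
def should_truncate(token_probs: list[float],
                    window: int = 3000,
                    threshold: float = 0.99) -> bool:
    """Halt generation once `window` consecutive tokens each have
    probability above `threshold` — a signature of degenerate
    repetition loops in long rollouts."""
    run = 0
    for p in token_probs:
        run = run + 1 if p > threshold else 0
        if run >= window:
            return True
    return False
```

In an RL rollout loop this check would run incrementally per generated token; a single low-probability token resets the counter, so ordinary confident generation is not cut off.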
Comments
This paper represents a significant engineering achievement in training a competitive open-weight reasoning model with a non-standard architecture. The hybrid softmax/linear attention approach is the key differentiator — while most reasoning models rely on full softmax attention, MiniMax-M1 demonstrates that efficient architectures can be competitive. The connection to linear attention research is direct, as lightning attention is an I/O-aware implementation of linear attention.
The CISPO algorithm addresses a real problem: standard PPO/GRPO clipping can suppress learning on rare but semantically important tokens. This is especially relevant for reasoning models where “fork” tokens that redirect the chain of thought are both rare and critical. The Switch transformer and other MoE models have shown parameter efficiency benefits, but MiniMax-M1 is notable for combining MoE with linear attention for reasoning.
The practical details around training stability — precision fixes, repetition detection, reward model bias mitigation — are valuable contributions for anyone scaling RL training on large models. The $534K training cost demonstrates that competitive reasoning models can be trained at accessible (relative to frontier labs) compute budgets.
Connections
- Related to Foundation models because M1 is a frontier open-weight foundation model for reasoning
- Related to Reinforcement learning because the paper’s core training methodology is large-scale RL with a novel algorithm (CISPO)
- Related to Transformers because the architecture is a hybrid variant of the transformer
- Related to Scaling laws because the paper demonstrates test-time compute scaling behavior with hybrid attention
- Related to Switch transformer because both use Mixture-of-Experts architectures for parameter efficiency
- Related to Transformers are RNNs (Katharopoulos 2020) because lightning attention builds on linear attention mechanisms
- Related to Coding agent because M1 is trained with sandbox-based software engineering RL environments
Bibliography
- MiniMax. 2025. "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention". https://arxiv.org/abs/2506.13585.