- tags
- Foundation models, Reinforcement learning, Transformers, Scaling laws
- source
- (MiniMax 2025)
Summary
MiniMax-M1 is the first open-weight, large-scale reasoning model built on a hybrid attention architecture that combines softmax attention with lightning attention (an I/O-aware implementation of linear attention). The model uses a Mixture-of-Experts (MoE) design with 456 billion total parameters (45.9B activated per token) and natively supports a 1 million token context length — 8x that of DeepSeek R1. The hybrid design interleaves softmax attention blocks with linear attention (TransNormer-style) blocks in a 1:7 ratio, enabling near-linear scaling of inference FLOPs with generation length.
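The 1:7 interleaving can be sketched as a simple layout rule. This is a minimal illustration of the ratio described above; the exact block ordering is an assumption, not taken from the released model config:

```python
def block_types(num_blocks: int, softmax_every: int = 8) -> list[str]:
    """Illustrative hybrid layout: one softmax attention block per
    `softmax_every` blocks, the rest linear (lightning) attention,
    giving the 1:7 softmax-to-linear ratio described in the summary."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "linear"
        for i in range(num_blocks)
    ]

layout = block_types(16)
print(layout.count("softmax"), layout.count("linear"))  # 2 14
```

Because only every eighth block pays the quadratic softmax cost, total attention FLOPs grow close to linearly in sequence length.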
The model is trained through continual pretraining on 7.5T tokens, supervised fine-tuning to instill chain-of-thought reasoning patterns, and large-scale reinforcement learning. A key contribution is CISPO (Clipped IS-weight Policy Optimization), a novel RL algorithm that clips importance sampling weights rather than token-level policy updates as in PPO/GRPO. This preserves gradient contributions from rare but important reasoning tokens (e.g., “However”, “Wait”, “Recheck”) that PPO-style clipping would otherwise drop, a loss the authors found especially problematic in hybrid-attention architectures. CISPO achieves a 2x speedup over DAPO on AIME 2024.
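The difference between PPO-style token clipping and CISPO's IS-weight clipping shows up in the per-token gradient coefficient, i.e., the factor multiplying grad(log pi) for each token. A hedged sketch (the function names and the symmetric clip bounds are illustrative, not the paper's exact hyperparameters):

```python
def ppo_grad_coeff(ratio: float, adv: float, eps: float = 0.2) -> float:
    """Coefficient on grad(log pi) under PPO/GRPO token clipping.
    When the ratio is clipped on the beneficial side, the clipped term
    is constant in the parameters, so the token's gradient is exactly
    zero and the update is dropped."""
    clipped_out = (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps)
    return 0.0 if clipped_out else ratio * adv

def cispo_grad_coeff(ratio: float, adv: float,
                     eps_low: float = 0.2, eps_high: float = 0.2) -> float:
    """CISPO clips the importance-sampling weight itself (treated as a
    stop-gradient constant), so the coefficient is bounded but never
    zeroed out for a nonzero advantage: every token keeps contributing."""
    return max(1 - eps_low, min(1 + eps_high, ratio)) * adv
```

A rare “fork” token whose ratio overshoots (say ratio = 1.5 with positive advantage) gets coefficient 0 under PPO clipping but a bounded nonzero coefficient under CISPO, which is exactly the failure mode the summary describes.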
The full RL training run completed in 3 weeks on 512 H800 GPUs at a cost of roughly $534K. Two model variants are released, with 40K and 80K thinking budgets. On benchmarks, MiniMax-M1-80k is comparable to DeepSeek-R1 and Qwen3-235B, with particular strengths in software engineering (56% on SWE-bench Verified), long-context understanding (73.4% on OpenAI-MRCR at 128k), and agentic tool use.
Key Ideas
- Hybrid attention architecture mixing softmax and lightning (linear) attention in 1:7 ratio enables efficient test-time compute scaling — at 100K generation length, M1 uses only 25% of DeepSeek R1’s FLOPs
- CISPO: clips importance sampling weights instead of token updates, preserving gradient contributions from all tokens including rare reasoning-critical tokens
- Precision mismatch fix: FP32 for the LM output head resolves training/inference probability divergence specific to hybrid architectures
- Diverse RL training data: math, logic (SynLogic), competitive programming, sandbox-based software engineering, and general tasks with generative reward models
- Curriculum strategy: start with rule-verifiable reasoning tasks, gradually mix in general domain tasks with model-based rewards
- Length scaling strategy: staged window expansion from 40K to 80K output tokens with empirical stability indicators
- Early truncation via repetition detection: halt generation when 3,000 consecutive tokens each have probability > 0.99
- Online monitoring and recalibration of generative reward models to prevent length bias exploitation
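The repetition-based early-truncation rule above is simple enough to sketch directly. A minimal illustration: the `window` and `threshold` defaults match the values quoted in the note, but the function itself is hypothetical, not the authors' implementation:

```python
def should_truncate(token_probs: list[float],
                    window: int = 3000,
                    threshold: float = 0.99) -> bool:
    """Halt generation once `window` consecutive tokens each have
    probability above `threshold` — a signature of degenerate
    repetition loops in long rollouts."""
    run = 0
    for p in token_probs:
        run = run + 1 if p > threshold else 0
        if run >= window:
            return True
    return False
```

In an RL rollout loop this check would run incrementally per generated token; a single low-probability token resets the counter, so ordinary confident generation is not cut off.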
Comments
This paper represents a significant engineering achievement in training a competitive open-weight reasoning model with a non-standard architecture. The hybrid softmax/linear attention approach is the key differentiator — while most reasoning models rely on full softmax attention, MiniMax-M1 demonstrates that efficient architectures can be competitive. The connection to linear attention research is direct, as lightning attention is an I/O-aware implementation of linear attention.
The CISPO algorithm addresses a real problem: standard PPO/GRPO clipping can suppress learning on rare but semantically important tokens. This is especially relevant for reasoning models where “fork” tokens that redirect the chain of thought are both rare and critical. The Switch transformer and other MoE models have shown parameter efficiency benefits, but MiniMax-M1 is notable for combining MoE with linear attention for reasoning.
The practical details around training stability — precision fixes, repetition detection, reward model bias mitigation — are valuable contributions for anyone scaling RL training on large models. The $534K training cost demonstrates that competitive reasoning models can be trained at accessible (relative to frontier labs) compute budgets.
Connections
- Related to Foundation models because M1 is a frontier open-weight foundation model for reasoning
- Related to Reinforcement learning because the paper’s core training methodology is large-scale RL with a novel algorithm (CISPO)
- Related to Transformers because the architecture is a hybrid variant of the transformer
- Related to Scaling laws because the paper demonstrates test-time compute scaling behavior with hybrid attention
- Related to Switch transformer because both use Mixture-of-Experts architectures for parameter efficiency
- Related to Transformers are RNNs (Katharopoulos 2020) because lightning attention builds on linear attention mechanisms
- Related to Coding agent because M1 is trained with sandbox-based software engineering RL environments
Bibliography
- MiniMax. 2025. "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention". https://arxiv.org/abs/2506.13585.