Three recent papers (MiniMax-M1's CISPO, Zhang et al.'s SSD, and Hübotter et al.'s SDPO) converge on a shared structural observation: not all tokens in a reasoning trace are equally important for learning, and naive uniform treatment of tokens is a core failure mode of current training methods.
The fork/filler distinction
All three papers implicitly or explicitly distinguish between two kinds of positions in generated sequences:
- Fork tokens: rare but critical positions where the model makes a genuine reasoning decision (redirecting a chain of thought, choosing an algorithm, catching an error). Examples: "However", "Wait", "Recheck" in CoT traces, or algorithmic branching points in code.
- Filler/lock tokens: high-frequency, low-information positions where the correct continuation is largely determined by context (boilerplate reasoning steps, syntactic completions, connective phrases).
Standard reinforcement learning methods like PPO and GRPO assign uniform or near-uniform credit across all tokens in a successful rollout, which means filler tokens receive the same reinforcement as fork tokens. This leads to verbose, inefficient reasoning traces and undertrained decision points.
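The contrast can be made concrete with a toy rollout. This is purely illustrative (the token strings and numbers are invented, not taken from any of the papers): flat sequence-level credit reinforces fillers exactly as much as the fork, while idealized token-level credit concentrates the signal on the decision point.

```python
# Hypothetical 6-token rollout: one "fork" token (index 2) and five fillers.
# Token strings and numbers are illustrative, not drawn from any of the papers.
tokens = ["First,", "compute", "However,", "the", "answer", "is"]
fork_index = 2

# GRPO-style flat credit: every token inherits the same sequence-level advantage.
sequence_advantage = 1.0  # the rollout succeeded
flat_credit = [sequence_advantage] * len(tokens)

# Idealized token-level credit: concentrate the signal on the decision point.
dense_credit = [0.1] * len(tokens)
dense_credit[fork_index] = 1.0

# Under flat credit, a filler is reinforced exactly as much as the fork.
assert flat_credit[0] == flat_credit[fork_index]
# Under dense credit, the fork dominates the total gradient signal.
assert dense_credit[fork_index] > sum(dense_credit) - dense_credit[fork_index]
```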
Three solutions at different levels
Each paper addresses this problem at a different level of the training pipeline:
CISPO (MiniMax 2025)
MiniMax-M1 introduces Clipped IS-weight Policy Optimization, which clips importance sampling weights at the sequence level rather than clipping token-level policy ratios as in PPO/GRPO.
The motivation is that fork tokens are rare and undergo the largest policy shifts, but are also the positions most suppressed by token-level clipping. By moving clipping to the sequence level, CISPO preserves full gradient flow through these critical positions. This matters especially in their hybrid softmax/linear attention architecture, where the problem is amplified.
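The gradient mechanics can be sketched at a single token position. Under my reading of the idea (a simplified sketch, not MiniMax's code): PPO's clipped surrogate contributes zero gradient at a token whose ratio leaves the trust region, whereas clipping the importance weight itself and treating it as a constant (stop-gradient) keeps a bounded but nonzero gradient flowing. All numbers and epsilon values below are illustrative.

```python
import math

def ppo_token_grad(logp_new, logp_old, adv, eps=0.2):
    """d/d(logp_new) of PPO's clipped surrogate at one token.
    When the clipped branch is active, the token contributes zero gradient."""
    ratio = math.exp(logp_new - logp_old)
    if (adv >= 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
        return 0.0  # clipped: the surrogate is constant w.r.t. logp_new
    return ratio * adv  # d(ratio * adv)/d(logp_new) = ratio * adv

def cispo_token_grad(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.2):
    """Same derivative for a CISPO-style objective: clip the IS weight,
    detach it (stop-gradient), and scale the REINFORCE term with it."""
    ratio = math.exp(logp_new - logp_old)
    w = min(max(ratio, 1 - eps_low), 1 + eps_high)  # clipped, detached
    return w * adv  # gradient of w * adv * logp_new w.r.t. logp_new

# A fork token with a large policy shift (ratio = e ≈ 2.72), positive advantage:
assert ppo_token_grad(0.0, -1.0, 1.0) == 0.0   # fork update is killed
assert cispo_token_grad(0.0, -1.0, 1.0) > 0.0  # bounded update survives
```

The key line is the detached clip: the weight bounds the update size, but the log-probability term still carries gradient through every token, including the rare forks with large policy shifts.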
SSD (Zhang et al. 2026)
Simple Self-Distillation approaches the problem from the distribution-reshaping side. By sampling from the model at a shifted temperature with truncation, then fine-tuning on those samples, SSD performs support compression at lock/filler positions (sharpening diffuse tails into spikes) and within-support reshaping at fork positions (redistributing mass among viable alternatives).
This is a form of implicit token-level credit assignment: the training signal is concentrated where the distribution actually needs to change (forks), not where it is already peaked (fillers). Strikingly, training on data that is 62% gibberish still improves the model, which suggests the benefit comes from distribution geometry rather than content quality.
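The two effects can be demonstrated on toy distributions. The sketch below uses nucleus-style truncation plus a temperature below 1; the parameter values and distributions are illustrative, and SSD's exact sampling settings may differ.

```python
def reshape(p, temperature=0.8, top_p=0.95):
    """Toy temperature-shifted, truncated sampling distribution
    (illustrative parameters; SSD's exact settings may differ)."""
    # Nucleus-style truncation: keep the smallest set of tokens whose
    # cumulative mass reaches top_p, zero out the tail.
    order = sorted(range(len(p)), key=lambda i: -p[i])
    kept, cum = set(), 0.0
    for i in order:
        if cum >= top_p:
            break
        kept.add(i)
        cum += p[i]
    # Temperature < 1 sharpens whatever survives truncation.
    q = [p[i] ** (1.0 / temperature) if i in kept else 0.0 for i in range(len(p))]
    z = sum(q)
    return [x / z for x in q]

# "Filler" position: one dominant continuation plus a diffuse tail.
filler = [0.90, 0.04, 0.03, 0.02, 0.01]
# "Fork" position: several viable alternatives.
fork = [0.40, 0.35, 0.20, 0.04, 0.01]

# Support compression at the filler: mass collapses further onto the peak.
assert reshape(filler)[0] > filler[0]
# Within-support reshaping at the fork: the viable options all survive,
# with mass redistributed among them.
assert sum(1 for x in reshape(fork) if x > 0) == 3
```

Fine-tuning on samples drawn from the reshaped distribution then bakes this geometry back into the model's own logits.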
SDPO (Hübotter et al. 2026)
Self-Distillation Policy Optimization is the most direct solution. By conditioning the model on rich feedback (error messages, judge evaluations) and using it as its own teacher via in-context learning, SDPO produces dense per-token logit-level advantages. The self-teacher can retrospectively identify exactly which tokens in a failed rollout were wrong, replacing GRPO’s flat sequence-level advantage with targeted credit.
The result: SDPO produces reasoning traces up to 11x shorter while improving accuracy, strong evidence that verbose CoT in GRPO-trained models is an artifact of poor credit assignment rather than genuine reasoning.
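The retrospective-credit idea can be sketched in a few lines. This is my paraphrase of the core mechanism, not the paper's exact objective: the per-token advantage is how much more probable the feedback-conditioned self-teacher makes each generated token than the current policy does. All log-probabilities below are invented for illustration.

```python
def dense_advantages(student_logp, teacher_logp):
    """Sketch of SDPO-style dense credit (a paraphrase, not the paper's
    exact objective): per-token advantage = how much the feedback-conditioned
    self-teacher's log-probability exceeds the current policy's."""
    return [t - s for s, t in zip(student_logp, teacher_logp)]

# Hypothetical failed 5-token rollout. After conditioning on the error
# message, the self-teacher downgrades token 2 (the actual mistake) and
# leaves the rest essentially unchanged. All numbers are illustrative.
student = [-0.5, -0.7, -0.4, -0.6, -0.5]
teacher = [-0.5, -0.7, -3.0, -0.6, -0.5]

adv = dense_advantages(student, teacher)
# GRPO would assign one flat negative advantage to every token here;
# the self-teacher isolates the single wrong token instead.
assert adv[2] < 0 and all(a == 0.0 for i, a in enumerate(adv) if i != 2)
```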
Comparison
| Method | Level | Mechanism | Requires |
|---|---|---|---|
| CISPO | Sequence-level clipping | Preserve rare fork gradients | Standard rewards |
| SSD | Distribution reshaping | Compress fillers, reshape forks via temperature | Only the model + prompts |
| SDPO | Per-token advantages | Dense credit via retrospective self-evaluation | Rich feedback (errors, judges) |
SDPO is the most principled on the credit-assignment axis: it doesn't just preserve fork signals (CISPO) or implicitly compress fillers (SSD); it directly identifies which tokens matter. The cost is the need for structured feedback, whereas CISPO and SSD work with standard scalar rewards or no rewards at all.
Implications
The convergence of these three independent approaches suggests that token-level credit assignment in reasoning traces is a fundamental bottleneck in current LLM post-training. The verbose, meandering chains of thought produced by GRPO-trained models may largely be artifacts of uniform reinforcement on filler tokens, not a necessary feature of reasoning. Methods that differentiate between forks and fillers, whether through clipping strategy, distribution reshaping, or dense advantages, will consistently produce shorter, more accurate reasoning.
This also re-frames distillation and self-training more broadly: the value of these methods may lie less in transferring knowledge and more in reshaping probability distributions to allocate model capacity toward decision points rather than filler.
Bibliography
- MiniMax. 2025. "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention". https://arxiv.org/abs/2506.13585. See notes
- Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang. 2026. "Embarrassingly Simple Self-Distillation Improves Code Generation". https://arxiv.org/abs/2604.01193. See notes
- Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, et al. 2026. "Reinforcement Learning via Self-Distillation". https://arxiv.org/abs/2601.20802. See notes