- tags
- Reinforcement learning, Distillation, Large language models, In-context learning
- source
- (Hübotter et al. 2026)
Summary
This paper introduces Self-Distillation Policy Optimization (SDPO), a new algorithm for post-training large language models with reinforcement learning. Current methods for RL with verifiable rewards (RLVR), such as GRPO, learn only from sparse scalar outcome rewards (e.g., pass/fail), creating a severe credit-assignment bottleneck. The authors formalize a more general setting called Reinforcement Learning with Rich Feedback (RLRF), where environments provide tokenized feedback (e.g., runtime errors or judge evaluations) that explains why an attempt failed.
SDPO’s core insight is that the same model can serve as both student and teacher: conditioned on the rich feedback, the model can retrospectively identify where its original attempt went wrong. SDPO re-evaluates the log-probabilities of the original rollout under this feedback-augmented “self-teacher” context, then minimizes the KL divergence between the student and self-teacher next-token distributions. This yields dense, logit-level credit assignment without any external teacher model or explicit reward model. The method can be implemented as a drop-in replacement for GRPO by simply swapping the advantage computation.
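As an illustrative sketch (toy logits and pure Python, not the paper's implementation), the dense signal can be pictured as a per-token KL divergence between the next-token distribution of the feedback-conditioned "self-teacher" pass and that of the plain student pass; positions where the two passes agree contribute nothing, while the position the self-teacher re-assesses carries the learning signal:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

# Toy rollout of 3 token positions over a 4-token vocabulary.
# student: policy conditioned on the prompt alone.
# teacher: the SAME policy conditioned on prompt + rich feedback.
# (Hypothetical numbers: the feedback only changes position 1.)
student = [[2.0, 1.0, 0.5, 0.1], [1.0, 1.0, 1.0, 1.0], [0.2, 2.5, 0.3, 0.1]]
teacher = [[2.0, 1.0, 0.5, 0.1], [3.0, 0.2, 0.1, 0.1], [0.2, 2.5, 0.3, 0.1]]

kls = [per_token_kl(t, s) for t, s in zip(teacher, student)]
# Positions 0 and 2 contribute ~0; position 1 carries a large signal --
# dense token-level credit instead of one scalar for the whole sequence.
```

Minimizing this KL pulls the student toward the self-teacher only where they disagree, which is exactly the logit-level credit assignment described above.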
Evaluated on scientific reasoning (SciKnowEval), tool use (ToolAlpaca), and competitive programming (LiveCodeBench v6), SDPO consistently outperforms strong GRPO baselines. On LCBv6, SDPO achieves 48.8% vs. GRPO’s 41.2% with Qwen3-8B, reaching GRPO’s final accuracy in 4x fewer generations. SDPO also produces substantially shorter, more concise reasoning traces (up to 11x shorter). The method’s effectiveness scales with model size, suggesting that self-teaching is an emergent capability. The paper further demonstrates Test-Time Self-Distillation, where SDPO applied at test time to individual hard questions accelerates solution discovery by 3x compared to best-of-k sampling.
Key Ideas
- RLRF formalization: Extends RLVR by allowing arbitrary tokenized feedback from the environment, not just scalar rewards
- Self-teacher mechanism: The same policy conditioned on feedback serves as its own teacher, leveraging in-context learning for retrospective error identification
- Dense credit assignment: SDPO assigns per-token logit-level advantages, unlike GRPO’s constant sequence-level advantages
- No external teacher needed: Unlike standard distillation, SDPO uses the model itself — the self-teacher improves during training, enabling bootstrapping from a weak initial model to a strong final one
- Works without rich feedback: Even in standard RLVR settings with only scalar rewards, SDPO outperforms GRPO by using successful rollouts as implicit feedback for failed attempts
- Concise reasoning: SDPO consistently produces shorter generations while achieving higher accuracy, suggesting it improves reasoning efficiency rather than just scaling response length
- Test-time self-distillation: SDPO can be applied at test time to specialize the model on a single hard question, compressing interaction context into model weights
- Emergent with scale: SDPO’s gains grow with model size, with self-teaching ability appearing to be an emergent phenomenon
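The "drop-in replacement" framing above can be made concrete with a toy comparison of advantage shapes. The sketch below contrasts a GRPO-style group-normalized scalar (broadcast over every token) with an SDPO-style per-token quantity; the `sdpo_style_advantages` helper and its numbers are hypothetical, showing only the shape of the signal, not the paper's exact estimator:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, seq_lens):
    """GRPO: one group-normalized scalar reward per rollout,
    broadcast to every token (constant sequence-level credit)."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0
    return [[(r - mu) / sigma] * n for r, n in zip(rewards, seq_lens)]

def sdpo_style_advantages(student_logps, teacher_logps):
    """SDPO-style sketch: a per-token value from the gap between the
    feedback-conditioned self-teacher's and the plain student's
    log-probabilities of the tokens actually generated."""
    return [[t - s for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(student_logps, teacher_logps)]

# GRPO over a group of two rollouts (pass/fail), lengths 4 and 3:
# every token in a rollout shares the same advantage.
grpo = grpo_advantages([1.0, 0.0], [4, 3])

# SDPO over one failed rollout of 3 tokens: the self-teacher agrees with
# the student everywhere except the middle token, which it now downweights
# as the mistake, so only that token receives a (negative) push.
sdpo = sdpo_style_advantages([[-0.1, -0.2, -0.1]], [[-0.1, -2.3, -0.1]])
```

Swapping the first computation for the second inside a GRPO-style update loop is the sense in which SDPO is a drop-in replacement.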
Comments
SDPO is a compelling contribution that elegantly bridges distillation and reinforcement learning. The key novelty is recognizing that LLMs’ in-context learning ability can be repurposed for credit assignment — the model can “see” its own mistakes when given feedback. This is conceptually related to self-training, but operates at the logit level rather than generating new samples.
The finding that SDPO produces much shorter reasoning traces is particularly interesting — it suggests that verbose “chain-of-thought” patterns in GRPO-trained models may be artifacts of poor credit assignment rather than genuine reasoning. The test-time self-distillation application is also novel, showing that RL-style learning can happen on individual questions at inference time.
A limitation is that SDPO’s advantage over GRPO diminishes for smaller models (below Qwen3-0.6B), which makes sense since the self-teacher relies on in-context learning ability. The paper is closely related to (Zhang et al. 2026), which also uses self-distillation for code generation but in an off-policy, SFT-based manner.
Connections
- Related to Reinforcement learning as it introduces a new policy gradient algorithm for LLM post-training
- Related to Distillation because the core mechanism is knowledge distillation, but from the model to itself rather than from a separate teacher
- Related to In-context learning as SDPO fundamentally relies on the model’s ability to learn from feedback in-context
- Related to Self-training as both involve a model improving from its own outputs, though SDPO operates at the logit level rather than generating new training samples
- Related to Notes on: Embarrassingly Simple Self-Distillation as both use self-distillation for improving code generation, but SDPO is on-policy and uses RL while SSD uses off-policy SFT
- Related to Reinforcement learning with human feedback as SDPO extends the RLVR paradigm that RLHF pioneered for LLM alignment
- Related to Large language models as the method is specifically designed for post-training LLMs
Bibliography
- Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, et al. 2026. "Reinforcement Learning via Self-Distillation". https://arxiv.org/abs/2601.20802.
- Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang. 2026. "Embarrassingly Simple Self-Distillation Improves Code Generation". https://arxiv.org/abs/2604.01193. See notes