- tags
- Reinforcement learning, Distillation, Large language models, In-context learning
- source
- (Hübotter et al. 2026)
Summary
This paper introduces Self-Distillation Policy Optimization (SDPO), a new algorithm for post-training large language models with reinforcement learning. Current methods for RL with verifiable rewards (RLVR), such as GRPO, learn only from sparse scalar outcome rewards (e.g., pass/fail), creating a severe credit-assignment bottleneck. The authors formalize a more general setting called Reinforcement Learning with Rich Feedback (RLRF), where environments provide tokenized feedback (e.g., runtime errors or judge evaluations) that explains why an attempt failed.
SDPO’s core insight is that the same model can serve as both student and teacher: conditioned on the rich feedback, the model can retrospectively identify where its original attempt went wrong. SDPO re-evaluates the log-probabilities of the original rollout under this feedback-augmented “self-teacher” context, then minimizes the KL divergence between the student and self-teacher next-token distributions. This yields dense, logit-level credit assignment without any external teacher model or explicit reward model. The method can be implemented as a drop-in replacement for GRPO by simply swapping the advantage computation.
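As an illustrative sketch (toy logits and pure Python, not the paper's implementation), the dense signal can be pictured as a per-token KL divergence between the next-token distribution of the feedback-conditioned "self-teacher" pass and that of the plain student pass; positions where the two passes agree contribute nothing, while the position the self-teacher re-assesses carries the learning signal:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

# Toy rollout of 3 token positions over a 4-token vocabulary.
# student: policy conditioned on the prompt alone.
# teacher: the SAME policy conditioned on prompt + rich feedback.
# (Hypothetical numbers: the feedback only changes position 1.)
student = [[2.0, 1.0, 0.5, 0.1], [1.0, 1.0, 1.0, 1.0], [0.2, 2.5, 0.3, 0.1]]
teacher = [[2.0, 1.0, 0.5, 0.1], [3.0, 0.2, 0.1, 0.1], [0.2, 2.5, 0.3, 0.1]]

kls = [per_token_kl(t, s) for t, s in zip(teacher, student)]
# Positions 0 and 2 contribute ~0; position 1 carries a large signal --
# dense token-level credit instead of one scalar for the whole sequence.
```

Minimizing this KL pulls the student toward the self-teacher only where they disagree, which is exactly the logit-level credit assignment described above.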
Evaluated on scientific reasoning (SciKnowEval), tool use (ToolAlpaca), and competitive programming (LiveCodeBench v6), SDPO consistently outperforms strong GRPO baselines. On LCBv6, SDPO achieves 48.8% vs. GRPO’s 41.2% with Qwen3-8B, reaching GRPO’s final accuracy in 4x fewer generations. SDPO also produces substantially shorter, more concise reasoning traces (up to 11x shorter). The method’s effectiveness scales with model size, suggesting that self-teaching is an emergent capability. The paper further demonstrates Test-Time Self-Distillation, where SDPO applied at test time to individual hard questions accelerates solution discovery by 3x compared to best-of-k sampling.
Key Ideas
- RLRF formalization: Extends RLVR by allowing arbitrary tokenized feedback from the environment, not just scalar rewards
- Self-teacher mechanism: The same policy conditioned on feedback serves as its own teacher, leveraging in-context learning for retrospective error identification
- Dense credit assignment: SDPO assigns per-token logit-level advantages, unlike GRPO’s constant sequence-level advantages
- No external teacher needed: Unlike standard distillation, SDPO uses the model itself — the self-teacher improves during training, enabling bootstrapping from a weak initial model to a strong final one
- Works without rich feedback: Even in standard RLVR settings with only scalar rewards, SDPO outperforms GRPO by using successful rollouts as implicit feedback for failed attempts
- Concise reasoning: SDPO consistently produces shorter generations while achieving higher accuracy, suggesting it improves reasoning efficiency rather than just scaling response length
- Test-time self-distillation: SDPO can be applied at test time to specialize the model on a single hard question, compressing interaction context into model weights
- Emergent with scale: SDPO’s gains grow with model size, with self-teaching ability appearing to be an emergent phenomenon
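The "drop-in replacement" framing above can be made concrete with a toy comparison of advantage shapes. The sketch below contrasts a GRPO-style group-normalized scalar (broadcast over every token) with an SDPO-style per-token quantity; the `sdpo_style_advantages` helper and its numbers are hypothetical, showing only the shape of the signal, not the paper's exact estimator:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, seq_lens):
    """GRPO: one group-normalized scalar reward per rollout,
    broadcast to every token (constant sequence-level credit)."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0
    return [[(r - mu) / sigma] * n for r, n in zip(rewards, seq_lens)]

def sdpo_style_advantages(student_logps, teacher_logps):
    """SDPO-style sketch: a per-token value from the gap between the
    feedback-conditioned self-teacher's and the plain student's
    log-probabilities of the tokens actually generated."""
    return [[t - s for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(student_logps, teacher_logps)]

# GRPO over a group of two rollouts (pass/fail), lengths 4 and 3:
# every token in a rollout shares the same advantage.
grpo = grpo_advantages([1.0, 0.0], [4, 3])

# SDPO over one failed rollout of 3 tokens: the self-teacher agrees with
# the student everywhere except the middle token, which it now downweights
# as the mistake, so only that token receives a (negative) push.
sdpo = sdpo_style_advantages([[-0.1, -0.2, -0.1]], [[-0.1, -2.3, -0.1]])
```

Swapping the first computation for the second inside a GRPO-style update loop is the sense in which SDPO is a drop-in replacement.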
Comments
SDPO is a compelling contribution that elegantly bridges distillation and reinforcement learning. The key novelty is recognizing that LLMs’ in-context learning ability can be repurposed for credit assignment — the model can “see” its own mistakes when given feedback. This is conceptually related to self-training, but operates at the logit level rather than generating new samples.
The finding that SDPO produces much shorter reasoning traces is particularly interesting — it suggests that verbose “chain-of-thought” patterns in GRPO-trained models may be artifacts of poor credit assignment rather than genuine reasoning. The test-time self-distillation application is also novel, showing that RL-style learning can happen on individual questions at inference time.
A limitation is that SDPO’s advantage over GRPO diminishes for smaller models (below Qwen3-0.6B), which makes sense since the self-teacher relies on in-context learning ability. The paper is closely related to (Zhang et al. 2026), which also uses self-distillation for code generation but in an off-policy, SFT-based manner.
Connections
- Related to Reinforcement learning as it introduces a new policy gradient algorithm for LLM post-training
- Related to Distillation because the core mechanism is knowledge distillation, but from the model to itself rather than from a separate teacher
- Related to In-context learning as SDPO fundamentally relies on the model’s ability to learn from feedback in-context
- Related to Self-training as both involve a model improving from its own outputs, though SDPO operates at the logit level rather than generating new training samples
- Related to Notes on: Embarrassingly Simple Self-Distillation as both use self-distillation for improving code generation, but SDPO is on-policy and uses RL while SSD uses off-policy SFT
- Related to Reinforcement learning with human feedback as SDPO extends the RLVR paradigm that RLHF pioneered for LLM alignment
- Related to Large language models as the method is specifically designed for post-training LLMs
Bibliography
- Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, et al. 2026. "Reinforcement Learning via Self-Distillation". https://arxiv.org/abs/2601.20802.
- Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang. 2026. "Embarrassingly Simple Self-Distillation Improves Code Generation". https://arxiv.org/abs/2604.01193. See notes