- tags
- Knowledge distillation, Language modeling, Program synthesis, Large language models
- source
- (Zhang et al. 2026)
Summary
This paper introduces simple self-distillation (SSD), a method where an LLM improves its own code generation by sampling solutions from itself with specific temperature and truncation settings, then fine-tuning on those raw, unverified samples using standard supervised fine-tuning (cross-entropy loss). Crucially, SSD requires no external teacher model, no verifier, no execution environment, no reward model, and no reinforcement learning — only the model itself and a set of problem prompts.
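A toy sketch of why plain cross-entropy on raw, unverified samples can work at all (my illustration, not the paper's implementation): for a single next-token position modeled as a categorical distribution, the cross-entropy minimizer over a sample set is the empirical distribution, so fine-tuning absorbs the temperature/truncation-reshaped sampling distribution rather than any notion of correctness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the LLM's next-token distribution at one position
# (a single categorical; real SSD operates over full code sequences).
base = np.array([0.55, 0.25, 0.12, 0.05, 0.03])

def sample_dist(p, T, top_k):
    """The sampling distribution under temperature T and top-k truncation."""
    q = p ** (1.0 / T)                 # temperature scaling
    cutoff = np.sort(q)[-top_k]
    q = np.where(q >= cutoff, q, 0.0)  # truncation
    return q / q.sum()

# Step 1: draw raw, unverified samples from the model itself.
target = sample_dist(base, T=1.2, top_k=3)
draws = rng.choice(len(base), size=50_000, p=target)

# Step 2: "fine-tune" with cross-entropy on those samples. For a categorical,
# minimizing cross-entropy is just matching empirical frequencies.
tuned = np.bincount(draws, minlength=len(base)) / len(draws)

print(np.round(tuned, 3))  # close to `target`: the model absorbs the reshaped distribution
```

No verifier appears anywhere in the loop; the only "teaching signal" is the gap between the base distribution and its tempered, truncated version.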
SSD achieves substantial gains: Qwen3-30B-Instruct improves from 42.4% to 55.3% pass@1 on LiveCodeBench v6, a +30% relative improvement. The method generalizes across both Qwen and Llama model families at 4B, 8B, and 30B scale, including both instruct and thinking variants. Gains concentrate on harder problems, and pass@5 often improves more than pass@1, suggesting SSD preserves generation diversity rather than collapsing to a single mode.
The paper provides a compelling mechanistic explanation through the precision-exploration conflict: code generation interleaves “lock” positions (where the correct token is clear and distractors must be suppressed) and “fork” positions (where multiple valid algorithmic approaches exist and diversity should be preserved). A single global decoding temperature cannot satisfy both simultaneously. SSD resolves this by reshaping token distributions in a context-dependent way — performing support compression (trimming diffuse tails) and within-support reshaping (redistributing mass among viable tokens). The authors validate this through toy simulations, real-model analysis on Qwen3-30B-Instruct, and a formal theoretical decomposition of the SSD objective into three interpretable terms (Equation 4).
Key Ideas
- SSD samples from the base model at training-time temperature \(T_{\text{train}}\) with truncation (top-\(k\), top-\(p\)), then fine-tunes on those raw outputs — no filtering by correctness
- A single sample per prompt (N = 1) already suffices for strong gains
- The precision-exploration conflict: locks demand low temperature (suppress distractors), forks demand high temperature (preserve alternatives) — no global setting can satisfy both
- SSD induces support compression (via truncation \(\rho_{\text{train}}\)) and within-support reshaping (via temperature \(T_{\text{train}}\)), which together sharpen locks into spikes and flatten forks into plateaus
- Training and evaluation temperatures compose through an effective temperature \(T_{\text{eff}} = T_{\text{train}} \cdot T_{\text{eval}}\), with a quadratic peak near \(T_{\text{eff}} \approx 1.2\)
- Even when training data is 62% gibberish (at \(T_{\text{train}} = 2.0\) without truncation), SSD still improves the model — the signal comes from distribution reshaping, not from training on correct code
- Decode-only tuning cannot match SSD because it is constrained by the base model’s existing token ranking and cumulative probability curves
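The multiplicative composition \(T_{\text{eff}} = T_{\text{train}} \cdot T_{\text{eval}}\) follows from an idealization: if fine-tuning makes the model exactly match its own \(T_{\text{train}}\)-scaled sampling distribution (ignoring truncation), then sampling from it at \(T_{\text{eval}}\) is a single scaling at the product. A minimal NumPy check of that identity:

```python
import numpy as np

def temper(p, T):
    """Scale a categorical distribution by temperature T and renormalize."""
    q = p ** (1.0 / T)
    return q / q.sum()

p = np.array([0.5, 0.3, 0.15, 0.05])  # arbitrary illustrative distribution
T_train, T_eval = 1.5, 0.8

composed = temper(temper(p, T_train), T_eval)  # train-time shift, then eval-time sampling
direct   = temper(p, T_train * T_eval)         # one shift at T_eff

print(np.allclose(composed, direct))  # True: temperatures compose multiplicatively
```

The identity holds because \((p^{1/T_{\text{train}}})^{1/T_{\text{eval}}} = p^{1/(T_{\text{train}} T_{\text{eval}})}\) up to normalization; truncation breaks it only at the trimmed tail.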
Comments
A surprisingly elegant result. The key insight — that models contain latent capability unrealized under any fixed decoding configuration — reframes self-improvement as unlocking existing knowledge rather than acquiring new knowledge. The precision-exploration conflict is a particularly clean abstraction that may apply beyond code generation to any structured generation task with a mix of constrained and open-ended positions.
The “bad data, good results” experiment (Section 4.4) is perhaps the most striking finding: training on mostly gibberish still improves the model. This strongly supports the claim that the benefit comes from distribution reshaping rather than learning from correct solutions, which distinguishes SSD from typical self-training approaches that rely on data quality.
The connection to Knowledge distillation is interesting: unlike traditional distillation where a larger teacher transfers knowledge to a smaller student, here the model is both teacher and student, with the “teaching signal” coming from the temperature-shifted sampling distribution rather than from a more capable model. The theoretical decomposition (Equation 4) into support compression, within-support reshaping, and base-model alignment terms makes this precise.
It would be interesting to see whether this approach transfers to other structured generation domains beyond code, and whether the lock/fork distinction holds for natural language generation tasks where correctness is less binary.
Connections
- Related to Knowledge distillation because SSD is a form of self-distillation where the model distills from its own temperature-shifted distribution
- Related to Language modeling because the method uses standard cross-entropy fine-tuning and the analysis centers on next-token distributions
- Related to Program synthesis because the evaluation domain is competitive programming / code generation
- Related to Reinforcement learning as a contrast — SSD explicitly avoids RL, reward models, and verifiers while achieving comparable improvements
- Related to Large language models because the method is evaluated across multiple LLM families (Qwen, Llama) at various scales
Bibliography
- Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang. 2026. "Embarrassingly Simple Self-Distillation Improves Code Generation". https://arxiv.org/abs/2604.01193.