- tags: Continual learning, Catastrophic forgetting, Distillation, In-context learning, Large language models
- source: (Shenfeld et al. 2026)
Summary
This paper introduces Self-Distillation Fine-Tuning (SDFT), an on-policy alternative to supervised fine-tuning (SFT) for continual learning from expert demonstrations. The motivation is a known asymmetry in post-training: on-policy reinforcement learning reduces catastrophic forgetting but requires explicit reward functions, while SFT works from cheap demonstrations but is inherently off-policy and tends to overwrite prior capabilities. SDFT closes this gap by exploiting the model’s own in-context learning ability — the same LLM, conditioned on a demonstration \(c\), acts as a teacher producing \(\pi(\cdot \mid x, c)\), while the unconditioned model acts as a student producing \(\pi_\theta(\cdot \mid x)\). Training samples responses on-policy from the student and minimizes the reverse KL divergence \(D_{\mathrm{KL}}(\pi_\theta(\cdot \mid x) \,\|\, \pi(\cdot \mid x, c))\), yielding token-level on-policy updates without any external reward.
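The core loop is easy to picture. Below is a minimal sketch of one SDFT update, assuming a HuggingFace-style causal LM; the helper names, the model id, and all hyperparameters are illustrative choices rather than the paper's released code, and the loss is written in the per-token reverse-KL form described above.

```python
# Minimal sketch of one SDFT update, assuming a HuggingFace-style causal LM.
# Helper names, the model id, and hyperparameters are illustrative, not the
# paper's released code; the loss is the per-token reverse KL described above.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-7B-Instruct"                            # any base model with usable ICL
tok = AutoTokenizer.from_pretrained(name)
student = AutoModelForCausalLM.from_pretrained(name)         # pi_theta, being trained
teacher = AutoModelForCausalLM.from_pretrained(name).eval()  # frozen / EMA copy of the student
opt = torch.optim.AdamW(student.parameters(), lr=1e-6)

def response_log_dists(model, prefix_ids, y_ids):
    """Log next-token distributions the model assigns at each response position."""
    ids = torch.cat([prefix_ids, y_ids], dim=-1)
    # logits at positions prefix_len-1 .. end-1 are the ones predicting the response tokens
    logits = model(ids).logits[:, prefix_ids.shape[-1] - 1 : -1, :]
    return F.log_softmax(logits, dim=-1)                      # [batch, len(y), vocab]

def sdft_step(x_text, demo_text):
    x_ids = tok(x_text, return_tensors="pt").input_ids        # student sees only x
    cx_ids = tok(demo_text + "\n\n" + x_text, return_tensors="pt").input_ids  # teacher sees (c, x)

    # 1) sample a response on-policy from the unconditioned student
    with torch.no_grad():
        y_ids = student.generate(x_ids, do_sample=True, max_new_tokens=256)[:, x_ids.shape[-1]:]

    # 2) token-level reverse KL: KL(pi_theta(.|x, y<t) || pi(.|x, c, y<t)) at every step
    logp_s = response_log_dists(student, x_ids, y_ids)
    with torch.no_grad():
        logp_t = response_log_dists(teacher, cx_ids, y_ids)
    per_token_kl = (logp_s.exp() * (logp_s - logp_t)).sum(-1)  # [batch, len(y)]

    loss = per_token_kl.mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```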
A central theoretical contribution is the In-Context Assumption: a model conditioned on a demonstration approximates the unknown optimal post-update policy, \(\pi^*_{k+1}(y \mid x) \approx \pi(y \mid x, c)\). Under this assumption, SDFT is mathematically equivalent to on-policy RL with an implicit reward \(r(y, x, c) = \log \pi(y \mid x, c) - \log \pi(y \mid x)\) — placing the method in the inverse-reinforcement-learning family but with the reward extracted by in-context conditioning rather than learned from preference data or trajectory classifiers. Empirical validation on ToolAlpaca shows the demonstration-aware teacher reaches 100% reward and stays roughly half as far in KL from the base policy as the SFT-trained model (0.68 vs 1.26 nats), satisfying both optimality and minimal-deviation conditions.
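For reference, a short sketch of why this equivalence holds (a standard score-function manipulation, written here with the unconditioned term in the reward treated as a detached reference, not quoted verbatim from the paper):

\[
\nabla_\theta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi(\cdot \mid x, c)\big)
= \mathbb{E}_{y \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(y \mid x)\, \big(\log \pi_\theta(y \mid x) - \log \pi(y \mid x, c)\big) \right]
= -\,\mathbb{E}_{y \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(y \mid x)\; r(y, x, c) \right],
\]

using \(\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y \mid x)] = 0\) and reading the \(\log \pi(y \mid x)\) term in the reward as the current unconditioned policy held fixed; minimizing the reverse KL is therefore a REINFORCE-style ascent step on the implicit reward.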
Across three skill-learning tasks (Tool Use, Science Q&A, Medical reasoning) and a Knowledge Acquisition setting (Wikipedia articles on 2025 natural disasters, after the model’s knowledge cutoff), SDFT consistently dominates SFT on the new-task / prior-task Pareto frontier, beating CPT, SFT, SFT+Re-invoke, and offline distillation from the same teacher. In a sequential three-task experiment, SDFT accumulates Tool Use, Science Q&A, and Medical skills without regression, while SFT exhibits oscillatory forgetting. Gains scale with model size (Qwen2.5 3B → 7B → 14B widens the gap from −3.3 to +6.9 points on Science Q&A) and pass@k improves uniformly across \(k\) up to 128, ruling out entropy-collapse explanations. SDFT also enables training of reasoning models (Olmo-3-7B-Think) on answer-only datasets without collapsing chain-of-thought length: on the medical task, SFT degrades accuracy from 31.2% to 23.5%, while SDFT improves it to 43.7% and preserves response length.
Key Ideas
- SDFT objective: sample \(y \sim \pi_\theta(\cdot \mid x)\), minimize reverse KL to the demonstration-conditioned teacher; teacher uses an EMA of student weights to stabilize training.
- In-Context Assumption: the demonstration-conditioned policy \(\pi(\cdot \mid x, c)\) approximates the unknown optimal next policy \(\pi^*_{k+1}\) — verified empirically on ToolAlpaca (100% teacher accuracy, half the KL-from-base of SFT).
- Implicit IRL: the SDFT gradient is mathematically equivalent to on-policy policy-gradient RL with reward \(r(y,x,c) = \log \pi(y \mid x, c) - \log \pi(y \mid x)\), no reward model required.
- Token-level signal: the loss decomposes per-token, providing dense credit assignment compared to trajectory-level RL methods like GRPO.
- Continual-learning result: on a sequential 3-task curriculum, a single SDFT model accumulates skills without performance regression on previously learned tasks; SFT shows severe forgetting.
- Knowledge acquisition: on Wikipedia articles after the cutoff, SDFT reaches 89% strict / 100% lenient / 98% OOD accuracy, nearly matching oracle-RAG and far exceeding SFT (80/95/80) and CPT (9/37/7).
- Scale matters: gap to SFT widens with model size, since stronger ICL gives a higher-quality teacher signal; 3B is too weak to be its own teacher.
- Reasoning preservation: SDFT preserves long chain-of-thought when training on answer-only data because the demonstration-conditioned teacher still generates reasoning; SFT collapses CoT to match the short targets.
- Pass@k uniformity: gains over base/SFT are flat in \(k\) up to 128 — SDFT acquires new skills rather than sharpening existing ones.
- Cost: ~2.5× FLOPs and ~4× wall-clock vs SFT due to on-policy generation, but eliminates the need for a follow-up restoration phase like Re-invoke.
- Failure mode: student can inherit teacher’s framing artifacts (“Based on the example…”); masking the loss over the first few tokens is an effective heuristic (sketched, along with the EMA teacher update, after this list).
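Minimal sketches of the two implementation details referenced above, the EMA teacher update and the prefix loss mask; the decay value and mask length are illustrative assumptions, not values reported in the paper.

```python
# Sketches of the EMA teacher update and the prefix loss mask mentioned above;
# decay and n_masked are illustrative values, not the paper's.
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """teacher <- decay * teacher + (1 - decay) * student, parameter-wise."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1 - decay)

def masked_loss(per_token_kl, n_masked=8):
    """Drop the first few response tokens so the student does not copy the
    teacher's 'Based on the example...' framing into unconditioned generations."""
    return per_token_kl[:, n_masked:].mean()
```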
Comments
The conceptual contribution is more interesting than the algorithm. Framing in-context conditioning as a way to extract a per-instance reward from a demonstration is a clean unification of three previously separate ideas: in-context learning, on-policy distillation, and inverse RL. It also explains why SFT forgets: SFT trains on off-policy targets that may be far from the base policy, while SDFT explicitly enforces minimal deviation through the trust-region structure inherited from the IRL derivation.
The relationship to Hübotter et al.’s SDPO (RL via Self-Distillation) is striking — Shenfeld and Hübotter are coauthors here, and the two papers form a natural pair. SDPO uses rich tokenized feedback (e.g., runtime errors) as the conditioning signal \(c\) to extract a reward from a verifiable RL setting; SDFT uses demonstrations as the conditioning signal in a setting without rewards. Both reduce to on-policy reverse-KL distillation against a self-conditioned teacher. This suggests a more general “self-conditioned distillation” template where any auxiliary information that improves the model’s predictions can be turned into a training signal.
The knowledge-acquisition result is the most surprising finding. CPT (the classical recipe for ingesting new factual content) gets 9% strict accuracy; SDFT gets 89%, nearly matching oracle RAG. If this replicates, it would substantially change the post-training pipeline for keeping foundation models current — currently dominated by retrieval-augmentation precisely because parametric updates were thought to be inefficient.
The main caveat is the dependency on strong ICL: at 3B, SDFT underperforms SFT, so the approach is gated on serving large enough base models. Also, the “reasoning models without reasoning data” claim relies on the base model already producing CoT — SDFT preserves a behavior that already exists rather than inducing new behaviors.
Connections
- Closely related to Reinforcement Learning via Self-Distillation (Hübotter et al. 2026) — same author group (Shenfeld is a coauthor of both); SDPO is the RL/verifiable-reward analogue of SDFT.
- Related to Embarrassingly Simple Self-Distillation (Zhang et al. 2026) — another self-distillation approach, but offline, with no demonstration conditioning, and aimed at unlocking latent capability rather than continual learning.
- Related to LoRA Learns Less and Forgets Less (Biderman et al. 2024) — alternative mitigation of forgetting via parameter-efficient updates; SDFT shows on-policy distribution alignment is a complementary mechanism.
- Related to Drinking from a Firehose (Hu et al. 2020) — earlier large-scale continual-learning study in a different regime.
- Builds on the framework of reinforcement learning with verifiable rewards by removing the verifiability requirement; orthogonal to RLHF which uses preference data to learn the reward.
Bibliography
- Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal. 2026. "Self-Distillation Enables Continual Learning". https://arxiv.org/abs/2601.19897.