Self-Distillation Enables Continual Learning by Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal (2026)

This note was initially drafted with LLM assistance. Generated notes are periodically reviewed and revised by the author.
tags
Continual learning, Catastrophic forgetting, Distillation, In-context learning, Large language models
source
(Shenfeld et al. 2026)

Summary

This paper introduces Self-Distillation Fine-Tuning (SDFT), an on-policy alternative to supervised fine-tuning (SFT) for continual learning from expert demonstrations. The motivation is a known asymmetry in post-training: on-policy reinforcement learning reduces catastrophic forgetting but requires explicit reward functions, while SFT works from cheap demonstrations but is inherently off-policy and tends to overwrite prior capabilities. SDFT closes this gap by exploiting the model’s own in-context learning ability — the same LLM, conditioned on a demonstration \(c\), acts as a teacher producing \(\pi(\cdot \mid x, c)\), while the unconditioned model acts as a student producing \(\pi_\theta(\cdot \mid x)\). Training samples responses on-policy from the student and minimizes the reverse KL divergence \(D_{\mathrm{KL}}(\pi_\theta(\cdot \mid x) \,\|\, \pi(\cdot \mid x, c))\), yielding token-level on-policy updates without any external reward.
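
Concretely, here is a minimal sketch of one SDFT step, assuming a Hugging Face causal LM and the common full-distribution per-token estimator of the reverse KL on sampled responses. The model name, the plain-text prompt format, and the estimator details are my assumptions rather than the paper's, and the EMA teacher is simplified to a frozen copy:

```python
# Minimal sketch of one SDFT update (illustrative, not the authors' code).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"          # any causal LM; the name is an assumption
tok = AutoTokenizer.from_pretrained(model_name)
student = AutoModelForCausalLM.from_pretrained(model_name)
teacher = AutoModelForCausalLM.from_pretrained(model_name)   # EMA copy in practice
teacher.requires_grad_(False)

def sdft_loss(prompt: str, demonstration: str) -> torch.Tensor:
    # 1) Sample a response on-policy from the *unconditioned* student.
    x_ids = tok(prompt, return_tensors="pt").input_ids
    gen = student.generate(x_ids, do_sample=True, max_new_tokens=128)
    y_ids = gen[:, x_ids.shape[1]:]                      # generated tokens only

    # 2) Student scores y given x; teacher scores y given (c, x).
    cx_ids = tok(demonstration + "\n" + prompt, return_tensors="pt").input_ids
    s_logits = student(torch.cat([x_ids, y_ids], dim=1)).logits
    with torch.no_grad():
        t_logits = teacher(torch.cat([cx_ids, y_ids], dim=1)).logits

    # Logits at position i predict token i+1, so slice the spans that score y.
    s_pred = s_logits[:, x_ids.shape[1] - 1 : -1]         # student's dists over y
    t_pred = t_logits[:, cx_ids.shape[1] - 1 : -1]        # teacher's dists over y

    # 3) Token-level reverse KL  D_KL(pi_theta(.|x) || pi(.|x,c)), averaged over y.
    s_logp = F.log_softmax(s_pred, dim=-1)
    t_logp = F.log_softmax(t_pred, dim=-1)
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)       # shape [batch, |y|]
    return kl.mean()
```

Sampling \(y\) from the unconditioned student keeps the update on-policy; the demonstration enters only through the teacher's conditioning.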

A central theoretical contribution is the In-Context Assumption: a model conditioned on a demonstration approximates the unknown optimal post-update policy, \(\pi^*_{k+1}(y \mid x) \approx \pi(y \mid x, c)\). Under this assumption, SDFT is mathematically equivalent to on-policy RL with an implicit reward \(r(y, x, c) = \log \pi(y \mid x, c) - \log \pi(y \mid x)\) — placing the method in the inverse-reinforcement-learning family but with the reward extracted by in-context conditioning rather than learned from preference data or trajectory classifiers. Empirical validation on ToolAlpaca shows the demonstration-aware teacher reaches 100% reward and stays roughly half as far in KL from the base policy as the SFT-trained model (0.68 vs 1.26 nats), satisfying both optimality and minimal-deviation conditions.
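
Spelling out that equivalence (a one-line algebraic check using the unconditioned policy \(\pi(\cdot \mid x)\) as the reference; the paper's full derivation may differ in detail):

\[
-\,D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi(\cdot \mid x, c)\big)
= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[\underbrace{\log \pi(y \mid x, c) - \log \pi(y \mid x)}_{r(y,\,x,\,c)}\Big]
\;-\; D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi(\cdot \mid x)\big),
\]

so minimizing the SDFT loss maximizes the implicit reward under a KL penalty toward the unconditioned policy, the trust-region structure referred to in the comments below.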

Across three skill-learning tasks (Tool Use, Science Q&A, Medical reasoning) and a Knowledge Acquisition setting (Wikipedia articles on 2025 natural disasters, published after the model’s knowledge cutoff), SDFT consistently dominates on the new-task / prior-task Pareto frontier, beating CPT, SFT, SFT+Re-invoke, and offline distillation from the same teacher. In a sequential three-task experiment, SDFT accumulates Tool Use, Science Q&A, and Medical skills without regression, while SFT exhibits oscillatory forgetting. Gains scale with model size (Qwen2.5 3B → 7B → 14B widens the gap from −3.3 to +6.9 points on Science Q&A), and pass@k improves uniformly across \(k\) up to 128, ruling out an entropy-collapse explanation. SDFT also enables training reasoning models (Olmo-3-7B-Think) on answer-only datasets without collapsing chain-of-thought length: on the medical task, SFT degrades accuracy from 31.2% to 23.5%, while SDFT improves it to 43.7% and preserves response length.

Key Ideas

  • SDFT objective: sample \(y \sim \pi_\theta(\cdot \mid x)\), minimize reverse KL to the demonstration-conditioned teacher; teacher uses an EMA of student weights to stabilize training.
  • In-Context Assumption: the demonstration-conditioned policy \(\pi(\cdot \mid x, c)\) approximates the unknown optimal next policy \(\pi^*_{k+1}\) — verified empirically on ToolAlpaca (100% teacher accuracy, half the KL-from-base of SFT).
  • Implicit IRL: the SDFT gradient is mathematically equivalent to on-policy policy-gradient RL with reward \(r(y,x,c) = \log \pi(y \mid x, c) - \log \pi(y \mid x)\), no reward model required.
  • Token-level signal: the loss decomposes per-token, providing dense credit assignment compared to trajectory-level RL methods like GRPO.
  • Continual-learning result: on a sequential 3-task curriculum, a single SDFT model accumulates skills without performance regression on previously learned tasks; SFT shows severe forgetting.
  • Knowledge acquisition: on Wikipedia articles after the cutoff, SDFT reaches 89% strict / 100% lenient / 98% OOD accuracy, nearly matching oracle-RAG and far exceeding SFT (80/95/80) and CPT (9/37/7).
  • Scale matters: gap to SFT widens with model size, since stronger ICL gives a higher-quality teacher signal; 3B is too weak to be its own teacher.
  • Reasoning preservation: SDFT preserves long chain-of-thought when training on answer-only data because the demonstration-conditioned teacher still generates reasoning; SFT collapses CoT to match the short targets.
  • Pass@k uniformity: gains over base/SFT are flat in \(k\) up to 128 — SDFT acquires new skills rather than sharpening existing ones.
  • Cost: ~2.5× FLOPs and ~4× wall-clock vs SFT due to on-policy generation, but eliminates the need for a follow-up restoration phase like Re-invoke.
  • Failure mode: student can inherit teacher’s framing artifacts (“Based on the example…”); masking the loss over the first few tokens is an effective heuristic (this and the EMA teacher update are sketched after this list).
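
Two details from the list above that the earlier sketch omits are the EMA teacher update and the first-token loss mask. The note does not give the decay or mask length, so the constants below are placeholders, not values from the paper:

```python
# Illustrative additions to the earlier sketch (constants are placeholders).
import torch

EMA_DECAY = 0.999        # assumption: typical EMA decay, not from the paper
MASK_FIRST_N = 8         # assumption: number of leading tokens to mask

@torch.no_grad()
def update_ema_teacher(student, teacher, decay=EMA_DECAY):
    # Teacher tracks an exponential moving average of the student weights.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)

def masked_mean_kl(kl, mask_first_n=MASK_FIRST_N):
    # Drop the loss on the first few generated tokens so the student does not
    # copy the teacher's framing artifacts ("Based on the example ...").
    return kl[:, mask_first_n:].mean()
```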

Comments

The conceptual contribution is more interesting than the algorithm. Framing in-context conditioning as a way to extract a per-instance reward from a demonstration is a clean unification of three previously separate ideas: in-context learning, on-policy distillation, and inverse RL. It also explains why SFT forgets: SFT trains on off-policy targets that may be far from the base policy, while SDFT explicitly enforces minimal deviation through the trust-region structure inherited from the IRL derivation.

The relationship to Hübotter et al.’s SDPO (RL via Self-Distillation) is striking — Shenfeld and Hübotter are coauthors here, and the two papers form a natural pair. SDPO uses rich tokenized feedback (e.g., runtime errors) as the conditioning signal \(c\) to extract a reward from a verifiable RL setting; SDFT uses demonstrations as the conditioning signal in a setting without rewards. Both reduce to on-policy reverse-KL distillation against a self-conditioned teacher. This suggests a more general “self-conditioned distillation” template where any auxiliary information that improves the model’s predictions can be turned into a training signal.

The knowledge-acquisition result is the most surprising finding. CPT (the classical recipe for ingesting new factual content) gets 9% strict accuracy; SDFT gets 89%, nearly matching oracle RAG. If this replicates, it would substantially change the post-training pipeline for keeping foundation models current — currently dominated by retrieval-augmentation precisely because parametric updates were thought to be inefficient.

The main caveat is the dependency on strong ICL: at 3B, SDFT underperforms SFT, so the approach is gated on serving large enough base models. Also, the “reasoning models without reasoning data” claim relies on the base model already producing CoT — SDFT preserves a behavior that already exists rather than inducing new behaviors.

Connections

Bibliography

  1. . . "Self-Distillation Enables Continual Learning". https://arxiv.org/abs/2601.19897.