Distillation regime where the student is trained on a fixed dataset of trajectories generated by a different distribution (typically the teacher, an earlier checkpoint, or a static expert corpus) rather than from the student's own current policy.
Standard knowledge distillation and most supervised fine-tuning recipes are off-policy: the training samples do not depend on the student’s evolving distribution, so the student is supervised on states it would not itself visit at inference time.
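A minimal sketch of one off-policy distillation step, assuming HuggingFace-style `student` and `teacher` models that return `.logits`; the function name, arguments, and batch source are illustrative, not from any particular library. The key property is that `input_ids` are drawn from a frozen corpus, so the loss never depends on what the student would generate.

```python
import torch
import torch.nn.functional as F

def off_policy_distill_step(student, teacher, input_ids, optimizer, temperature=1.0):
    """One off-policy distillation step on a pre-generated batch.

    `input_ids` come from a fixed corpus (teacher rollouts, an earlier
    checkpoint, or human demonstrations) and do NOT depend on the
    current student policy.
    """
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits  # fixed supervision targets
    student_logits = student(input_ids).logits

    # Token-level forward KL from the teacher to the student.
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```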
Contrast with on-policy distillation
- Sample source: Off-policy: trajectories from the teacher, a dataset, or another fixed policy. On-policy: trajectories sampled from the current student (see the sketch after this list).
- Distribution mismatch: Off-policy training optimizes the student on states it does not generate, leading to compounding errors at inference (Ross et al. 2011) and, in LLM post-training, to catastrophic forgetting when target trajectories are far from the base policy. On-policy distillation avoids this mismatch by definition.
- Cost: Off-policy is cheap: one forward pass per training token and no generation loop. On-policy requires sampling rollouts during training (~2.5–4× the FLOPs and wall-clock of off-policy SFT in practice).
- Use cases: Off-policy dominates classical model compression (small student matches large teacher logits), SFT from human demonstrations, and context distillation from static prompts. On-policy is preferred when preserving the base policy matters (continual learning, alignment) or when dense credit assignment under the student's own distribution is needed (RL-style post-training).
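The sample-source and distribution-mismatch rows boil down to where training tokens come from. A schematic sketch under the same HuggingFace-style assumptions as above (a `generate` method on the student; `teacher_dataset` and `prompts` are hypothetical iterables of token tensors):

```python
import torch

def off_policy_batches(teacher_dataset):
    # Fixed corpus: teacher rollouts or demonstrations, frozen before training.
    # The batches are independent of the student, so it is supervised on
    # states it may never visit at inference time.
    for token_ids in teacher_dataset:
        yield token_ids

def on_policy_batches(student, prompts, max_new_tokens=128):
    # Rollouts regenerated from the *current* student every step; this extra
    # generation loop is the source of the higher FLOPs and wall-clock cost.
    for prompt_ids in prompts:
        with torch.no_grad():
            yield student.generate(
                prompt_ids,
                max_new_tokens=max_new_tokens,
                do_sample=True,  # sample the student's own trajectories
            )
```

Either generator can feed the same distillation loss; the regimes differ only in whether the batches track the student's evolving distribution.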