Distillation regime where the student is trained on a fixed dataset of trajectories generated by a different distribution (typically the teacher, an earlier checkpoint, or a static expert corpus) rather than from the student's own current policy.
Standard knowledge distillation and most supervised fine-tuning recipes are off-policy: the training samples do not depend on the student’s evolving distribution, so the student is supervised on states it would not itself visit at inference time.
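A minimal sketch of one off-policy distillation step, assuming HuggingFace-style `student` and `teacher` models that return `.logits`; the function name, arguments, and batch source are illustrative, not from any particular library. The key property is that `input_ids` are drawn from a frozen corpus, so the loss never depends on what the student would generate.

```python
import torch
import torch.nn.functional as F

def off_policy_distill_step(student, teacher, input_ids, optimizer, temperature=1.0):
    """One off-policy distillation step on a pre-generated batch.

    `input_ids` come from a fixed corpus (teacher rollouts, an earlier
    checkpoint, or human demonstrations) and do NOT depend on the
    current student policy.
    """
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits  # fixed supervision targets
    student_logits = student(input_ids).logits

    # Token-level forward KL from the teacher to the student.
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```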
Contrast with on-policy distillation
- Sample source: Off-policy: trajectories from the teacher, a dataset, or another fixed policy. On-policy: trajectories sampled from the current student (see the sketch after this list).
- Distribution mismatch: Off-policy training optimizes the student on states it does not generate, leading to compounding errors at inference (Ross et al. 2011) and, in LLM post-training, to catastrophic forgetting when target trajectories are far from the base policy. On-policy distillation avoids this mismatch by definition.
- Cost: Off-policy is cheap: one forward pass per training token and no generation loop. On-policy requires sampling rollouts during training (~2.5–4× the FLOPs and wall-clock of off-policy SFT in practice).
- Use cases: Off-policy dominates classical model compression (small student matches large teacher logits), SFT from human demonstrations, and context distillation from static prompts. On-policy is preferred when preserving the base policy matters (continual learning, alignment) or when dense credit assignment under the student's own distribution is needed (RL-style post-training).
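The sample-source and distribution-mismatch rows boil down to where training tokens come from. A schematic sketch under the same HuggingFace-style assumptions as above (a `generate` method on the student; `teacher_dataset` and `prompts` are hypothetical iterables of token tensors):

```python
import torch

def off_policy_batches(teacher_dataset):
    # Fixed corpus: teacher rollouts or demonstrations, frozen before training.
    # The batches are independent of the student, so it is supervised on
    # states it may never visit at inference time.
    for token_ids in teacher_dataset:
        yield token_ids

def on_policy_batches(student, prompts, max_new_tokens=128):
    # Rollouts regenerated from the *current* student every step; this extra
    # generation loop is the source of the higher FLOPs and wall-clock cost.
    for prompt_ids in prompts:
        with torch.no_grad():
            yield student.generate(
                prompt_ids,
                max_new_tokens=max_new_tokens,
                do_sample=True,  # sample the student's own trajectories
            )
```

Either generator can feed the same distillation loss; the regimes differ only in whether the batches track the student's evolving distribution.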