A distillation regime in which the student samples its own trajectories and minimizes a divergence to the teacher's distribution evaluated on those samples.
It combines the dense, per-token credit assignment of distillation with the distribution-matching guarantees of on-policy learning.
Contrast with off-policy distillation, where the student is trained on trajectories generated by a fixed teacher, a static dataset, or an earlier checkpoint rather than by its own current policy. The off-policy regime is cheaper (no generation loop) but suffers from distribution mismatch: the student is supervised on states it would not itself visit, leading to compounding errors at inference and, in LLM post-training, to catastrophic forgetting when the training targets drift from the base policy. On-policy distillation pays the extra sampling cost to eliminate that mismatch.
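A minimal sketch of one training step, assuming HF-style models whose `generate` and forward calls behave as below (the model names, function, and hyperparameters are illustrative, not from the source). It computes a per-token reverse KL on the student's own samples; as is common in practice, the gradient through the sampling step itself is dropped, and only the student's log-probabilities on the sampled tokens carry gradient:

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids, optimizer,
                           max_new_tokens=128):
    """One on-policy distillation step (hypothetical helper):
    sample trajectories from the student, then minimize the
    per-token reverse KL to the teacher on those samples."""
    # 1. Student generates its own trajectories (no gradients through sampling).
    with torch.no_grad():
        sequences = student.generate(prompt_ids, max_new_tokens=max_new_tokens)

    # 2. Score the sampled sequences under both models.
    #    logits[:, t] predicts token t+1, so drop the final position.
    student_logits = student(sequences).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits[:, :-1]

    # 3. Reverse KL, KL(student || teacher), evaluated at every
    #    student-visited state -- a dense per-token signal.
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    kl_per_token = torch.sum(
        student_logp.exp() * (student_logp - teacher_logp), dim=-1)
    # In practice, mask out prompt and padding positions before averaging.
    loss = kl_per_token.mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Swapping the sampled `sequences` for pre-generated teacher trajectories (and deleting the generation step) recovers the off-policy regime described above.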