Off-policy adaptation of a pretrained model by training on (input, target) pairs from expert demonstrations using cross-entropy loss.
It is the dominant post-training recipe for skill and knowledge injection but prone to catastrophic forgetting due to its off-policy nature
Loading comments...