Recovering an unknown reward function from expert demonstrations such that the demonstrated behavior is optimal under it.
It is a classical setting that modern post-training extends through reinforcement learning from human feedback (RLHF), adversarial IRL, and demonstration-conditioned implicit rewards.
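As a rough illustration of the condition "the demonstrated behavior is optimal under it", the sketch below checks which candidate rewards explain a set of demonstrations. Everything in it is an assumption made for the example, not part of any particular method: a hypothetical 4-state chain with deterministic left/right moves, a discount of 0.9, a reward paid on entering a state, and an expert that always moves right. It only tests candidate rewards; it does not search for one.

```python
import numpy as np

# Hypothetical toy setup (all values are assumptions for illustration).
N_STATES, GAMMA = 4, 0.9
ACTIONS = (-1, +1)  # index 0 = left, index 1 = right


def step(state, action_idx):
    """Deterministic chain dynamics, clipped at both ends."""
    return min(max(state + ACTIONS[action_idx], 0), N_STATES - 1)


def q_values(reward, iters=500):
    """Value iteration; reward[s'] is received on entering state s'."""
    v = np.zeros(N_STATES)
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(iters):
        for s in range(N_STATES):
            for a in range(len(ACTIONS)):
                nxt = step(s, a)
                q[s, a] = reward[nxt] + GAMMA * v[nxt]
        v = q.max(axis=1)
    return q


def explains_demos(reward, demos, tol=1e-6):
    """True if every demonstrated (state, action) pair is optimal under `reward`."""
    q = q_values(np.asarray(reward, dtype=float))
    return all(q[s, a] >= q[s].max() - tol for s, a in demos)


# Hypothetical expert that always moves right.
demos = [(0, 1), (1, 1), (2, 1), (3, 1)]

candidates = {
    "goal_at_right": [0, 0, 0, 1],
    "goal_at_left": [1, 0, 0, 0],
    "all_zero": [0, 0, 0, 0],
}
results = {name: explains_demos(r, demos) for name, r in candidates.items()}
print(results)
```

In this toy setup the reward placed at the right end explains the demonstrations and the reward at the left end does not. The all-zero reward also passes, since every action is trivially optimal under it, which is one reason the recovered reward is not unique without further constraints.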