Reward hacking

tags
Reinforcement learning, Reinforcement learning with verifiable rewards, GRPO

Pathologies where agents exploit literal reward structure (e.g., spamming tool use without accuracy gains).

Links to this note

Last changed | authored by

Comments

Loading comments...

Leave a comment

Back to Notes