Reward hacking

tags: Reinforcement learning, Reinforcement learning with verifiable rewards, GRPO

Pathologies in which an agent exploits the literal structure of the reward rather than pursuing the intended objective (e.g., spamming tool calls without any gain in accuracy).
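A minimal sketch of the tool-spamming case, under stated assumptions. The `Episode` type, the `naive_reward` and `outcome_reward` functions, and the `tool_bonus` and `max_rewarded_calls` parameters are all hypothetical names invented for illustration; they do not come from any particular library or training setup. The point is only that a reward which pays per tool call can rank an incorrect, tool-heavy rollout above a correct one.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    correct: bool    # did the final answer pass the verifier?
    tool_calls: int  # number of tool invocations in the rollout


def naive_reward(ep: Episode, tool_bonus: float = 0.1) -> float:
    # Hypothetical shaped reward: pays a bonus for every tool call,
    # whether or not the call helped reach a correct answer.
    return float(ep.correct) + tool_bonus * ep.tool_calls


def outcome_reward(ep: Episode, tool_bonus: float = 0.1,
                   max_rewarded_calls: int = 2) -> float:
    # One possible mitigation (an assumption, not a standard recipe):
    # the bonus is paid only when the answer is correct, and it is capped.
    if not ep.correct:
        return 0.0
    return 1.0 + tool_bonus * min(ep.tool_calls, max_rewarded_calls)


careful = Episode(correct=True, tool_calls=1)
spammer = Episode(correct=False, tool_calls=20)

# Under the naive reward the spammer outscores the careful rollout.
print("naive:  ", naive_reward(careful), naive_reward(spammer))
# Under the outcome-gated reward the ordering is restored.
print("outcome:", outcome_reward(careful), outcome_reward(spammer))
```

Gating the bonus on correctness is only one way to close this particular loophole; other reward designs may open different ones.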

Last changed 2026.04.19 | authored by Hugo Cisneros
