Reinforcement learning with verifiable rewards
This approach is related to RLHF, but instead of relying on human scoring of outputs, it derives the reward from programmatically verifiable outcomes (such as unit tests for code or checkable math proofs).
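The core idea can be sketched as a reward function that executes a candidate solution against known checks and returns a binary signal. This is a minimal toy illustration, not any specific system's implementation; the function and variable names here are hypothetical.

```python
def verifiable_reward(candidate_fn, test_cases):
    """Binary verifiable reward: 1.0 if the candidate passes every
    unit test, 0.0 otherwise. No human judgment is involved."""
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return 0.0
        except Exception:
            # A crashing candidate earns no reward.
            return 0.0
    return 1.0

# Two hypothetical model outputs implementing abs():
good = lambda x: x if x >= 0 else -x
bad = lambda x: x  # wrong for negative inputs

tests = [((3,), 3), ((-5,), 5), ((0,), 0)]
print(verifiable_reward(good, tests))  # → 1.0
print(verifiable_reward(bad, tests))   # → 0.0
```

In an RL training loop, this scalar would replace the learned reward model used in RLHF; because it is computed by running code rather than by human preference, it is cheap, reproducible, and harder (though not impossible) to game.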
Links to this note
- Notes on: DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning by Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu (2025)
- Agentic reinforcement learning
- Knowledge Base Index
- Notes on: Embarrassingly Simple Self-Distillation Improves Code Generation by Zhang, R., Bai, R. H., Zheng, H., Jaitly, N., Collobert, R., & Zhang, Y. (2026)
- Notes on: GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery by Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang, Bo Du (2026)
- Notes on: Reinforcement Learning via Self-Distillation by Hübotter, J., Lübeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Kleine Buening, T., Guestrin, C. & Krause, A. (2026)
- Notes on: Self-Distillation Enables Continual Learning by Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal (2026)
- Reward hacking
- Reward shaping
Authored by Hugo Cisneros