Reinforcement learning with verifiable rewards

Reinforcement learning with verifiable rewards (RLVR) is related to RLHF, but instead of relying on human scoring of outputs, it rewards the model based on programmatically verifiable outcomes, such as unit tests for code or machine-checkable math proofs.
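A minimal sketch of what "verifiable" means in practice: the reward is computed by running a check, not by asking a human. The function names, the last-token answer extraction, and the test-case format below are illustrative assumptions, not any specific paper's implementation.

```python
def math_reward(completion: str, target: str) -> float:
    """Binary reward for a math answer.

    Toy extraction rule (an assumption): treat the last whitespace-separated
    token of the completion as the model's final answer.
    """
    answer = completion.strip().split()[-1]
    return 1.0 if answer == target else 0.0


def code_reward(program: str, fn_name: str, cases) -> float:
    """Binary reward for generated code: 1.0 iff it passes every test case.

    `cases` is a list of (args, expected_output) pairs for the function
    named `fn_name` defined in `program`.
    """
    namespace = {}
    try:
        exec(program, namespace)  # run untrusted code: sandbox this in practice
        fn = namespace[fn_name]
        return 1.0 if all(fn(*args) == out for args, out in cases) else 0.0
    except Exception:
        return 0.0  # crashes and wrong definitions both score zero
```

Because the reward is a deterministic program, it can be applied to millions of rollouts with no labeling cost, which is the practical appeal over RLHF-style human preference scoring.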
Links to this note
- Knowledge Base Index
- Notes on: Embarrassingly Simple Self-Distillation Improves Code Generation by Zhang, R., Bai, R. H., Zheng, H., Jaitly, N., Collobert, R., & Zhang, Y. (2026)
- Notes on: Reinforcement Learning via Self-Distillation by Hübotter, J., Lübeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Kleine Buening, T., Guestrin, C. & Krause, A. (2026)
Last changed | authored by Hugo Cisneros