Reinforcement learning with verifiable rewards

Reinforcement learning with verifiable rewards (RLVR) is related to RLHF, but instead of relying on human scoring of outputs, it rewards the model based on programmatically verifiable outcomes, such as unit tests for code or machine-checkable math proofs.
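A minimal sketch of what "verifiable" means in practice: the reward is computed by running a check, not by asking a human. The function names, the last-token answer extraction, and the test-case format below are illustrative assumptions, not any specific paper's implementation.

```python
def math_reward(completion: str, target: str) -> float:
    """Binary reward for a math answer.

    Toy extraction rule (an assumption): treat the last whitespace-separated
    token of the completion as the model's final answer.
    """
    answer = completion.strip().split()[-1]
    return 1.0 if answer == target else 0.0


def code_reward(program: str, fn_name: str, cases) -> float:
    """Binary reward for generated code: 1.0 iff it passes every test case.

    `cases` is a list of (args, expected_output) pairs for the function
    named `fn_name` defined in `program`.
    """
    namespace = {}
    try:
        exec(program, namespace)  # run untrusted code: sandbox this in practice
        fn = namespace[fn_name]
        return 1.0 if all(fn(*args) == out for args, out in cases) else 0.0
    except Exception:
        return 0.0  # crashes and wrong definitions both score zero
```

Because the reward is a deterministic program, it can be applied to millions of rollouts with no labeling cost, which is the practical appeal over RLHF-style human preference scoring.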
Links to this note
- Knowledge Base Index
- Notes on: Embarrassingly Simple Self-Distillation Improves Code Generation by Zhang, R., Bai, R. H., Zheng, H., Jaitly, N., Collobert, R., & Zhang, Y. (2026)
- Notes on: Reinforcement Learning via Self-Distillation by Hübotter, J., Lübeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Kleine Buening, T., Guestrin, C. & Krause, A. (2026)
Last changed | authored by Hugo Cisneros