Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

arXiv:2602.12642v2 Announce Type: replace-cross Reward-maximizing RL methods have been shown to enhance the reasoning performance of LLMs, but they often reduce generation diversity. Recent works address this issue by adopting GFlowNets