AI RESEARCH
Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR
arXiv CS.AI
arXiv:2602.12642v2 Announce Type: replace-cross Reward-maximizing RL methods have been shown to enhance the reasoning performance of LLMs, but they often reduce generation diversity. Recent works address this issue by adopting GFlowNets