AI RESEARCH
RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
arXiv CS.AI
arXiv:2606.01281v1 Announce Type: cross Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, its effectiveness is substantially hindered by the prevalence of ineffective