XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

arXiv:2510.06672v3 Announce Type: replace Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes