One-Way Policy Optimization for Self-Evolving LLMs

arXiv:2605.22156v1 Announce Type: cross Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling the reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize