Value-Gradient Hypothesis of RL for LLMs

arXiv:2605.21654v1 Announce Type: cross Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-