Off-Policy Learning to Reason Works Because It Is More Pessimistic Than You Think

arXiv:2605.28150v1 Announce Type: new Large-scale reinforcement learning has become a central tool for improving reasoning in large language models. At this scale, generation is often lagged or asynchronous, so updates are performed on data collected by older policies, which makes learning inherently off-policy. Most existing approaches nevertheless remain rooted in PPO-style trust-region objectives, treating