Vector Policy Optimization: Training for Diversity Improves Test-Time Search

arXiv:2605.22817v1 Announce Type: cross Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-