BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

arXiv:2605.27293v1 Announce Type: new Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We