AI RESEARCH
BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning
arXiv cs.LG
arXiv:2605.27293v1 Announce Type: new Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We