Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

arXiv:2605.28109v1 Announce Type: new Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We