Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

arXiv:2512.19673v3 Announce Type: replace-cross Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via the Transformer's residual stream.
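
To make the decomposition concrete, the following is a minimal sketch of one way a residual stream can be read as a set of internal policies. It is not the paper's implementation: the abstract does not give the exact definitions, so the sketch assumes that a layer policy is the softmax of the unembedding applied to the residual stream after that layer, and that a modular policy is the same readout applied to a single attention or feed-forward contribution. Layer normalization is omitted, all tensors are random toy values, and every name (`decompose`, `unembed`, `attn_outs`, `ffn_outs`) is hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x dim) matrix by a length-dim vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def random_vector(rng, dim):
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def decompose(h0, attn_outs, ffn_outs, unembed):
    """Read token distributions off a toy residual stream.

    Assumption: each layer adds its attention and feed-forward outputs to the
    stream, so h_l = h_0 + sum_{i <= l} (attn_i + ffn_i).
    """
    layer_policies = []   # one distribution per layer (accumulated stream)
    module_policies = []  # one distribution per module (single contribution)
    h = list(h0)
    for attn, ffn in zip(attn_outs, ffn_outs):
        h = add(add(h, attn), ffn)
        layer_policies.append(softmax(matvec(unembed, h)))
        module_policies.append({
            "attn": softmax(matvec(unembed, attn)),
            "ffn": softmax(matvec(unembed, ffn)),
        })
    return h, layer_policies, module_policies


# Toy dimensions, chosen only for illustration.
D_MODEL, VOCAB, N_LAYERS = 8, 5, 3
rng = random.Random(0)

h0 = random_vector(rng, D_MODEL)
attn_outs = [random_vector(rng, D_MODEL) for _ in range(N_LAYERS)]
ffn_outs = [random_vector(rng, D_MODEL) for _ in range(N_LAYERS)]
unembed = [random_vector(rng, D_MODEL) for _ in range(VOCAB)]

final_h, layer_policies, module_policies = decompose(
    h0, attn_outs, ffn_outs, unembed
)

for index, policy in enumerate(layer_policies, start=1):
    print(f"layer {index}:", [round(p, 3) for p in policy])
```

Under these assumptions the last layer policy coincides with the model's ordinary output distribution, and because the unembedding is linear, the final logits are the sum of the logits contributed by the embedding and by each module. That additivity is what makes a per-layer and per-module reading of the policy possible in this toy setting.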