Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

ArXi:2601.15158v4 Announce Type: replace-cross Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite.