Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

arXiv:2605.22217v1 Announce Type: cross Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the.