AI RESEARCH

Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

arXiv cs.CL

arXiv:2509.06948v3 Announce Type: replace Supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) are two widely used post-