FGRPO: Federated GRPO with Adaptive Aggregation on Non-IID Data

arXiv:2606.03094v1 Announce Type: new Recent advances in language models have established reinforcement learning as the primary paradigm for eliciting self-correction and long-chain reasoning. While group relative policy optimization (GRPO) scales well because it eliminates the critic network, deploying it on centralized infrastructure requires collecting large volumes of data from distributed owners, which poses significant privacy risks. To address these concerns, we