First AI to Beat Every Human in a Programming Competition - Agentic GRPO Explained

Traditional RL for LLMs treats one answer as one trajectory:

prompt > reasoning > final answer > reward

Agentic systems are different. They:

- call tools
- generate hypotheses
- run tests
- debug code
- summarize context
- revise plans
- loop many times before success

That creates a hard RL problem:

- rewards arrive very late
- trajectories are very long
- the policy changes while rollouts are still running (“off-policy drift”)

Agentic GRPO is meant to stabilize learning in this setting.

First: what is GRPO?

GRPO stands for Group Relative Policy Optimization.
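To make the "group relative" part of the name concrete, here is a minimal sketch of the idea as it is commonly described: sample several answers to the same prompt, then score each one against the mean and spread of its own group instead of against a learned value function. The function name, the reward values, and the use of the population standard deviation are assumptions for illustration, not details from this article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per sampled answer to the SAME prompt.
    `eps` (an assumed small constant) avoids division by zero when every
    answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt:
# two passed the tests (1.0) and two failed (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat their group's average get a positive advantage and the rest get a negative one, so the signal is relative to the group, not absolute. If every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.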