ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents

arXiv:2606.03239v1 Announce Type: new

LLM-based search agents are trained predominantly with outcome-only rewards, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups, in which all sampled trajectories share the same correctness: every trajectory receives the same reward, so the within-group advantage is zero and the group contributes no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use.
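To make the degenerate case concrete, here is a minimal sketch of a group-normalized advantage. It assumes a GRPO-style estimator (reward minus group mean, divided by group standard deviation), which the abstract does not name. The function `group_advantages` and the `eps` stabilizer are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (assumed GRPO-style; not taken from the paper).

    Each trajectory's advantage is its reward minus the group mean,
    divided by the group's population standard deviation plus eps.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed group: correct and incorrect trajectories get opposite-signed advantages.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Outcome-homogeneous groups: every reward equals the mean, so every
# advantage is exactly zero and the group yields no policy gradient.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_correct)
print(all_wrong)
```

Under this assumption, a query that the agent always solves or always fails provides no learning signal from outcome reward alone. This is the gap that process-level rewards are meant to fill.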