Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

arXiv:2606.01160v1 Announce Type: new Large Language Models (LLMs) are increasingly used with formal interactive theorem provers such as Lean 4. Scaling these systems with reinforcement learning or search methods requires process reward models (PRMs) that can evaluate intermediate reasoning steps. Existing reward-model designs expose a practical trade-off.