Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

arXiv:2605.30912v1 Announce Type: cross Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We