IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning

arXiv:2605.23997v1 Announce Type: cross Multimodal large language models trained via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visual hallucinations and logical errors. Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to facilitate downstream reasoning.