iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

arXiv:2605.31096v1 Announce Type: new While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm for enhancing fine-grained perception in multimodal large language models (MLLMs), its efficacy at inference time remains underexplored. In this work, we find empirically that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding.