DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

ArXi:2509.20912v4 Announce Type: replace Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence-answer consistency, where correct answers must be ed by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers. Our approach integrates three complementary