Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

arXiv:2601.07737v2 Announce Type: replace-cross Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their ability to process action scenes that contradict everyday common sense remains undertested. To address this gap, we