VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

arXiv:2605.20676v1

Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current evaluations of multimodal large language models (MLLMs) do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We