StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

arXiv:2606.00148v1 Announce Type: cross

Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right-or-wrong signal. We