Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

arXiv:2605.27315v1 Announce Type: new

Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context when making lexical judgments. We use human concreteness and imagery ratings because they span words with varying expected visual relevance, from abstract, low-imagery words to concrete, high-imagery words.