Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

arXiv:2606.03604v1 Announce Type: new When asked what a meme or sarcastic post means, Large Vision-Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal content with its pragmatic meaning, letting surface-level details contaminate the final response.