Multimodal Function Vectors for Visual Relations

arXiv:2510.02528v2

Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from few multimodal demonstrations, yet the internal mechanisms underlying such task learning remain opaque. Building on prior work on Large Language Models, we show that a small subset of attention heads in LMMs is responsible for transmitting representations of visual relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM's performance on relational tasks.
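To make the idea of extracting and manipulating a function vector concrete, here is a minimal sketch on synthetic data. It assumes the recipe common in function-vector work on language models: average each selected head's output over a set of task prompts, sum those averages into one vector, and add that vector to a hidden state. The array shapes, the helper names `build_function_vector` and `inject`, the chosen head indices, and the `scale` parameter are all illustrative assumptions, not details taken from this paper.

```python
import numpy as np


def build_function_vector(head_outputs, selected_heads):
    """Sum the prompt-averaged outputs of the selected attention heads.

    head_outputs: array of shape (n_prompts, n_layers, n_heads, d_model),
        assumed to hold each head's contribution to the residual stream
        at the final token position of each prompt.
    selected_heads: list of (layer, head) index pairs (hypothetical here;
        in practice these would be chosen by some causal analysis).
    """
    # Average over prompts: result has shape (n_layers, n_heads, d_model).
    mean_outputs = head_outputs.mean(axis=0)
    function_vector = np.zeros(mean_outputs.shape[-1])
    for layer, head in selected_heads:
        function_vector += mean_outputs[layer, head]
    return function_vector


def inject(hidden_state, function_vector, scale=1.0):
    """Steer a hidden state by adding a scaled function vector to it."""
    return hidden_state + scale * function_vector


# Synthetic stand-in for cached activations: 8 prompts, 4 layers,
# 6 heads per layer, model width 16.
rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(8, 4, 6, 16))

# Hypothetical choice of heads, for illustration only.
selected_heads = [(1, 2), (3, 0)]

fv = build_function_vector(head_outputs, selected_heads)
steered = inject(np.zeros(16), fv, scale=2.0)
```

In a real model, the random array would be replaced by activations cached from forward passes over relational prompts, and the injection would happen at a chosen layer during inference.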