Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

arXiv:2605.27243v1 Announce Type: new Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, which requires them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior through retrieval heads in large language models, but the copy-based criterion used to identify them does not directly apply when the evidence appears in images. We