ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

arXiv:2602.07574v2 Announce Type: replace-cross

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens together at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers.
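To give a rough sense of the overhead at stake, the sketch below counts query-key pairs in the attention score matrices under two layouts: dense self-attention over all visual and textual tokens at every layer, and text-only self-attention with text-to-vision cross-attention at a few layers. This is a back-of-envelope illustration, not the paper's cost model. The function names, the token counts, the layer counts, and the choice to ignore causal masking, attention heads, and feed-forward cost are all assumptions made for the example.

```python
def dense_pairs(n_vis: int, n_txt: int, n_layers: int) -> int:
    """Query-key pairs when every layer self-attends over all tokens.

    Assumption: full (unmasked) attention, so each layer scores
    (n_vis + n_txt) ** 2 pairs.
    """
    return n_layers * (n_vis + n_txt) ** 2


def sparse_pairs(n_vis: int, n_txt: int, n_layers: int, n_cross: int) -> int:
    """Query-key pairs when only text tokens self-attend at every layer
    and text queries cross-attend to visual keys at n_cross layers.

    Assumption: visual tokens are never used as queries.
    """
    if not 0 <= n_cross <= n_layers:
        raise ValueError("n_cross must be between 0 and n_layers")
    text_self = n_layers * n_txt ** 2
    cross = n_cross * n_txt * n_vis
    return text_self + cross


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    n_vis, n_txt, n_layers, n_cross = 576, 64, 32, 4
    dense = dense_pairs(n_vis, n_txt, n_layers)
    sparse = sparse_pairs(n_vis, n_txt, n_layers, n_cross)
    print(f"dense:  {dense} pairs")
    print(f"sparse: {sparse} pairs")
    print(f"ratio:  {dense / sparse:.1f}x")
```

Because visual tokens typically far outnumber text tokens, the quadratic term in the dense layout is dominated by vision-to-vision pairs, which is the part a cross-attention layout of this kind would avoid.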