Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality

arXiv:2603.01471v2

Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and