Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

arXiv:2605.27894v1 Announce Type: new Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. In real-world applications, sensors may be deactivated (e.g., cameras are unavailable due to data privacy), yielding modality-incomplete data and leading to inconsistency between