UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

arXiv:2307.00862v3 Announce Type: replace-cross Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging because they require a model to reason about the semantics of both the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-