VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

arXiv:2605.24675v1 Announce Type: cross

Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies.