Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

arXiv:2605.20950v1

Vision-Language Models (VLMs) face prohibitive computational costs at inference time because of their massive visual token sequences. Existing visual token reduction methods alleviate this burden, but they tend to preserve only the isolated visual subject strictly aligned with the user's query, and so fail to capture the other salient subjects and their contextual relationships.