TGV-KV: Text-Grounded KV Eviction for Vision-Language Models

arXiv:2606.03075v1 Announce Type: new

Vision-Language Models (VLMs) inherit the autoregressive generation paradigm and cache the keys and values (KV) of all previous tokens to accelerate inference, so memory consumption scales linearly with context length. The problem is particularly pronounced in VLMs because of the substantial redundancy in the visual modality.
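
To make the linear scaling concrete, the following is a minimal back-of-the-envelope sketch rather than anything taken from the paper. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 KV heads, head dimension 128, 16-bit storage) are illustrative assumptions, as are the token counts chosen for the text prompt and the image.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Hypothetical decoder configuration (assumed, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)

# Assumed token counts: a short text prompt versus one image's visual tokens.
text_tokens, visual_tokens = 64, 2048
text_bytes = kv_cache_bytes(text_tokens, LAYERS, KV_HEADS, HEAD_DIM)
visual_bytes = kv_cache_bytes(visual_tokens, LAYERS, KV_HEADS, HEAD_DIM)

print(f"per token: {per_token / 2**10:.0f} KiB")
print(f"text:      {text_bytes / 2**20:.0f} MiB")
print(f"visual:    {visual_bytes / 2**20:.0f} MiB")
```

Under these assumptions the cache costs the same fixed amount per token, so the total grows in direct proportion to context length, and the visual tokens account for most of it. That is why evicting redundant visual entries is a natural target for reducing memory.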