AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference

arXiv:2605.29535v1

Vision-Language Models (VLMs) process thousands of visual tokens per image alongside comparatively few text tokens, yet existing compression methods treat both modalities uniformly. We observe that the two modalities have fundamentally different properties: vision tokens are spatially redundant and dominate prefill, while text tokens are causally dependent and accumulate during decoding.
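
To make the asymmetry concrete, the following is a minimal back-of-the-envelope sketch, not the paper's method. The function names and the token counts (2,000 vision tokens, 50 prompt text tokens) are hypothetical values chosen only for illustration. It shows that vision tokens make up nearly all of the prefill sequence, while the text portion of the sequence is the only part that grows as decoding proceeds.

```python
def prefill_vision_share(num_vision_tokens: int, num_text_tokens: int) -> float:
    """Fraction of the prefill sequence made up of vision tokens."""
    total = num_vision_tokens + num_text_tokens
    return num_vision_tokens / total


def text_tokens_at_step(num_prompt_text_tokens: int, decode_step: int) -> int:
    """Text tokens in context after `decode_step` generated tokens.

    The vision token count is fixed after prefill; only text accumulates.
    """
    return num_prompt_text_tokens + decode_step


# Hypothetical token counts, assumed for illustration only.
vision_tokens = 2000
prompt_text_tokens = 50

share = prefill_vision_share(vision_tokens, prompt_text_tokens)
print(f"vision share of prefill: {share:.1%}")

for step in (0, 100, 1000):
    text_now = text_tokens_at_step(prompt_text_tokens, step)
    print(f"decode step {step}: {vision_tokens} vision tokens, {text_now} text tokens")
```

Under these assumed counts, pruning applied during prefill mostly affects vision tokens, whereas any savings during long generations must come from the text side, which is the distinction the abstract draws.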