AI RESEARCH

OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

arXiv cs.AI

arXiv:2605.29657v1 Announce Type: cross Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a.
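To make the absolute-ranking paradigm described above concrete, here is a minimal sketch of fixed top-K pruning. It illustrates the baseline the abstract criticizes, not OccamToken itself. The function name `top_k_prune` and the score values are hypothetical, chosen only for illustration; how real methods compute importance scores is not specified here.

```python
from typing import List, Sequence


def top_k_prune(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in original order."""
    if k < 0:
        raise ValueError("k must be non-negative")
    k = min(k, len(scores))
    # Rank token indices by score, highest first; ties keep the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore positional order so the kept tokens remain a valid subsequence.
    return sorted(ranked[:k])


# Hypothetical importance scores for 8 visual tokens (made-up values).
# Token 0 mimics an attention sink: a large score unrelated to its content.
scores = [9.0, 0.2, 0.1, 1.5, 0.3, 1.4, 0.1, 0.2]

kept = top_k_prune(scores, k=3)
print(kept)  # prints [0, 3, 5]
```

In this toy input the sink-like token 0 always occupies one of the K slots regardless of what it encodes, and K stays at 3 whether the image is highly redundant or densely informative. Those are the two failure modes the abstract attributes to fixed top-K selection.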