Locality Matters for Training-Free Audio Token Compression in Audio-Language Models

arXiv:2605.25179v1

Audio-language models (ALMs) are increasingly used for audio captioning, question answering, and open-ended audio understanding, but their inference cost remains high when audio inputs are represented as long prefix-token sequences. These audio prefixes consume context budget, increase memory usage, and make deployment harder in resource-constrained or latency-sensitive settings. Existing