UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

arXiv CS.CL

arXiv:2605.27740v1 Announce Type: new

Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but accurately and cheaply estimating cache importance, for both
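
As a rough illustration of the general top-k sparse attention idea mentioned in the abstract (not the UNIQUE method itself, whose details are not given here), the following minimal pure-Python sketch attends over only the k highest-scoring cache entries for a single query. The function name `topk_sparse_attention` and its parameters are hypothetical. The sketch also assumes that importance is the exact scaled dot-product score, which requires reading every key; a practical method would replace that step with a cheaper estimate.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Single-query attention restricted to the k highest-scoring cache entries.

    Illustrative only: importance is the exact scaled dot product, so every
    key is still read. Real systems would use a cheaper importance estimate.
    """
    # Score every cached key against the query (scaled dot product).
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]

    # Keep the indices of the k highest-scoring cache entries.
    k = min(k, len(keys))
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax over the selected entries only (max-subtracted for stability).
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)

    # Weighted sum of the selected value vectors; the rest are never loaded.
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, top):
        for d in range(dim):
            out[d] += (w / total) * values[i][d]
    return out, sorted(top)


# Toy cache with four entries; only two are used for the output.
query = [1.0, 0.0]
keys = [[10.0, 0.0], [0.0, 10.0], [9.0, 0.0], [-5.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [9.0, 9.0]]
output, selected = topk_sparse_attention(query, keys, values, k=2)
```

With `k` equal to the cache length, the function reduces to ordinary dense softmax attention, so `k` directly controls how much of the cache contributes to the output.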