Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

arXiv:2605.28640v1 Announce Type: new Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work