AI RESEARCH

EntmaxKV: Support-Aware Decoding for Entmax Attention

arXiv CS.CL

arXiv:2605.21649v1 Announce Type: cross

Long-context decoding is increasingly limited by KV-cache memory traffic, since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but they are designed for softmax attention, whose dense tails mean that any truncation discards nonzero probability mass.
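
To make the point about dense tails concrete, here is a minimal sketch that is not taken from the paper. It compares softmax with sparsemax, the alpha = 2 special case of entmax, on a toy score vector. The function names (`softmax`, `sparsemax`, `dropped_mass`) and the scores are illustrative assumptions. The code only shows that truncating a softmax distribution always drops some probability mass, while a sparse distribution can be truncated to its support with no loss.

```python
import math


def softmax(scores):
    """Standard softmax: every output is strictly positive."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparsemax (entmax with alpha = 2): Euclidean projection onto the simplex.

    Illustrative stand-in for entmax attention; not the paper's implementation.
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        if 1 + k * zk > cumsum:
            # k is still inside the support; update the threshold
            tau = (cumsum - 1) / k
        else:
            break
    return [max(s - tau, 0.0) for s in scores]


def dropped_mass(probs, k):
    """Probability mass lost when only the k largest entries are kept."""
    return sum(sorted(probs, reverse=True)[k:])


# Toy attention scores for one query over six cached tokens (hypothetical values).
scores = [4.0, 3.5, 1.0, 0.0, -1.0, -2.0]

soft = softmax(scores)
sparse = sparsemax(scores)

print("softmax  :", [round(p, 4) for p in soft])
print("sparsemax:", sparse)
print("softmax mass dropped by top-2  :", dropped_mass(soft, 2))
print("sparsemax mass dropped by top-2:", dropped_mass(sparse, 2))
```

In this toy case sparsemax assigns nonzero weight to only two tokens, so keeping just those two entries reproduces the distribution exactly, whereas the same truncation under softmax loses a small but nonzero amount of mass.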