IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

arXiv:2605.25475v1 Announce Type: cross Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long-context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, input-dependent distribution of token importance. In this work, we