AI RESEARCH

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

arXiv cs.AI

arXiv:2605.29873v1 Announce Type: new

The Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression to both the prefill and decoding caches, but compressing the prefill cache degrades performance by corrupting critical context. Although preserving the prefill cache is essential, decoding-phase compression remains underexplored: existing methods rely on rigid recency windows or instantaneous attention.