AI RESEARCH
ART: Attention Run-time Termination for Efficient Large Language Model Decoding
arXiv CS.CL
arXiv:2606.00024v1 Announce Type: new Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth required to fetch the extensive Key-Value (KV) cache. Most existing KV cache management methods prune on keys alone before decoding, even though evidence shows that attention outputs depend jointly on keys and values, because incorporating values into these methods incurs prohibitive additional overhead.
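
The joint dependence on keys and values can be illustrated with a toy single-query attention step. This is a minimal sketch and not the ART method itself: the vectors, the helper names (`softmax`, `l2_norm`, `token_contributions`), and the use of the weighted value norm as a contribution measure are all assumptions chosen for illustration. It shows that ranking cached tokens by attention weight alone (a key-only signal) can disagree with ranking them by how much each token actually adds to the attention output.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def token_contributions(query, keys, values):
    """Return (attention weights, norm of each token's term w_i * v_i)."""
    d = len(query)
    # Scaled dot-product scores between the query and every cached key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is sum_i w_i * v_i, so token i contributes w_i * ||v_i|| in norm.
    contribs = [w * l2_norm(v) for w, v in zip(weights, values)]
    return weights, contribs

# Hypothetical toy cache: token 0 has the best-matching key but a near-zero value.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
values = [[0.01, 0.0], [3.0, 4.0], [1.0, 0.0]]

weights, contribs = token_contributions(query, keys, values)

# Key-only view: rank tokens by attention weight.
rank_by_key = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
# Key-and-value view: rank tokens by their contribution to the output.
rank_by_output = sorted(range(len(keys)), key=lambda i: contribs[i], reverse=True)

print("rank by attention weight:", rank_by_key)
print("rank by output contribution:", rank_by_output)
```

In this toy setup the token with the largest attention weight contributes the least to the output, so a key-only criterion would keep it ahead of tokens that matter more. Note that computing the contribution requires reading every value vector, which hints at the additional overhead the abstract mentions.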