Vegas: Self-Speculative Decoding with Verification-Guided Sparse Attention

arXiv:2602.07223v2

Long-context large language model (LLM) inference has become the norm for today's AI applications. However, it is severely bottlenecked by the growing memory demands of the KV cache. Previous work has shown that self-speculative decoding with sparse attention, in which tokens are drafted using a subset of the KV cache and then verified in parallel against the full KV cache, speeds up inference without changing the output.
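
The draft-then-verify loop described above can be illustrated with a toy sketch. This is not the paper's implementation. It assumes greedy decoding, and every name in it (`next_token`, `window`, `draft_len`) is hypothetical. A simple deterministic function stands in for the model, and a sliding window over the context stands in for the sparse subset of the KV cache.

```python
from typing import List, Optional


def next_token(context: List[int], window: Optional[int] = None) -> int:
    """Toy stand-in for one greedy decoding step (hypothetical, not a real model).

    window=None mimics attention over the full KV cache; an integer window
    mimics sparse attention that only sees the most recent `window` tokens.
    """
    visible = context if window is None else context[-window:]
    return (sum(visible) * 31 + len(visible)) % 50


def full_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: ordinary greedy decoding with the full context at every step."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def self_speculative_decode(
    prompt: List[int], n_new: int, draft_len: int = 4, window: int = 8
) -> List[int]:
    """Draft with the sparse view, verify with the full view, keep only matches."""
    out = list(prompt)
    target = len(prompt) + n_new
    while len(out) < target:
        # Draft phase: cheap steps that only see the last `window` tokens.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(next_token(out + draft, window))

        # Verify phase: the full view checks every draft position. A real
        # system would do this in one parallel forward pass; the toy loops.
        accepted: List[int] = []
        for i in range(draft_len):
            expected = next_token(out + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # replace the first mismatch, stop
                break
            accepted.append(draft[i])
        else:
            # Every draft token matched, so the verifier yields one extra token.
            accepted.append(next_token(out + draft))

        out.extend(accepted)
    return out[:target]
```

Because every kept token is either confirmed or supplied by the full-context step, the sketch returns exactly what `full_decode` returns for the same prompt. That is the sense in which the approach is lossless under the greedy assumption made here. Any speedup in a real system depends on how many draft tokens are accepted per verification pass.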