AI RESEARCH

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

arXiv CS.AI

arXiv:2606.00144v1 Announce Type: cross Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K to 16K tokens of context) is common in real applications.
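
The propose-then-verify loop described above can be illustrated with a minimal sketch. It assumes greedy decoding with exact-match acceptance, which is a simplification and not necessarily the acceptance rule used in the paper. The names `speculative_step`, `toy_draft`, and `toy_verify` are hypothetical, and the sparse KV cache and KV budget are not modeled here.

```python
from typing import Callable, List, Sequence

# A "model" here is any function mapping a token prefix to its next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(
    prefix: Sequence[int],
    draft_next: NextToken,
    verify_next: NextToken,
    k: int,
) -> List[int]:
    """Run one draft-and-verify round and return the newly committed tokens."""
    # 1) The drafter proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(list(prefix) + proposal))

    # 2) The verifier checks each proposed position. A real system would score
    #    all k positions in one batched forward pass; the loop is for clarity.
    committed: List[int] = []
    for token in proposal:
        target = verify_next(list(prefix) + committed)
        if token != target:
            # First mismatch: keep the verifier's token and discard the rest.
            committed.append(target)
            return committed
        committed.append(token)

    # 3) Every draft token was accepted, so the verifier adds one more token.
    committed.append(verify_next(list(prefix) + committed))
    return committed


# Toy models (assumptions for illustration only): the verifier counts upward
# modulo 10, and the drafter agrees except right after the token 3.
def toy_verify(tokens: Sequence[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: Sequence[int]) -> int:
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_verify, k=4))
    print(speculative_step([5], toy_draft, toy_verify, k=2))
```

In this toy setting, the number of tokens committed per round depends on how often the drafter agrees with the verifier, which is why the acceptance rate matters for end-to-end speedup.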