AI RESEARCH
BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
arXiv cs.AI
arXiv:2606.00144v1 Announce Type: cross Speculative decoding speeds up autoregressive decoding by having a drafter propose multiple tokens that a verifier then validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K to 16K tokens of context) is common in real applications.
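To make the draft-then-verify loop concrete, here is a minimal toy sketch of greedy speculative decoding. It is not the BudgetDraft method, which the text above does not detail. Every name in it (`full_verifier`, `window_drafter`, `speculative_step`, `generate`, `greedy_decode`) is hypothetical, the "models" are simple arithmetic functions, and the sliding window is only an assumed stand-in for a sparse KV cache under a fixed budget.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]

def full_verifier(seq: List[int]) -> int:
    # Toy verifier: sees the whole context (stand-in for a full KV cache).
    return (sum(seq) + 1) % 7

def window_drafter(seq: List[int], window: int = 4) -> int:
    # Toy drafter: sees only the last `window` tokens
    # (assumed stand-in for a sparse KV cache with a fixed budget).
    return (sum(seq[-window:]) + 1) % 7

def speculative_step(prefix: List[int], drafter: NextToken,
                     verifier: NextToken, k: int) -> Tuple[List[int], int]:
    # 1) The drafter proposes k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(prefix + draft))
    # 2) The verifier checks each drafted position. A real system scores
    #    all positions in one parallel forward pass; this loop is sequential
    #    only to keep the sketch short.
    accepted: List[int] = []
    for tok in draft:
        target = verifier(prefix + accepted)
        if tok != target:
            # First mismatch: keep the verifier's token and stop.
            accepted.append(target)
            return accepted, len(accepted) - 1
        accepted.append(tok)
    # All k drafts accepted: the verifier contributes one extra token.
    accepted.append(verifier(prefix + accepted))
    return accepted, k

def generate(prompt: List[int], drafter: NextToken, verifier: NextToken,
             k: int, max_new: int) -> Tuple[List[int], float]:
    out = list(prompt)
    n_accepted = 0
    n_drafted = 0
    while len(out) - len(prompt) < max_new:
        new_tokens, n_acc = speculative_step(out, drafter, verifier, k)
        out.extend(new_tokens)
        n_accepted += n_acc
        n_drafted += k
    rate = n_accepted / n_drafted if n_drafted else 0.0
    return out[:len(prompt) + max_new], rate

def greedy_decode(prompt: List[int], verifier: NextToken,
                  max_new: int) -> List[int]:
    # Baseline: plain one-token-at-a-time decoding with the verifier.
    out = list(prompt)
    for _ in range(max_new):
        out.append(verifier(out))
    return out

if __name__ == "__main__":
    tokens, rate = generate([1, 2, 3], window_drafter, full_verifier,
                            k=4, max_new=20)
    print(tokens, rate)
```

With greedy verification, the output matches what the verifier alone would produce; the drafter only affects how many tokens are accepted per step, and therefore the speed. A drafter that sees less context than the verifier may disagree with it more often as the context grows, which lowers the fraction of drafted tokens that are accepted.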