AI RESEARCH
DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding
arXiv CS.CL
arXiv:2606.02091v1. Block diffusion speculative decoding accelerates LLM inference by having a draft model predict all tokens within a block simultaneously, which the target model then verifies in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective use of the target model's internal knowledge. However, the state-of-the-art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, which limits per-layer expressiveness and hinders further scaling of draft capacity.
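To make the draft-then-verify loop concrete, here is a minimal sketch of the verification step under greedy decoding. It is not the DFlare or DFlash implementation: it shows only the generic speculative decoding acceptance rule (keep the longest prefix of the drafted block that matches the target's own greedy predictions, then append one token from the target). The names `verify_block` and `toy_target` are hypothetical, and `toy_target` is a stand-in for a real target model that would score every block position in a single parallel forward pass.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given the prefix and the drafted block, return
# len(block) + 1 greedy predictions, where entry i is the target's next
# token given prefix + block[:i]. A real target model would compute all
# of these in one parallel forward pass.
TargetFn = Callable[[Sequence[int], Sequence[int]], List[int]]


def verify_block(prefix: Sequence[int],
                 draft_block: Sequence[int],
                 target_next_tokens: TargetFn) -> List[int]:
    """Return the tokens committed after verifying one drafted block."""
    preds = target_next_tokens(prefix, draft_block)
    accepted: List[int] = []
    for i, tok in enumerate(draft_block):
        if preds[i] != tok:
            break  # first disagreement: discard the rest of the block
        accepted.append(tok)
    # The target's own prediction at the first mismatch (or after a fully
    # accepted block) is always kept, so each step commits at least one token.
    accepted.append(preds[len(accepted)])
    return accepted


def toy_target(prefix: Sequence[int], block: Sequence[int]) -> List[int]:
    """Toy stand-in (assumption): next token is (last token + 1) mod 10."""
    ctx = list(prefix)
    preds = []
    for i in range(len(block) + 1):
        preds.append((ctx[-1] + 1) % 10)
        if i < len(block):
            ctx.append(block[i])
    return preds


if __name__ == "__main__":
    # The draft gets the first two tokens right and the third wrong.
    print(verify_block([1, 2, 3], [4, 5, 9, 7], toy_target))
    # The draft gets the whole block right, so one extra token is gained.
    print(verify_block([1, 2, 3], [4, 5, 6, 7], toy_target))
```

In this sketch, the number of tokens committed per target pass grows with how many drafted tokens match, which is why the abstract stresses the capability of the draft model.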