Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction

arXiv:2601.11667v2 Announce Type: replace

Transformer architectures deliver state-of-the-art accuracy via dense full attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling, yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges.
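
To make the complexity contrast concrete, here is a minimal sketch in plain Python. It is not the paper's method. It assumes a non-causal, single-head setting and uses the elu(x) + 1 feature map, one common choice for kernelized linear attention; the abstract does not say which linear attention variant the authors use. All function names are hypothetical. Full attention builds an n x n score matrix, while the linear variant reorders the product so that only a d x d_v summary of the keys and values is ever formed.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def feature_map(m):
    """elu(x) + 1 applied elementwise (assumed choice; always positive)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]

def full_attention(q, k, v):
    """Softmax attention: forms an n x n weight matrix, so cost is O(n^2)."""
    scale = math.sqrt(len(q[0]))
    weights = []
    for row in matmul(q, transpose(k)):      # n x n scores
        scaled = [s / scale for s in row]
        top = max(scaled)                    # subtract the max for stability
        exps = [math.exp(s - top) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)

def linear_attention(q, k, v):
    """Kernelized attention as phi(Q) (phi(K)^T V): cost is linear in n."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)            # d x d_v summary, independent of n
    k_sum = [sum(col) for col in zip(*fk)]   # length-d normalizer term
    num = matmul(fq, kv)                     # n x d_v
    den = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / z for x in row] for row, z in zip(num, den)]

def linear_attention_quadratic(q, k, v):
    """Same quantity computed via the explicit n x n matrix, for checking."""
    fq, fk = feature_map(q), feature_map(k)
    scores = matmul(fq, transpose(fk))
    weights = [[s / sum(row) for s in row] for row in scores]
    return matmul(weights, v)
```

The two linear variants return the same values up to floating-point error, because matrix multiplication is associative; only the order of operations, and therefore the cost, differs. The softmax output generally differs from both, which is one way to see where the accuracy gap between full and linear attention can come from.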