Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
r/LocalLLaMA
•
Generative AI
Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse