Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

r/LocalLLaMA
Generative AI

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse