ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
arXiv cs.LG
arXiv:2605.23081v1. Efficient attention algorithms are critical for mitigating the quadratic cost of attention in long-context workloads. Prior work uses block-scaled quantisation on Blackwell GPUs to move attention computation to 4-bit precision and accelerate inference. However, these techniques cause significant quality degradation in long-context settings.
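To make "block-scaled quantisation" concrete, the following is a minimal sketch in plain Python of simulated 4-bit floating-point quantisation with one shared scale per block. It is not the paper's method or any library's API. The block size of 16, the max-based scale rule, the full-precision scale, and the helper names `round_to_fp4`, `quantise_block`, and `fake_quantise` are all assumptions made for illustration; real hardware formats also store the scale itself in low precision.

```python
# Illustrative sketch only: simulated block-scaled FP4 (E2M1) quantisation.
# Block size, scale rule, and all function names are assumptions,
# not details taken from the paper.

# Magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_GRID[-1]


def round_to_fp4(x):
    """Round x to the nearest representable FP4 value, keeping its sign.

    Values beyond the largest magnitude saturate at +/- FP4_MAX.
    """
    magnitude = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return -magnitude if x < 0 else magnitude


def quantise_block(block):
    """Quantise one block with a single shared scale.

    The scale maps the block's largest magnitude onto FP4_MAX.
    Returns (codes, scale), where codes are FP4 grid values.
    """
    amax = max(abs(v) for v in block)
    scale = amax / FP4_MAX if amax > 0 else 1.0
    codes = [round_to_fp4(v / scale) for v in block]
    return codes, scale


def fake_quantise(values, block_size=16):
    """Quantise and immediately dequantise a list, block by block."""
    out = []
    for start in range(0, len(values), block_size):
        codes, scale = quantise_block(values[start:start + block_size])
        out.extend(c * scale for c in codes)
    return out
```

In this sketch, calling `quantise_block([0.01] * 15 + [6.0])` gives a scale of 1.0 and rounds every small entry to zero, because the single large value sets the scale for the whole block. This shows one way that 4-bit rounding can discard information; it is not presented as the paper's explanation for the degradation it reports.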