AI RESEARCH

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

arXiv cs.LG

arXiv:2605.23081v1. Efficient attention algorithms are critical for mitigating the quadratic cost of attention in long-context workloads. Prior work uses block-scaled quantisation on Blackwell GPUs to move attention computation to 4-bit precision and thereby accelerate inference. However, these techniques cause significant quality degradation in long-context settings.
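
To make the idea of block-scaled 4-bit quantisation concrete, here is a minimal NumPy sketch that simulates it in software. It is illustrative only and is not the method of this paper or of any specific GPU kernel. The assumptions are: values are rounded to the E2M1 FP4 grid, the block size is 16, and the per-block scale is kept in full precision (real hardware formats store the scale in a narrower type). The function name `quantise_block_scaled` is hypothetical.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantise_block_scaled(x, block_size=16):
    """Simulate block-scaled FP4 quantisation of a 1-D array.

    Returns the dequantised values, i.e. what the original data looks like
    after a round trip through 4-bit storage. block_size=16 is an assumption.
    """
    x = np.asarray(x, dtype=np.float64)
    assert x.ndim == 1 and x.size % block_size == 0
    blocks = x.reshape(-1, block_size)

    # One scale per block maps the block's largest magnitude onto the
    # largest FP4 value (6.0). All-zero blocks get a scale of 1.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)
    scaled = blocks / scale

    # Round each scaled magnitude to the nearest point on the FP4 grid.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    quantised = np.sign(scaled) * FP4_GRID[idx]

    # Dequantise by undoing the per-block scaling.
    return (quantised * scale).reshape(x.shape)
```

Because every block shares a single scale and only eight magnitudes are available, one large value in a block forces the smaller values in that block onto a coarse grid. This is one intuition for why quality can degrade when such quantisation is applied to attention over long contexts.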