AI RESEARCH

IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

arXiv cs.LG

arXiv:2511.21513v2 Announce Type: replace

Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax-related path as the dominant bottleneck. This stage incurs a costly dequantize -> softmax -> requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency.
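To make the dequantize -> softmax -> requantize detour concrete, here is a minimal NumPy sketch of a conventional INT8 attention step. It illustrates the baseline bottleneck described above, not the IntAttention method itself. The function names, the symmetric per-tensor scales (`s_q`, `s_k`, `s_v`), and the UINT8 probability format with scale 1/255 are all assumptions made for illustration.

```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric per-tensor quantization (assumed scheme): float -> INT8."""
    return np.clip(np.rint(x / scale), -127, 127).astype(np.int8)

def attention_with_float_softmax(q_i8, k_i8, v_i8, s_q, s_k, s_v):
    """Hypothetical baseline: integer matmuls with a floating-point softmax in between."""
    d = q_i8.shape[-1]

    # Integer matmul: INT8 operands accumulated in INT32.
    logits_i32 = q_i8.astype(np.int32) @ k_i8.astype(np.int32).T

    # Detour, step 1: dequantize the INT32 logits to float.
    logit_scale = np.float32(s_q * s_k / np.sqrt(d))
    logits = logits_i32.astype(np.float32) * logit_scale

    # Detour, step 2: numerically stable softmax in floating point.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    # Detour, step 3: requantize probabilities to UINT8 (assumed scale 1/255).
    p_u8 = np.clip(np.rint(probs * 255.0), 0, 255).astype(np.uint8)

    # Integer matmul again: UINT8 probabilities times INT8 values, INT32 accumulation.
    out_i32 = p_u8.astype(np.int32) @ v_i8.astype(np.int32)

    # Final rescale to float, only so the result can be inspected.
    return out_i32.astype(np.float32) * np.float32(s_v / 255.0)

def reference_attention(q, k, v):
    """Plain floating-point attention, used only for comparison."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return (probs / probs.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)).astype(np.float32) for _ in range(3))
s_q, s_k, s_v = (float(np.abs(t).max()) / 127.0 for t in (q, k, v))

out = attention_with_float_softmax(
    quantize_int8(q, s_q), quantize_int8(k, s_k), quantize_int8(v, s_v),
    s_q, s_k, s_v,
)
max_err = float(np.abs(out - reference_attention(q, k, v)).max())
```

The three "detour" steps are the only part of the function that leaves integer arithmetic. They are the stage the abstract identifies as the dominant cost once the matrix multiplications themselves run in INT8.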