AI RESEARCH
IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
arXiv cs.LG
arXiv:2511.21513v2 Announce Type: replace Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax-related path as the dominant bottleneck. This stage incurs a costly dequantize -> softmax -> requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow that is critical for edge hardware efficiency.
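To make the dequantize -> softmax -> requantize detour concrete, here is a minimal pure-Python sketch of the conventional floating-point path that the abstract identifies as the bottleneck. It is not the IntAttention method. The function names, the example logit scale of 0.05, and the choice of an unsigned 8-bit output with scale 1/255 are all assumptions made for illustration.

```python
import math

def dequantize(q_values, scale):
    """Map integer values back to floats: x is approximately q * scale."""
    return [q * scale for q in q_values]

def softmax(xs):
    """Numerically stable floating-point softmax over one row."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def requantize_unsigned(probs, bits=8):
    """Map probabilities in [0, 1] to unsigned integers (assumed scale: 1 / (2**bits - 1))."""
    qmax = (1 << bits) - 1
    return [min(qmax, max(0, round(p * qmax))) for p in probs]

def softmax_detour(int_logits, logit_scale):
    """Baseline path: integer logits -> float softmax -> integer probabilities."""
    floats = dequantize(int_logits, logit_scale)   # leaves the integer domain
    probs = softmax(floats)                        # floating-point exp and divide
    return requantize_unsigned(probs)              # returns to the integer domain

# Hypothetical row of integer attention logits and an assumed quantization scale.
row = [40, 20, 0, -20]
logit_scale = 0.05
q_probs = softmax_detour(row, logit_scale)
print(q_probs)
```

Each row of attention scores takes two domain conversions plus floating-point exponentials and a division, which is the work a fully integer pipeline aims to avoid.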