AI RESEARCH
Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization
arXiv CS.LG
arXiv:2605.20295v1 Announce Type: new
Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) require fully static quantization for optimal inference efficiency. However, existing post-