AI RESEARCH

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

arXiv CS.LG

ArXi:2605.20295v1 Announce Type: new Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-