AI RESEARCH
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
arXiv CS.LG
arXiv:2605.23078v1. Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead because of their large number of expert parameters. Mixed-precision quantization mitigates this cost by assigning each expert a bit-width according to its importance, which approaches the accuracy-memory Pareto frontier and enables extremely low-bit quantization. However, existing methods estimate importance layer by layer and overlook the router shifts that quantization induces, resulting in suboptimal bit allocation and routing.
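To make the contrast between layer-wise and global allocation concrete, here is a minimal Python sketch of a greedy global allocator. It is not GEMQ's algorithm, which this abstract does not specify; the function name `allocate_bits_globally`, the importance scores, the candidate bit-widths, and the average-bit budget are all hypothetical, and the sketch ignores router shifts entirely.

```python
from typing import List, Sequence


def allocate_bits_globally(
    importance: List[List[float]],
    budget_avg_bits: float,
    choices: Sequence[int] = (1, 2, 3),
) -> List[List[int]]:
    """Assign a bit-width to every expert under one model-wide budget.

    importance[l][e] is a hypothetical importance score for expert e in
    layer l. All experts are ranked together, so an important layer can
    receive more bits than an unimportant one. This is an illustrative
    greedy heuristic, not the method described in the paper.
    """
    low = min(choices)
    if budget_avg_bits < low:
        raise ValueError("budget is below the smallest available bit-width")

    # Start every expert at the lowest precision.
    bits = [[low] * len(row) for row in importance]
    n_experts = sum(len(row) for row in importance)
    total_budget = budget_avg_bits * n_experts
    used = low * n_experts

    # Rank experts across all layers, most important first.
    ranked = sorted(
        ((score, l, e) for l, row in enumerate(importance) for e, score in enumerate(row)),
        key=lambda item: item[0],
        reverse=True,
    )

    # Upgrade each expert to the widest precision that still fits the budget.
    for _, l, e in ranked:
        for b in sorted(choices, reverse=True):
            extra = b - bits[l][e]
            if used + extra <= total_budget:
                bits[l][e] = b
                used += extra
                break
    return bits


# Layer 0 holds the two most important experts, layer 1 the two least.
scores = [[0.9, 0.8], [0.1, 0.2]]
print(allocate_bits_globally(scores, budget_avg_bits=2.0))  # [[3, 3], [1, 1]]
```

With the same average budget of 2 bits, a per-layer scheme would have to spend 4 bits inside each layer, whereas the global ranking moves the spare bits from the unimportant layer to the important one.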