AI RESEARCH

AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization

arXiv CS.LG

arXiv:2606.04980v1 Announce Type: new

Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bit-widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation. For frontier MoE LLMs, the original.
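
To make the memory argument concrete, the following is a minimal sketch of the footprint arithmetic behind per-expert mixed-precision quantization. It is not the AlphaQ allocation method: the function name `expert_memory_bytes`, the expert count, the parameter count per expert, and the bit-width assignment are all hypothetical values chosen for illustration, and the calculation ignores quantization metadata such as scales and zero-points.

```python
def expert_memory_bytes(params_per_expert, bit_widths):
    """Return the bytes needed to store expert weights.

    params_per_expert: number of weights in each expert (assumed equal).
    bit_widths: one bit-width per expert.
    Quantization metadata (scales, zero-points) is deliberately ignored.
    """
    total_bits = sum(params_per_expert * bits for bits in bit_widths)
    return total_bits / 8


# Hypothetical MoE layer: 8 experts with 1,000,000 weights each.
PARAMS_PER_EXPERT = 1_000_000

# Baseline: every expert stored at 16 bits.
uniform_16 = [16] * 8

# Arbitrary illustrative allocation, not derived from any importance score.
mixed = [8, 4, 4, 2, 4, 2, 8, 4]

baseline_bytes = expert_memory_bytes(PARAMS_PER_EXPERT, uniform_16)
mixed_bytes = expert_memory_bytes(PARAMS_PER_EXPERT, mixed)
average_bits = sum(mixed) / len(mixed)

print(f"baseline: {baseline_bytes / 1e6:.1f} MB")
print(f"mixed:    {mixed_bytes / 1e6:.1f} MB (average {average_bits} bits)")
print(f"ratio:    {mixed_bytes / baseline_bytes:.5f}")
```

Under these assumed numbers, the mixed allocation averages 4.5 bits per weight and needs 4.5 MB instead of 16 MB. The open question the abstract raises is how to choose the per-expert bit-widths, which this sketch simply hard-codes.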