AI RESEARCH
AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization
arXiv cs.LG
arXiv:2606.04980v1 Announce Type: new
Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bit-widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation. For frontier MoE LLMs, the original.
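
To make the memory argument concrete, the following minimal sketch computes the expert-weight footprint under a per-expert bit allocation. It is only back-of-the-envelope arithmetic, not the AlphaQ method: the layer size, the expert count, the specific bit-widths, and the helper name `expert_memory_bytes` are all hypothetical, and quantization metadata such as scales and zero-points is ignored.

```python
def expert_memory_bytes(params_per_expert: int, bits_per_expert: list[int]) -> int:
    """Bytes needed for expert weights, given one bit-width per expert.

    Assumption: every expert has the same number of weights, and per-group
    scales / zero-points are ignored.
    """
    total_bits = sum(params_per_expert * bits for bits in bits_per_expert)
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical layer: 8 experts with 1,000,000 weights each.
PARAMS_PER_EXPERT = 1_000_000

# Baseline: every expert stored in 16-bit precision.
uniform_16bit = expert_memory_bytes(PARAMS_PER_EXPERT, [16] * 8)

# Mixed precision: an illustrative allocation, not one taken from the paper.
mixed_bits = [4, 4, 2, 2, 2, 2, 3, 3]
mixed = expert_memory_bytes(PARAMS_PER_EXPERT, mixed_bits)

average_bits = sum(mixed_bits) / len(mixed_bits)

print(f"16-bit baseline: {uniform_16bit:,} bytes")
print(f"mixed precision: {mixed:,} bytes (average {average_bits} bits per weight)")
```

In this toy setting the mixed allocation averages 2.75 bits per weight. How to choose the per-expert bit-widths is the hard part; the sketch simply takes an allocation as given.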