ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

arXiv:2601.21198v2 Announce Type: replace-cross While Mixture-of-Experts (MoE) architectures substantially increase the expressive power of large language models, their prohibitive memory footprint severely impedes practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without resorting to lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system.