
ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

arXiv cs.LG

arXiv:2606.00735v1 Announce Type: cross

In distributed Mixture-of-Experts (MoE) inference, input-dependent token routing interacts with GPU performance variability to create persistent stragglers under synchronized execution, where the slowest GPU determines layer latency. This performance variability is inherent to modern accelerators: manufacturing variation, power limits, and thermal conditions