AI RESEARCH
ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving
arXiv CS.LG
arXiv:2606.00735v1 Announce Type: cross In distributed Mixture-of-Experts (MoE) inference, input-dependent token routing interacts with GPU performance variability to create persistent stragglers under synchronized execution, where the slowest GPU determines layer latency. This performance variability is inherent to modern accelerators: manufacturing variation, power limits, and thermal conditions