Hyperparameter Transfer with Mixture-of-Expert Layers

arXiv:2601.20205v3 Announce Type: replace Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to