Strong Teacher Not Needed? On Distillation in LLM Pretraining

arXiv:2605.23857v1 Announce Type: new Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pre