One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

arXiv:2605.22297v1 Announce Type: cross

Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we