AI RESEARCH

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

arXiv CS.AI

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory than SGD to maintain first- and second-order moments. While recent works such as GaLore, Fira, and APOLLO have proposed state-compressed, memory-efficient variants, a fundamental question remains: what are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question us