Lessons from the Trenches on Reproducible Evaluation of Language Models

ArXi:2405.14782v3 Announce Type: replace Reliable evaluation of language models (LMs) remains an open challenge. Re- searchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. Evaluation difficulties are exacer- bated by the fracturing and siloing of information about conventions and common practices.