Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

arXiv:2605.24213v1 Announce Type: cross

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause.
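
To make the four responsibilities named above concrete (data loading, model invocation, metric computation, result reporting), here is a minimal Python sketch of a toy harness. It is an illustration only and not the five-stage model derived in the study, which the abstract does not spell out. All names (`Example`, `load_data`, `toy_model`, `exact_match`, `run_harness`) are hypothetical, and the dataset and "model" are stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str
    expected: str


def load_data() -> list[Example]:
    # Data loading: a hard-coded toy dataset stands in for a real benchmark.
    # The last reference answer is deliberately wrong so the score is not 1.0.
    return [Example("2+2", "4"), Example("3+3", "6"), Example("1+1", "3")]


def toy_model(prompt: str) -> str:
    # Model invocation: a stand-in "model" that adds two integers.
    left, right = prompt.split("+")
    return str(int(left) + int(right))


def exact_match(predictions: list[str], references: list[str]) -> float:
    # Metric computation: fraction of predictions equal to their reference.
    pairs = list(zip(predictions, references))
    return sum(p == r for p, r in pairs) / len(pairs)


def run_harness(model: Callable[[str], str], examples: list[Example]) -> dict:
    # Orchestration: invoke the model on each example, score, and report.
    predictions = [model(ex.prompt) for ex in examples]
    references = [ex.expected for ex in examples]
    # Result reporting: a plain dict that a caller could print or serialize.
    return {
        "n_examples": len(examples),
        "exact_match": exact_match(predictions, references),
    }


report = run_harness(toy_model, load_data())
print(report)
```

Even in this toy form, each function is a separate place where things can go wrong (malformed data, a failing model call, a mis-specified metric, a lossy report), which is the kind of per-stage classification the study applies to real harness issues.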