The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

arXiv:2605.28700v1

The GSM-Symbolic benchmark (Mirzadeh, 2025) reported consistent performance drops across 25 large language models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this