AI RESEARCH
The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic
arXiv CS.AI
arXiv:2605.28700v1 Announce Type: new
The GSM-Symbolic benchmark (Mirzadeh, 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this