FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

arXiv:2512.20732v2

As LLMs become more capable of reasoning about the physical world, the lack of rigorous benchmarks for evaluating whether they can generate scientifically valid physical models has become a critical gap. Computational mechanics develops and applies mathematical models and numerical methods to predict how physical systems behave under forces, deformation, and constraints. This makes it an ideal foundation for evaluating structured scientific reasoning.