SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

arXiv:2605.22175v1 Announce Type: cross

Evaluating software engineering capabilities has become a core component of assessing modern large language models (LLMs). However, the key bottleneck to further scaling lies not in a scarcity of high-quality solutions but in a lack of high-quality test suites. Test suites are indispensable both for synthesizing program repair trajectories and for providing precise feedback signals in reinforcement learning.