SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

arXiv:2605.30329v1 Announce Type: new Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether large language models can judge the methodological viability of a research idea before expending time and computational resources. We