FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

arXiv:2606.04751v1 Announce Type: new Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we