When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

arXiv:2602.03554v2 Announce Type: replace-cross Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and on Top-K accuracy measured against a single ground-truth answer, an approach that does not capture the open-ended nature of real-world synthesis planning.
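
To make the limitation concrete, here is a minimal sketch of Top-K accuracy scored against a single reference per target. It is an illustration only, not the evaluation code of any particular benchmark: the function name `top_k_accuracy`, the toy data, and the assumption that reactant sets are already canonicalized strings compared by exact match are all hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Fraction of targets whose single reference is among the first k predictions."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        1 for ranked, ref in zip(predictions, references) if ref in ranked[:k]
    )
    return hits / len(references)


# Hypothetical toy data. Each string is a reactant set, assumed to be
# canonicalized upstream so that exact string comparison is meaningful.
predictions = [
    # Target 1 (an anilide): acyl chloride route ranked first, acid route second.
    ["CC(=O)Cl.Nc1ccccc1", "CC(=O)O.Nc1ccccc1"],
    # Target 2 (an ester): acid route ranked first, acyl chloride route second.
    ["CC(=O)O.CCO", "CC(=O)Cl.CCO"],
]
# One recorded answer per target; the second uses an anhydride route.
references = ["CC(=O)O.Nc1ccccc1", "CC(=O)OC(C)=O.CCO"]

top1 = top_k_accuracy(predictions, references, k=1)
top2 = top_k_accuracy(predictions, references, k=2)
```

In this toy case every proposed disconnection for the second target counts as a miss simply because it differs from the one recorded reference, whether or not a chemist would consider it workable. That is the kind of outcome the abstract refers to when it says a single ground truth does not reflect open-ended synthesis planning.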