How much of MLE-Bench's gains are the algorithm vs. better models + more search? [R]

MLE-Bench scores have jumped from 30% to 80% over the last two years. But how much of that is real algorithmic progress, and how much is better base models, shifts in the problem definition, and overfitting? Turns out: not much of it is algorithmic. Once you fix the step budget and the base model, and then evaluate on a different set of tasks, the two-year-old AIDE algorithm matches modern agent/evolutionary search systems.
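
To make "controlling for budget and model" concrete, here is a minimal sketch of what such a comparison harness could look like. It is not the actual evaluation code: `RunConfig`, `compare`, and the `SearchFn` signature are hypothetical names made up for illustration. The only point is that every search algorithm receives the same model, the same step budget, and the same held-out task list, so any score difference comes from the algorithm itself.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RunConfig:
    """Settings held fixed across all algorithms (hypothetical structure)."""
    model: str        # base model identifier, identical for every algorithm
    step_budget: int  # maximum number of search steps per task


# Hypothetical interface: a search algorithm takes a task name and the shared
# config, runs its search, and returns a score for that task.
SearchFn = Callable[[str, RunConfig], float]


def compare(
    algorithms: Dict[str, SearchFn],
    heldout_tasks: List[str],
    config: RunConfig,
) -> Dict[str, float]:
    """Return the mean score per algorithm under one shared config and task set."""
    return {
        name: mean(run(task, config) for task in heldout_tasks)
        for name, run in algorithms.items()
    }
```

You would plug in real runners for each system being compared and call `compare` once with a single `RunConfig`. The frozen dataclass is a small guard against one runner quietly changing the budget or model for the others.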