Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

arXiv:2606.00920v1 Announce Type: cross Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points, and the gap is largest precisely for mid-performing systems. We investigate this accuracy-stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed.
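
To make the gap between the two metrics concrete, the following is a minimal sketch under assumed definitions that the abstract does not spell out: run-level pass rate is taken to be the fraction of all individual runs that pass, and retry-free coverage the fraction of tasks on which every repeated run passes. The function names and the toy `results` data are hypothetical and do not come from the paper.

```python
from typing import Dict, List


def run_level_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of all individual runs that pass (assumed definition)."""
    total_runs = sum(len(runs) for runs in results.values())
    passed_runs = sum(sum(runs) for runs in results.values())
    return passed_runs / total_runs


def retry_free_coverage(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose every repeated run passes (assumed definition)."""
    stable_tasks = sum(all(runs) for runs in results.values())
    return stable_tasks / len(results)


# Hypothetical outcomes: four tasks, four repeated runs each.
results = {
    "task_a": [True, True, True, True],      # always passes
    "task_b": [True, False, True, True],     # flaky
    "task_c": [False, False, False, False],  # always fails
    "task_d": [True, True, False, True],     # flaky
}

pass_rate = run_level_pass_rate(results)   # 10 of 16 runs pass
coverage = retry_free_coverage(results)    # 1 of 4 tasks is fully stable
gap_points = (pass_rate - coverage) * 100  # gap in percentage points

print(f"run-level pass rate: {pass_rate:.3f}")
print(f"retry-free coverage: {coverage:.3f}")
print(f"gap: {gap_points:.1f} percentage points")
```

In this toy data the flaky tasks contribute passing runs to the run-level rate but nothing to coverage, which is one way a mid-performing system could show a larger gap than a system that passes or fails consistently.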