AI RESEARCH
Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks
arXiv CS.AI
arXiv:2606.00920v1 Announce Type: cross Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points, and the gap is largest precisely for mid-performing systems. We investigate this accuracy-stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed.
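To make the distinction between the two metrics concrete, here is a minimal sketch on toy data. It rests on assumed definitions, since the excerpt above does not spell them out: run-level pass rate is taken to be the fraction of all individual runs that pass, and retry-free coverage the fraction of tasks that pass on every repeated run. The function names and the `toy_results` data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def run_level_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of all individual runs that passed (assumed definition)."""
    runs = [ok for outcomes in results.values() for ok in outcomes]
    return sum(runs) / len(runs)


def retry_free_coverage(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that passed on every repeated run (assumed definition)."""
    stable = sum(1 for outcomes in results.values() if all(outcomes))
    return stable / len(results)


# Hypothetical outcomes: four tasks, four repeated runs each.
toy_results = {
    "task_a": [True, True, True, True],      # always passes
    "task_b": [True, False, True, True],     # flaky
    "task_c": [False, False, False, False],  # always fails
    "task_d": [True, True, False, True],     # flaky
}

pass_rate = run_level_pass_rate(toy_results)
coverage = retry_free_coverage(toy_results)
gap_points = (pass_rate - coverage) * 100  # gap in percentage points

print(f"run-level pass rate: {pass_rate:.3f}")
print(f"retry-free coverage: {coverage:.3f}")
print(f"gap: {gap_points:.1f} percentage points")
```

Under these assumptions, flaky tasks count toward the run-level pass rate through their successful runs but contribute nothing to retry-free coverage, which is why a system with many intermittently solved tasks can show a large gap between the two numbers.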