Understanding and Mitigating Premature Confidence for Better LLM Reasoning

arXiv:2605.24396v1 Announce Type: new

Long chains of thought (CoT) from current language models frequently contain logical gaps and unjustified leaps, limiting the gains from additional test-time compute. Improving reasoning quality directly would require process reward models, but the step-level annotations needed to train them are expensive and scarce.