Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection

arXiv:2601.13735v2 Announce Type: replace Probabilistic confidence metrics are increasingly adopted as proxies for reasoning quality in Best-of-N selection, under the assumption that higher confidence reflects higher reasoning fidelity. In this work, we challenge this assumption by investigating whether these metrics truly capture the inter-step causal dependencies necessary for valid reasoning. We