
Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

arXiv CS.AI

arXiv:2605.27752v1. LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but the comparison between them depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, a single probability scale, and a single output format.
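To make the comparison concrete, here is a minimal Python sketch of scoring both signals against the same set of answers. The abstract does not name a calibration metric, so the use of expected calibration error (ECE) with equal-width bins is an assumption, as are the function name `expected_calibration_error`, the bin count, and the toy numbers, which are illustrative only and not results from the paper.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin.

    Hypothetical helper for illustration; the metric choice is an assumption.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one item")

    # Group (confidence, correctness) pairs into equal-width bins over [0, 1].
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data (made up): the same four answers scored by two confidence signals.
correct = [True, True, False, False]
token_prob = [0.92, 0.81, 0.55, 0.35]   # e.g. probability assigned to the answer tokens
verbalized = [0.95, 0.95, 0.92, 0.85]   # e.g. confidence the model states in text

ece_token = expected_calibration_error(token_prob, correct)
ece_verbal = expected_calibration_error(verbalized, correct)
print(f"token-probability ECE: {ece_token:.3f}")
print(f"verbalized ECE:        {ece_verbal:.3f}")
```

Even in this small sketch, the bin count and the way each signal is mapped onto [0, 1] are measurement choices that can shift the comparison, which is the kind of sensitivity the abstract points to.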