CalArena: A Large-Scale Post-Hoc Calibration Benchmark

arXiv:2605.30188v1 Announce Type: cross Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We