Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

ArXi:2606.05122v1 Announce Type: new Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted