CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

arXiv:2606.03650v1 Announce Type: new Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists, and standard public benchmarks cannot be trusted, their items having likely leaked into pre