AI RESEARCH
CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks
arXiv CS.CL
arXiv:2606.03650v1 Announce Type: new Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists and standard public benchmarks cannot be trusted, their items having likely leaked into pre