RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

arXiv:2605.21748v1 Announce Type: new As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complex systems such as conversational chatbots, the volume of generated text can overwhelm human annotation resources. Model developers have therefore begun to rely heavily on auto-evaluation, in which LLMs are themselves used to judge generation quality.