Your LLM-as-judge eval set is too small. Here is the math

About This Tutorial

How many human-labeled examples do you need to calibrate an LLM-as-judge on your task? Most teams answer "enough," which in practice means however many they had time to label. That answer is wrong in a specific, mathematically tractable way. The short version: if your judge's Cohen's kappa against human labels is around 0.6 and you want a 95% confidence interval on that kappa no wider than 0.10, you need approximately 200 paired labels.
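To make the arithmetic behind that kind of estimate concrete, here is a minimal sketch. It rests on assumptions not stated above: the simple large-sample standard error for kappa, SE ≈ sqrt(p_o(1 - p_o) / (n(1 - p_e)^2)), a binary label with balanced classes (chance agreement p_e = 0.5), and a reading of 0.10 as the half-width of the interval (±0.10). The function name `kappa_sample_size` and its parameters are hypothetical, chosen for illustration only.

```python
import math
from statistics import NormalDist


def kappa_sample_size(kappa, half_width, p_e=0.5, confidence=0.95):
    """Rough number of paired labels needed for a CI of +/- half_width on kappa.

    Assumes the simple large-sample approximation
        SE(kappa) ~= sqrt(p_o * (1 - p_o) / (n * (1 - p_e)**2))
    where p_o is observed agreement and p_e is chance agreement.
    p_e = 0.5 corresponds to two balanced binary raters (an assumption).
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Invert kappa = (p_o - p_e) / (1 - p_e) to recover observed agreement.
    p_o = kappa * (1 - p_e) + p_e
    # Variance of kappa for a single paired label (n = 1).
    unit_variance = p_o * (1 - p_o) / (1 - p_e) ** 2
    # Solve z * sqrt(unit_variance / n) <= half_width for n.
    return math.ceil(z ** 2 * unit_variance / half_width ** 2)


if __name__ == "__main__":
    print(kappa_sample_size(0.6, 0.10))  # 246
    print(kappa_sample_size(0.6, 0.05))  # 984
```

Under these assumptions the answer lands in the low hundreds, the same order as the figure quoted above. Reading 0.10 as the total width of the interval (±0.05) roughly quadruples it, and skewed class balance or a different variance formula will shift the number again, so treat the output as a planning estimate and not an exact requirement.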