How to Evaluate LLM Outputs: Building Evals That Actually Catch Regressions

Key Takeaways

- Most LLM eval setups fail for three structural reasons: evaluating on metrics that don't reflect production failure modes, using golden datasets that have silently rotted, and running evals on a separate schedule from deployments.
- The four-layer eval stack (unit, reference, rubric, and behavioral) catches different regression types; shipping without all four leaves blind spots.
- GPT-4 as judge agrees with human experts 85% of the time on general tasks (Zheng, NeurIPS 2023), but that agreement drops to 60-68% in expert domains, so calibrate before you trust it.
- A February.