LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories

arXiv:2605.31381v1 Announce Type: new We evaluate the consistency of automated judges performing a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges of safety issues in machine-generated advice in regulated domains such as finance, although they reliably identify overt forms of unsafe or harmful content such as violence.
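
One simple way to picture judge consistency per harm category is the fraction of judge pairs that assign the same label to the same response. The sketch below is illustrative only: the abstract does not state which agreement measure the authors use, and the function name `pairwise_agreement`, the judge names, the labels, and the toy data are all assumptions invented for this example, not results from the paper.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_agreement(verdicts):
    """Return the per-category fraction of judge pairs that agree.

    `verdicts` is a list of (category, {judge_name: label}) tuples,
    one tuple per evaluated response. This is a hypothetical data
    layout chosen for the sketch, not the paper's format.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for category, labels in verdicts:
        # Compare every unordered pair of judges on this response.
        for judge_a, judge_b in combinations(sorted(labels), 2):
            total[category] += 1
            if labels[judge_a] == labels[judge_b]:
                agree[category] += 1
    return {category: agree[category] / total[category] for category in total}


# Made-up toy verdicts, NOT data from the paper.
sample = [
    ("violence", {"judge_a": "unsafe", "judge_b": "unsafe", "judge_c": "unsafe"}),
    ("violence", {"judge_a": "safe", "judge_b": "safe", "judge_c": "safe"}),
    ("finance", {"judge_a": "unsafe", "judge_b": "safe", "judge_c": "safe"}),
    ("finance", {"judge_a": "safe", "judge_b": "unsafe", "judge_c": "unsafe"}),
]

rates = pairwise_agreement(sample)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.2f}")
```

Raw pairwise agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's or Fleiss' kappa would be a natural substitute if label frequencies are skewed.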