Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

arXiv:2605.21602v1

Many safety and alignment failures of large language models (LLMs) arise in out-of-distribution (OOD) situations: unusual prompt or response patterns that model developers did not foresee. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by