A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

arXiv:2606.01992v1 Announce Type: cross Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and, therefore, cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We.