Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

arXiv:2606.00871v1 Announce Type: cross Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments that include disagreement and explicit non-response.