When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

arXiv:2605.29025v1

Federal agencies are deploying large language models (LLMs) to categorize public comment corpora, where the model's organization of the record shapes what policymakers see and which arguments register. Standard evaluation, anchored on stance accuracy against a small validated set, cannot detect when different models produce materially different categorizations of the same public input.
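
To make the evaluation gap concrete, here is a minimal sketch in Python. It is a toy illustration and not the paper's method: the six stance labels, the two-item validated set, and the helper names `accuracy` and `agreement` are all hypothetical. Both models score perfectly on the validated items yet agree on only half of the full corpus.

```python
def accuracy(pred, gold):
    """Fraction of validated items whose predicted label matches the gold label."""
    return sum(pred[i] == label for i, label in gold.items()) / len(gold)


def agreement(pred_a, pred_b):
    """Fraction of all items on which two models assign the same label."""
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)


# Hypothetical stance labels for six comments from two models (illustrative only).
model_a = ["support", "oppose", "support", "oppose", "support", "oppose"]
model_b = ["support", "oppose", "oppose", "support", "oppose", "oppose"]

# Hypothetical validated set: only comments 0 and 1 carry gold labels.
gold = {0: "support", 1: "oppose"}

acc_a = accuracy(model_a, gold)
acc_b = accuracy(model_b, gold)
agree = agreement(model_a, model_b)

print(f"accuracy A: {acc_a:.2f}, accuracy B: {acc_b:.2f}, agreement: {agree:.2f}")
```

Raw agreement is the simplest possible comparison. A chance-corrected statistic such as Cohen's kappa would be a natural refinement, but that choice is an assumption here and is not stated in the abstract.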