Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare

arXiv:2509.07961v2

We develop new experimental paradigms for measuring welfare in language models. We compare models' verbal reports about their preferences with the preferences they express through behavior when navigating a virtual environment and selecting conversation topics. We also test how costs and rewards affect behavior and whether responses to a eudaimonic welfare scale (measuring states such as autonomy and purpose in life) are stable across semantically equivalent prompts. Overall, we observed a notable degree of mutual support between our measures.