Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

Summary: Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of pre-deployment evaluations alone. One approach to making such claims is to take a cognitive perspective, in which we interpret the AI's behaviour in terms of latent cognitive constructs such as motivations, intentions, and goals. Because the same behaviour may be compatible with a range of underlying cognition (such as scheming, fitness-seeking, or aligned motivations), inferring cognition from a single behavioural snapshot can be difficult.