Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

arXiv:2605.24216v1 Announce Type: cross Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience.