Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

arXiv:2606.03962v1 Announce Type: new

Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet modern applications such as language model fine-tuning and scientific discovery demand diversity. Existing remedies, such as entropy regularization or diversity bonuses, often require fragile trade-offs that sacrifice performance for stochasticity, or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is naturally understood as the rational response to uncertainty in the reward.