AI RESEARCH

Not All Transitions Matter: Evidence from PPO

arXiv CS.AI

Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidden problem. Each state in a rollout is the direct output of the previous one, causally chained together by the agent's own actions. They carry overlapping information, and the gradient signal the network receives ends up far repetitive than the batch size suggests.