ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

arXiv:2602.02150v2 Announce Type: replace-cross Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior work