Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

arXiv:2605.29782v1 Announce Type: cross Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable