Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM

ArXi:2601.09001v4 Announce Type: replace Deploying LLMs raises two coupled challenges: (1) monitoring--estimating where a model underperforms as traffic and domains drift--and (2) improvement--prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-$k$ logprobs) and summarize it with different statistics.