AI RESEARCH

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

arXiv CS.LG

arXiv:2605.26844v1 Announce Type: new On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value.
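
To make the distinction between raw disagreement and KL reduction concrete, the following is a minimal sketch, not the paper's implementation. It assumes the diagnostic compares the teacher-student KL at one fixed context before and after a student update. The KL direction (student relative to teacher), the function names, and all the distributions are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    # eps guards against log(0); terms with p_i == 0 contribute nothing.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def kl_reduction(teacher, student_before, student_after):
    """Drop in KL(student || teacher) at one fixed context after an update.

    The direction of the KL is an assumption made for this sketch.
    """
    before = kl_divergence(student_before, teacher)
    after = kl_divergence(student_after, teacher)
    return before - after


# Hypothetical next-token distributions over a 3-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student_before = [0.1, 0.2, 0.7]

# Two hypothetical outcomes of the same update budget at two token positions
# that start with identical raw disagreement.
student_after_a = [0.5, 0.2, 0.3]    # student moves most of the way to the teacher
student_after_b = [0.12, 0.2, 0.68]  # student barely moves

raw_kl = kl_divergence(student_before, teacher)
reduction_a = kl_reduction(teacher, student_before, student_after_a)
reduction_b = kl_reduction(teacher, student_before, student_after_b)

print(f"raw KL at both positions: {raw_kl:.3f}")
print(f"KL reduction, position A: {reduction_a:.3f}")
print(f"KL reduction, position B: {reduction_b:.3f}")
```

In this toy setup both positions have the same raw KL, yet the reduction differs by an order of magnitude, which is the sense in which raw disagreement can be a coarse proxy for how much a token's teacher signal is actually learned.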