Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum
arXiv CS.LG
arXiv:2605.20196v1 Announce Type: cross We investigate the hypothesis that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than by token-frequency tails alone. We work with a suffix-automaton representation of text corpora and define a data-intrinsic global-KL predictive contribution spectrum, in which each state contributes its empirical mass multiplied by the KL divergence of its next-token distribution from a global next-token baseline.
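The "mass times KL deviation" definition can be illustrated with a minimal Python sketch. This is not the paper's implementation: as a simplifying assumption, fixed-length contexts stand in for suffix-automaton states, the global baseline is taken to be the empirical unigram distribution of next tokens, and the name `contribution_spectrum` and its `order` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def contribution_spectrum(tokens, order=2):
    """Return (context, mass * KL) pairs sorted by contribution, largest first.

    Assumption: each distinct length-`order` context plays the role of a
    "state"; the abstract's suffix-automaton states are not built here.
    """
    next_tokens = tokens[order:]
    total = len(next_tokens)
    if total == 0:
        return []

    # Global next-token baseline (assumed: empirical unigram distribution).
    global_p = {t: c / total for t, c in Counter(next_tokens).items()}

    # Next-token counts observed after each context.
    ctx_counts = defaultdict(Counter)
    for i in range(order, len(tokens)):
        ctx_counts[tuple(tokens[i - order:i])][tokens[i]] += 1

    spectrum = []
    for ctx, counts in ctx_counts.items():
        n = sum(counts.values())
        mass = n / total  # empirical mass of this state
        # KL(p(. | ctx) || global baseline), in nats.
        kl = sum((c / n) * math.log((c / n) / global_p[t])
                 for t, c in counts.items())
        spectrum.append((ctx, mass * kl))

    spectrum.sort(key=lambda item: item[1], reverse=True)
    return spectrum


if __name__ == "__main__":
    for ctx, contribution in contribution_spectrum(list("abababab"), order=1):
        print(ctx, round(contribution, 4))
```

Under these assumptions the contributions sum to the empirical mutual information between context and next token, so the sorted list can be read as a spectrum whose leading entries account for most of the predictable structure in the toy sequence.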