Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research

arXiv:2605.29249v1 Announce Type: cross Many applications require statistically valid inference across many related tasks, while using only a handful of high-quality labels per hypothesis. In AI evaluation, these tasks may correspond to model behaviors across prompts, subgroups, or hypotheses; in social science surveys, they may correspond to related questions, populations, or measurement conditions.