What's the theoretical basis for using llm consensus as a probability estimator for real world events [R]

This is a genuine technical question here. I've been looking at systems that use an ensemble of ai models to generate probability estimates for open ended real world events. The claim is that consensus across multiple models produces calibrated estimates than any single model. this makes sense intuitively and has parallels to ensemble methods in traditional ml. But I'm wondering about the theoretical underpinnings carefully. The standard ensemble argument relies on errors being somewhat uncorrelated across models.