Infrastructure efficiency might matter more than benchmark scores when picking a model

Spent way too long comparing models on benchmarks and prompt tests. Then Zai put out a technical writeup about the network running GLM-5.1 inference, and it changed how I look at pricing.

They swapped the GPU cluster network to something called ZCube. Same hardware, same model. Switch costs dropped by a third, throughput went up 15%, and first-token latency came down 40%. No software changes.

Standard ROFT topology creates traffic jams when inference traffic patterns are asymmetric, which is what happens with Prefill-Decode disaggregation.
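To get a feel for what those numbers could mean for pricing, here's a rough back-of-envelope sketch. It is not from the writeup: `network_share` (the fraction of total cluster cost that goes to switches) is a made-up parameter, and the model assumes everything else in the cluster costs the same before and after. Only the one-third switch cost cut and the 15% throughput gain come from the figures above.

```python
def relative_cost_per_token(network_share, switch_cost_cut=1 / 3, throughput_gain=0.15):
    """Cost per token after the network swap, relative to a baseline of 1.0.

    network_share is a hypothetical input: the fraction of total cluster
    cost spent on switches. It is an assumption, not a published figure.
    """
    # Non-network cost is assumed unchanged; switch cost shrinks by the cut.
    new_cluster_cost = (1 - network_share) + network_share * (1 - switch_cost_cut)
    # The same cluster now serves more tokens per unit time.
    return new_cluster_cost / (1 + throughput_gain)


# Try a few guesses for how much of the cluster bill is switches.
for share in (0.05, 0.10, 0.20):
    ratio = relative_cost_per_token(share)
    print(f"network share {share:.0%}: cost per token x{ratio:.3f}")
```

Under these assumptions most of the per-token saving comes from the throughput gain, since it applies to the whole cluster bill, while the switch savings only apply to whatever slice the network actually is.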