Zai replaced the network architecture running GLM-5.1 inference and the gains are pretty wild
r/LocalLLaMA
AI Hardware
Been following the infrastructure side of AI lately and stumbled on this from Zai. They upgraded the network architecture on a thousand-GPU cluster running GLM-5.1 coding inference from the standard ROFT setup to something they built called ZCube, developed with Tsinghua University and HarnetsAI.

The numbers from production:

- Switch and optical module costs down 33%
- GPU inference throughput up 15%
- P99 tail latency on first token dropped 40.6%

Same GPUs, same software stack, same model. Just the network architecture changed.

The actual problem they were solving is interesting.