AI RESEARCH

WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

arXiv CS.LG

arXiv:2512.09472v2

Deploying multiple models within shared GPU clusters is a key strategy for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems improve GPU utilization at the cost of degraded inference performance, particularly time-to-first-token (TTFT). We attribute this degradation to these systems' lack of awareness of future workload characteristics. Yet recent analyses have shown that real-world LLM serving workloads are strongly periodic and predictable over the long term.