AI RESEARCH
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
arXiv cs.LG
arXiv:2511.04791v2. Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode. Existing approaches either (1) aggregate both phases on shared GPUs, where interference between prefill and decode degrades Time-Between-Tokens (TBT); or (2) disaggregate the two phases across GPUs, which improves latency but wastes resources through duplicated models and KV cache transfers.