AI RESEARCH

Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving

arXiv cs.LG

arXiv:2606.01839v1 Announce Type: cross LLM-based agents resolve a user task through many turns of dependent inference and tool calls, producing a workload whose total cost is unknown when the task arrives. Existing multi-turn systems keep the turn as the scheduling unit and decide, turn by turn, whether to disaggregate prefill from decode. That decision rests on the turn's decode length, tool behavior, and KV-cache growth. None of these quantities is observable when the scheduler must act, so the system is forced to predict them.