TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

arXiv:2606.01498v1 Announce Type: cross

Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through natural language and tools, it remains unclear whether they can conduct reliable time series analysis across multi-turn conversations. Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows where user goals evolve, agents must build on prior analyses, and.