Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

arXiv:2606.02357v1 Announce Type: cross Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative ``thinking with images'' agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning.