Distributed Tracing for LLM Agents: When MCP Makes Tool Calls Observable

How application observability extends to stochastic agent loops, and why the tool boundary matters.

Production failures in LLM systems are often misattributed to the model. In practice, many incidents live in the action layer: a downstream API that times out, a tool that returns a business error inside a successful RPC, a subprocess the host spawned but never joined to the same trace. Standard logs capture completions; they rarely preserve the causal chain: decision → tool invocation → observation → next decision. This article is about that gap.