Why do we benchmark quants on perplexity and prose but never on tool call validity?

r/LocalLLaMA

The mixed-precision quant discussion here lately (the MoE-aware stuff that keeps shared experts and the edge layers at higher precision) is great, but it's almost all measured against perplexity and general output quality. What I never see measured is structured output: tool call JSON, function schemas, constrained formats. My intuition, and I'd like to be wrong, is that those degrade earlier than prose does. A model at Q4_K_M can still write a perfectly readable paragraph while quietly producing JSON that's a brace short, or hallucinating a field name.
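To make "tool call validity" concrete, here's a minimal sketch of the kind of check I have in mind. Everything in it is an assumption for illustration: the `get_weather` tool, its schema, and the `{"name": ..., "arguments": ...}` call shape are made up, not taken from any particular runtime. It only uses the Python standard library and separates the two failure modes above (broken JSON vs. a wrong field name).

```python
import json

# Hypothetical tool schema, invented purely for illustration.
TOOL_SCHEMA = {
    "name": "get_weather",
    "required": {"city"},
    "allowed": {"city", "units"},
}


def check_tool_call(raw: str, schema: dict = TOOL_SCHEMA) -> str:
    """Classify one raw model output as 'ok' or a failure category."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        # e.g. output that is a brace short
        return "invalid_json"
    if not isinstance(call, dict) or call.get("name") != schema["name"]:
        return "wrong_tool"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return "bad_arguments"
    if schema["required"] - set(args):
        return "missing_field"
    if set(args) - schema["allowed"]:
        # e.g. a hallucinated field name
        return "unknown_field"
    return "ok"


def validity_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass every check."""
    if not outputs:
        return 0.0
    return sum(check_tool_call(o) == "ok" for o in outputs) / len(outputs)
```

Run the same prompts through each quant, feed the raw completions to `validity_rate`, and compare that number next to perplexity. Keeping the per-category counts from `check_tool_call` would also show whether a quant fails on syntax or on field names first.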