I gave ChatGPT the same task every month for a year. The "dumber" model won.

I run a tiny automation blog, so I test this stuff more than is healthy. Once a month I handed the newest OpenAI model the exact same prompt: build me a 7-step workflow to triage my inbox. Then I scored it on one thing. Did it run without me babysitting it? Here's what got me. The bigger models wrote nicer answers. Longer, confident, full of "and here are six edge cases to consider." The cheaper, faster ones wrote stuff that actually worked on the first try more often. Turns out reasoning power was almost never my bottleneck. I was.