Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B

I wanted to know how much of a coding agent's performance comes from the model and how much comes from the harness, so I vibed a setup that lets me test multiple agentic harness/model combinations on the same task. All the images above come from the same model, each with a different harness. I'm still working on automated, metric-based evaluation to replace subjective opinion. One thing I noticed that isn't visible in the images: opencode can search the internet by default, which made its results way better on some tasks.
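
For anyone wondering what "same task, multiple harness/model combinations" could look like in practice, here is a minimal sketch. It is not the actual setup from this post. Every name in it (`run_matrix`, `fake_harness`, `RunResult`, the `harness-a`/`model-x` labels) is hypothetical, and the harness commands are placeholders that just echo their inputs, since the real CLI invocations for each agent are not shown here. The idea is only that each (harness, model) pair gets the same prompt and its own clean working directory.

```python
import itertools
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List

# Builds the argv for one harness, given a model id and the task prompt.
CommandBuilder = Callable[[str, str], List[str]]


@dataclass
class RunResult:
    harness: str
    model: str
    returncode: int
    stdout: str
    workdir: Path


def fake_harness(label: str) -> CommandBuilder:
    """Placeholder harness (assumption): echoes its inputs instead of
    calling a real agent CLI. Swap in the real command for each tool."""
    def build(model: str, task: str) -> List[str]:
        script = "import sys; print(' | '.join(sys.argv[1:]))"
        return [sys.executable, "-c", script, label, model, task]
    return build


def run_matrix(
    harnesses: Dict[str, CommandBuilder],
    models: List[str],
    task: str,
    base_dir: Path,
    timeout_s: int = 600,
) -> List[RunResult]:
    """Run the same task once per (harness, model) pair, each in its own directory."""
    results = []
    for (name, build), model in itertools.product(harnesses.items(), models):
        workdir = base_dir / f"{name}__{model}"
        workdir.mkdir(parents=True, exist_ok=True)
        proc = subprocess.run(
            build(model, task),
            cwd=workdir,            # isolate each run's output files
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        results.append(
            RunResult(name, model, proc.returncode, proc.stdout.strip(), workdir)
        )
    return results


if __name__ == "__main__":
    harnesses = {
        "harness-a": fake_harness("harness-a"),
        "harness-b": fake_harness("harness-b"),
    }
    with tempfile.TemporaryDirectory() as tmp:
        for r in run_matrix(harnesses, ["model-x"], "build a landing page", Path(tmp)):
            print(r.harness, r.model, r.returncode, r.stdout)
```

Keeping a separate working directory per run makes it easier to compare outputs side by side afterwards, and it would also be the natural place to hook in automated scoring later. Differences like default internet access would still need to be controlled per harness, which this sketch does not attempt.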