Claude Opus 4.8 benchmark numbers vs GPT-5.5 are kinda concerning

Not trying to start a war here, but I keep both subscriptions running and regularly test the models against each other for work. Anthropic dropped Opus 4.8 today, and some of these benchmark gaps are hard to ignore. SWE-Bench Pro: Opus 4.8 at 69.2% vs GPT-5.5 at 58.6%. Humanity's Last Exam (no tools): 49.8% vs 41.4%. Knowledge work (GDPval): 1890 vs 1769. Agentic financial analysis: 53.9% vs 51.8%. GPT-5.5 still wins on terminal coding (78.2% vs 74.6%), which honestly is where I get the most value day to day, so it's not all bad news. But the SWE-Bench Pro gap going the other way is big: 10.6 points.
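
For anyone who wants the gaps spelled out, here's a quick Python sketch that just subtracts the scores quoted above. The `scores` dict and the `gaps` helper are names I made up for illustration, and the only data in it is what's in this post. Note that GDPval isn't a percentage, so its gap is in raw score units and isn't comparable to the others.

```python
# Scores as quoted in the post: (Opus 4.8, GPT-5.5).
# All are percentages except GDPval, which is a raw score.
scores = {
    "SWE-Bench Pro": (69.2, 58.6),
    "Humanity's Last Exam (no tools)": (49.8, 41.4),
    "GDPval": (1890, 1769),
    "Agentic financial analysis": (53.9, 51.8),
    "Terminal coding": (74.6, 78.2),
}


def gaps(table):
    """Return {benchmark: Opus score minus GPT score}, rounded to one decimal."""
    return {name: round(opus - gpt, 1) for name, (opus, gpt) in table.items()}


for name, gap in gaps(scores).items():
    # A positive gap means Opus 4.8 leads; a negative one means GPT-5.5 leads.
    leader = "Opus 4.8" if gap > 0 else "GPT-5.5"
    print(f"{name}: {leader} ahead by {abs(gap)}")
```

Keep in mind a point gap on one benchmark doesn't mean the same thing as the same gap on another, so this is only a rough way to eyeball the differences.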