Step 3.7 Flash open weights dropped TODAY and the agent reliability numbers are actually interesting

Read this release today. Some crazy numbers. The tau2-bench number is 98% across all difficulty levels. That is the one that got me, because usually these releases post a strong easy score and then quietly die at hard difficulty. This one claims it holds. For multi-step agent work that actually matters more than most benchmarks. A model that drifts on step 4 of a 6-step chain is a debugging nightmare regardless of what its SWE score looks like. Raw capability is mid: Toolathlon at 49.5, GDPval at 45.8. So this is clearly a reliability play, not a frontier capability play.
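
Quick back-of-envelope on why per-step reliability matters so much for chains. This is only a sketch: it assumes every step succeeds or fails independently, which real agent runs probably don't, and the per-step rates below are made-up illustration values, not numbers from the release. (The 98% above is a benchmark score, not a per-step rate.) `chain_success` is just a helper name I picked.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every step in a chain succeeds.

    Assumption: steps are independent, so the chain succeeds with
    probability per_step ** steps. Real agent failures are often
    correlated, so treat this as a rough intuition, not a prediction.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Hypothetical per-step success rates, chosen only for illustration.
for p in (0.90, 0.95, 0.98):
    print(f"per-step {p:.2f} -> 6-step chain {chain_success(p, 6):.3f}")
```

Under that independence assumption, small per-step gaps compound fast over 6 steps, which is why a model that holds steady at hard difficulty is more interesting for agent work than one with a higher raw score.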