Someone did an audit on the new DeepSWE, the results aren't pretty

While this post on the DeepSWE Benchmark GitHub focuses mainly on DeepSeek failing in many places where it shouldn't, it also exposes many problems with how the benchmark was conducted. It seems the benchmark was rushed out the door and still needs a lot of work before it can be considered a reliable reference for the quality of the models they benchmarked.

submitted by /u/pneuny