Someone audited the new DeepSWE benchmark, and the results aren't pretty
r/singularity
Open Source AI
AI Research
While this post on the DeepSWE Benchmark GitHub mainly focuses on DeepSeek failing in many places where it shouldn't, it also reveals many problems with how the benchmark was conducted. The benchmark seems to have been rushed out the door and still needs a lot of work before it can be considered a reliable reference for the quality of the benchmarked models.

submitted by /u/pneuny