Datacurve’s DeepSWE Exposes a Weird New Problem With AI Coding Leaderboards
The Neuron
Datacurve’s DeepSWE benchmark crowns GPT-5.5 as the top coding agent, but the real story is bigger than one leaderboard. Its findings suggest AI coding benchmarks may be leaking answers, rejecting valid solutions, and suppressing useful agent behavior, just as enterprises start betting real engineering workflows on them.