Datacurve’s DeepSWE Exposes a Weird New Problem With AI Coding Leaderboards

Datacurve’s DeepSWE benchmark crowns GPT-5.5 as the top coding agent, but the real story is bigger than one leaderboard. Its findings suggest AI coding benchmarks may be leaking answers, rejecting valid solutions, and suppressing useful agent behavior, just as enterprises start betting real engineering workflows on them.