Skip to content

Study reveals half of AI-generated code fails human review despite passing tests

Automated benchmarks can't catch what developers see. A shocking report reveals AI's hidden code flaws—and why they're slipping through the cracks.

This picture contains a box which is in red, orange and blue color. On the top of the box, we see a...
This picture contains a box which is in red, orange and blue color. On the top of the box, we see a robot and text written as "AUTOBOT TRACKS". In the background, it is black in color and it is blurred.

Study reveals half of AI-generated code fails human review despite passing tests

A new study by research organisation METR has exposed major gaps in AI-generated code quality. Despite passing automated tests, nearly half of the solutions produced by leading models would be rejected by human developers. The findings raise questions about the reliability of popular coding benchmarks like SWE-bench Verified.

The study evaluated 296 AI-generated code contributions from five models: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and GPT-5. Four experienced developers reviewed the submissions and found that around 50% failed to meet real-world standards, even after clearing automated checks. Rejections fell into three categories: poor code quality, damage to existing functionality, and fundamental errors.

A deeper analysis revealed that many failures stemmed from basic functional flaws rather than just stylistic issues. While earlier models like Claude 3.7 Sonnet produced more outright test failures, later versions such as Claude 4 Opus shifted problems toward subtler quality issues. GPT-5 lagged behind Anthropic's models in overall code quality.

The research also uncovered a sevenfold overestimation in long-term AI performance when using time-horizon methodology. Claude 4 Opus excelled at spotting logical errors (85% accuracy) but struggled with edge cases compared to Claude 4.5 Sonnet (92% vs. 87%). Both models overvalued trivial fixes by 12-15% while missing deeper integration flaws, leading to 22% more manual rejections.

METR concluded that SWE-bench Verified, a widely used benchmark, does not accurately reflect real-world AI capabilities. The study suggests automated tests alone are insufficient for judging code readiness.

The findings highlight a disconnect between benchmark results and practical software development. Nearly half of AI solutions passing SWE-bench Verified would still face rejection by maintainers. METR's report underscores the need for more rigorous evaluation methods before AI-generated code enters production.

Read also:

Latest