March 29, 2026 · Research · Benchmark · Coding

SlopCodeBench: First Benchmark Measuring How Coding Agents Degrade Over Iterative Tasks

SlopCodeBench is a new benchmark that evaluates coding agents the way real software actually gets built: through repeated requirement changes and extensions. Unlike single-shot benchmarks such as SWE-Bench, SlopCodeBench gives agents an initial spec, then forces them to extend their own code again and again as new requirements arrive, across 93 checkpoints in total.
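The checkpoint protocol can be sketched as a loop that feeds each new spec to the agent and validates the resulting workspace before moving on. This is an illustrative sketch only; the names (`Checkpoint`, `run_agent`, `evaluate`) are hypothetical and do not reflect the repository's actual API.

```python
# Hypothetical sketch of an iterative checkpoint-evaluation loop.
# All names here are illustrative, not SlopCodeBench's real interface.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Checkpoint:
    spec: str                                  # new/extended requirements
    tests: List[Callable]                      # validators for the workspace

def evaluate(checkpoints, run_agent, workspace):
    """Give the agent each checkpoint's spec in order, forcing it to
    extend the code it produced for earlier checkpoints, and record
    per-checkpoint pass/fail."""
    results = []
    for cp in checkpoints:
        workspace = run_agent(cp.spec, workspace)  # agent edits its own prior code
        results.append(all(test(workspace) for test in cp.tests))
    return results
```

The key design point is that the workspace is carried forward between checkpoints, so early shortcuts and duplicated code compound rather than being reset as they would be in a single-shot benchmark.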

The results are sobering: no agent solves any of the 20 problems end-to-end across 11 tested models, and the highest checkpoint solve rate is just 17.2%. More concerning, code quality degrades steadily: erosion rises in 80% of trajectories and verbosity increases in 89.8%. Agent-generated code is 2.2x more verbose and markedly more eroded than equivalent open-source repositories.
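A verbosity ratio like the 2.2x figure above could be approximated by comparing effective line counts against a human-written reference implementation. This is a deliberately crude illustration; SlopCodeBench's actual verbosity and erosion metrics may be defined quite differently.

```python
# Crude verbosity ratio: non-blank, non-comment lines in agent code
# vs. a human reference. Illustrative only; the benchmark's real
# metric may differ.
def loc(source: str) -> int:
    """Count effective lines in a Python-like source string,
    ignoring blank lines and full-line comments."""
    return sum(
        1 for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )

def verbosity_ratio(agent_src: str, reference_src: str) -> float:
    """>1.0 means the agent's solution is more verbose than the reference."""
    return loc(agent_src) / max(loc(reference_src), 1)
```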

Perhaps most striking: quality-aware prompts that instruct agents to write clean code reduce initial verbosity and erosion but do not slow the degradation rate, improve pass rates, or reduce cost. The degradation appears to be a fundamental property of iterative agent coding, not a prompting failure.

SlopCodeBench is released as an open, community-driven evaluation primitive with a dedicated website and MIT-licensed GitHub repository. It supports evaluation of Claude Code, OpenAI models, and Google models with configurable extended thinking levels.

For anyone deploying coding agents on real projects with evolving requirements, this benchmark provides the first empirical evidence of how agent code quality changes over sustained development — and the news is not good.

Paper: [arxiv.org/abs/2603.24755](https://arxiv.org/abs/2603.24755) | GitHub: [github.com/SprocketLab/slop-code-bench](https://github.com/SprocketLab/slop-code-bench) | Website: [scbench.ai](https://www.scbench.ai/)
