April 14, 2026 · Research · Agents · Monitoring

CodeTracer: Finally, a Way to Debug AI Agents That Actually Debug Code

Code agents are getting powerful. They can fix bugs, refactor, interact with terminals. But when they fail, good luck figuring out why. An early misstep cascades through parallel tool calls and multi-stage workflows into a mess of hidden error chains. You know it went wrong somewhere. You just cannot find where.

CodeTracer from Nanjing University and Kuaishou Technology tackles this head-on. It is a tracing architecture that reconstructs the full state transition history of a code agent as a hierarchical trace tree with persistent memory. Then it performs failure onset localization, pinpointing exactly where the agent first went off track and how that error cascaded downstream.
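The paper's implementation isn't reproduced in this post, but the core idea can be sketched: represent the agent's run as a tree of stages and steps, then walk it in execution order to find the earliest node whose state transition failed. Everything below (the `TraceNode` class, the `ok` flag, the toy trajectory) is a hypothetical illustration of failure onset localization, not CodeTracer's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class TraceNode:
    """One stage or step in the agent's run (hypothetical structure)."""
    name: str
    ok: bool = True              # did this node's state transition succeed?
    children: list["TraceNode"] = field(default_factory=list)

def first_failure(node: TraceNode, path=()):
    """Depth-first, execution-order walk: return the path to the
    earliest failed node -- the failure onset."""
    path = path + (node.name,)
    if not node.ok:
        return path
    for child in node.children:
        hit = first_failure(child, path)
        if hit:
            return hit
    return None

# Toy trajectory: the agent misparses an error message early on,
# and the mistake cascades into the patch and test stages.
run = TraceNode("run", children=[
    TraceNode("locate", children=[
        TraceNode("read_file"),
        TraceNode("parse_error_msg", ok=False),  # first wrong turn
    ]),
    TraceNode("patch", ok=False),  # downstream cascade
    TraceNode("test", ok=False),
])

print(first_failure(run))  # ('run', 'locate', 'parse_error_msg')
```

The point of the tree walk is that it surfaces the *first* wrong turn rather than the loudest downstream error, which is exactly the distinction the paper's localization task targets.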

The team built CodeTraceBench, a large-scale benchmark from executed trajectories of four widely used code agent frameworks across bug fixing, refactoring, and terminal interaction tasks. Every trajectory has supervision at both the stage and step levels for failure localization.
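The post doesn't describe CodeTraceBench's on-disk format, so as a purely hypothetical illustration, a trajectory record with supervision at both granularities might look something like this (field names and labels are invented, not the benchmark's schema):

```python
# Hypothetical record shape -- NOT CodeTraceBench's actual schema.
record = {
    "framework": "some-code-agent",   # one of the four agent frameworks
    "task": "bug_fixing",             # or "refactoring", "terminal_interaction"
    "stages": [
        {"name": "localize", "label": "ok"},
        {
            "name": "edit",
            "label": "failure_onset",            # stage-level supervision
            "steps": [
                {"action": "apply_patch", "label": "ok"},
                {"action": "run_tests", "label": "failure_onset"},  # step-level
            ],
        },
    ],
}

# A localizer is scored on whether it recovers both labels.
onset_stage = next(s for s in record["stages"] if s["label"] == "failure_onset")
print(onset_stage["name"])  # edit
```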

Experiments show CodeTracer substantially outperforms direct prompting and lightweight baselines at finding where agents fail. This is not just academic. If you are running code agents in production, the difference between a 10-minute debug session and a 2-hour one often comes down to whether you can trace back to the first wrong turn.

https://arxiv.org/abs/2604.11641