May 9, 2026 · Research · Benchmark · Agents

Microsoft Just Showed Frontier LLMs Quietly Wreck 25% of Your Documents

From Microsoft Research: Philippe Laban, Tobias Schnabel, and Jennifer Neville built a benchmark called DELEGATE-52: 52 professional domains simulating long delegated workflows where you hand the LLM a document and let it edit. Crystallography. Music notation. Legal contracts. Code. Then they ran 19 models, including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4.

Average corruption rate: 25% of document content silently broken by the end of the workflow. Frontier models are not exempt. Adding agentic tool use does not improve performance. The errors are sparse but severe — a wrong cell in a spreadsheet, a flipped sign in a chemistry formula, a deleted clause in a contract. Things you would not notice scrolling through but that change meaning entirely.
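The core measurement is easy to demonstrate in miniature. The following is a minimal sketch (my own illustration, not the paper's harness): diff a document before and after an edit session, and count the fraction of original lines that changed outside the lines the user actually asked the agent to touch. That leftover diff is the "silent corruption" the paper is describing.

```python
import difflib

def silent_corruption_rate(original: str, edited: str, requested: set[int]) -> float:
    """Fraction of original lines changed outside the requested edit.

    `requested` holds 0-based line numbers the agent was supposed to modify;
    any other differing line counts as silent corruption. This is a
    hypothetical helper for illustration, not the DELEGATE-52 metric.
    """
    orig_lines = original.splitlines()
    edit_lines = edited.splitlines()
    matcher = difflib.SequenceMatcher(a=orig_lines, b=edit_lines, autojunk=False)
    touched: set[int] = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record which original lines this change affected.
            touched.update(range(i1, i2))
    silent = touched - requested
    return len(silent) / max(len(orig_lines), 1)

# The agent was asked to edit line 1 ("b"), but it also dropped line 4 ("e"):
rate = silent_corruption_rate("a\nb\nc\nd\ne", "a\nB\nc\nd", requested={1})
print(rate)  # 1 corrupted line out of 5 -> 0.2
```

A check like this catches the "deleted clause" case precisely because it ignores the edit you asked for and flags everything else, which is the opposite of how most people eyeball agent output.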

Three things make it worse: bigger documents, longer interaction sessions, and presence of distractor files. Exactly the conditions every real-world agentic workflow runs in. The current vibe of "give Claude Code 50 files and let it cook for an hour" is exactly the failure mode this paper isolates.

The HN thread passed 305 points, and a lot of people who use coding agents at scale recognized the symptom immediately. The paper reframes "hallucination": it's not the agent making up wrong answers in chat, it's the agent quietly breaking your files while pretending it didn't.

arxiv.org/abs/2604.15597, code at github.com/microsoft/DELEGATE52. The reliability ceiling on agentic delegation is much lower than the marketing implies. This is the benchmark that finally measures it.
