GUI Agents Fall Apart Once They Cross Apps
WindowsWorld, a new desktop-agent benchmark from HIT-Shenzhen, dropped April 30. 181 desktop tasks, 17 applications, an average of 5 sub-goals per task. 78 percent of tasks require working across multiple applications. The frontier-model results are brutal: under 21 percent success on multi-application workflows, and near-zero when conditional reasoning has to happen across three or more apps.
This is the gap that almost every GUI agent benchmark from 2025 hid. Single-app benchmarks (browse, fill form, save file) had agents at 60-70 percent. The moment the task involves 'find the number in this spreadsheet, paste it into the Word draft, then update the calendar invite,' performance collapses. Working memory across windows is where state actually lives in real office work, and the models keep losing it at app boundaries.
The benchmark itself is a strong artifact. Task design averages five sub-goals so partial credit is meaningful, the 17 apps are real Office plus browsers plus chat clients, and code plus eval is up at github.com/HITsz-TMG/WindowsWorld. This is the kind of benchmark that will get cited because every vendor claiming a 'desktop agent' (Microsoft Windows 11 Agentic Taskbar, Microsoft Foundry IQ, Manus My Computer, OpenAI Codex, Claude Code) now has a number to beat.
The deeper read: cross-app reasoning is the bottleneck for the entire 'agent does my actual job' pitch. Spreadsheet plus Word plus calendar plus Slack is white-collar work. If the consistent failure mode is at the window switch, the architecture problem isn't in the model; it's in the harness that decides which app's state gets loaded into context at which step. WindowsWorld is the first benchmark that names that gap clearly enough to argue about.
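To make the harness argument concrete, here is a minimal sketch of the failure mode. All names here are hypothetical illustrations, not from the paper: a toy context loader that keeps per-app state snapshots and, under a fixed context budget, loads only the most recently focused apps. With a naive most-recently-used policy, the spreadsheet's state silently falls out of context by the time the agent reaches the third app, which is exactly the window-switch failure the benchmark surfaces.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    """Snapshot of one application's observable state (hypothetical schema)."""
    app: str
    summary: str  # compressed description of the window's contents


class CrossAppHarness:
    """Toy harness: persists per-app state across window switches and
    decides which states to load into the model's context each step."""

    def __init__(self, context_budget: int = 2):
        self.states: dict[str, AppState] = {}
        self.context_budget = context_budget  # how many app states fit in context
        self.history: list[str] = []  # focus order, most recent last

    def observe(self, app: str, summary: str) -> None:
        """Record or refresh an app's state and mark it most recently used."""
        self.states[app] = AppState(app, summary)
        if app in self.history:
            self.history.remove(app)
        self.history.append(app)

    def build_context(self) -> list[AppState]:
        """Naive MRU policy: the focused app plus the most recent others,
        up to the budget. Older app states are silently dropped."""
        recent = list(reversed(self.history))[: self.context_budget]
        return [self.states[a] for a in recent]


harness = CrossAppHarness(context_budget=2)
harness.observe("Excel", "Q3 revenue in cell B7")
harness.observe("Word", "draft paragraph awaiting the figure")
harness.observe("Calendar", "invite that needs the new date")

# By the third window switch, Excel's state has been evicted:
apps_in_context = [s.app for s in harness.build_context()]
```

With a budget of two, `apps_in_context` is `["Calendar", "Word"]`: the number the agent went to the spreadsheet for is gone before it can be pasted. A smarter harness would have to carry task-relevant state forward explicitly rather than relying on recency.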
Paper: https://arxiv.org/abs/2604.27776