ClawMark Says Frontier Models Top Out at 55 on Coworker Tasks
ClawMark dropped on arXiv today. Forty-seven authors from Evolvent AI built a 100-task benchmark across 13 professional domains for what they call coworker agents: agents you'd actually try to slot into a workplace alongside humans, working across multiple days, multiple services, and raw multimodal evidence. Domains include research, content ops, HR, e-commerce, journalism, and product management.
Top of the leaderboard right now: GPT-5.4 at 55.0. Claude 4.6 Sonnet at 54.9. Qwen 3.6 Plus at 49.8. Gemini 3.1 Pro Preview at 39.3. MiniMax M2.7 at 34.4. The two frontier closed models effectively tie at the top, and both fail more than 45 percent of tasks designed to look like normal coworker work.
The methodology call that matters here is rule-based scoring. No LLM-as-judge. Forty-seven authors wrote rules for every task instead of using GPT-5.5 to grade GPT-5.4. That's the right call: the eval-crisis cluster (SWE-bench Verified contamination, DIVERT, OpenAI's deprecation announcement) keeps showing that scores from LLM judges drift in ways the people building the benchmark can't audit. Going rule-based costs more author-hours, but it produces benchmarks that don't get gamed by the model they're scoring.
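To make "rule-based scoring" concrete, here is a minimal sketch of what a per-task rule checker could look like. Everything in it is an illustrative assumption, not ClawMark's actual harness: the task, the field names, the tolerances, and the scoring function are invented for this example. The point is the shape of the thing: every pass/fail criterion is a deterministic predicate a human author wrote and can audit.

```python
# Hypothetical sketch of a rule-based task checker. Not ClawMark's code;
# the task, field names, and tolerances are invented for illustration.
from dataclasses import dataclass


@dataclass
class RuleResult:
    rule: str
    passed: bool


def check_expense_report(agent_output: dict) -> list[RuleResult]:
    """Score one hypothetical 'file an expense report' task with explicit rules."""
    rules = [
        # Rule 1: the report must reference the correct trip dates.
        ("trip_dates_correct",
         agent_output.get("start_date") == "2025-03-03"
         and agent_output.get("end_date") == "2025-03-05"),
        # Rule 2: the total must match the receipts within a one-cent tolerance.
        ("total_matches_receipts",
         abs(agent_output.get("total_usd", 0.0) - 412.37) <= 0.01),
        # Rule 3: the report must be routed to the right approver.
        ("routed_to_manager",
         agent_output.get("approver") == "manager@example.com"),
    ]
    return [RuleResult(rule=name, passed=bool(ok)) for name, ok in rules]


def task_score(results: list[RuleResult]) -> float:
    """Fraction of rules passed; an all-or-nothing variant is just as auditable."""
    return sum(r.passed for r in results) / len(results)
```

The trade-off described above falls straight out of this shape: someone has to hand-write every predicate for every task, but nothing in the scoring loop depends on another model's judgment, so the score can't drift when the judge model changes or gets deprecated.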
The editorial point — when you measure agents on actual coworker tasks across days and services and modalities, frontier models top out near 55. Not 80. Not 95. They are far from being functional coworkers. SciCrafter said the same thing in scientific discovery. DIVERT said it in tool use efficiency. ClawMark says it in workplace coworking. Three benchmarks, three angles, same answer.
Site: https://claw-mark.com/
Code: https://github.com/evolvent-ai/ClawMark
Paper: https://arxiv.org/abs/2604.23781