May 10, 2026 · Research · Monitoring · Agents

PrefixGuard Catches Agent Failures Before the Final Output

Xiaowei Huang's group at Liverpool put up PrefixGuard on arXiv four days ago. The subtitle is the actual product: from LLM-agent traces to online failure-warning monitors. The premise is the boring truth that final-outcome evals always miss: by the time you score the agent's last action, the agent has already taken twelve actions you would have stopped if you had been watching.

The pipeline is two stages. StepView reads raw agent traces and induces structured adapters, basically a lossy compression that keeps the bits a monitor needs. Then a supervised classifier trains on outcome labels to assign a risk score at each step. The result is a deterministic finite automaton with 20-29 states for simple tasks, blowing up to 151-187 states for complex ones. Effectively, the paper extracts a state machine of how a particular agent fails on a particular benchmark.
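To make the shape of such a monitor concrete, here is a minimal sketch of an online risk monitor built on a finite automaton. It assumes the learned artifact boils down to a transition table over abstracted step symbols plus one risk score per state. Every name in it (`abstract_step`, `RiskAutomaton`, `first_alarm`, the toy states, symbols, and scores) is a hypothetical illustration, not the paper's API or its actual StepView adapters.

```python
def abstract_step(step):
    """Hypothetical adapter: compress a raw step dict into a coarse symbol."""
    if step.get("error"):
        return "tool_error"
    if step.get("tool") == "submit":
        return "submit"
    return "tool_ok"


class RiskAutomaton:
    """Toy DFA-style monitor: one risk score per state (all values assumed)."""

    def __init__(self, transitions, risk, start="s0"):
        self.transitions = transitions  # (state, symbol) -> next state
        self.risk = risk                # state -> risk score in [0, 1]
        self.state = start

    def observe(self, step):
        symbol = abstract_step(step)
        # Unseen (state, symbol) pairs keep the current state in this sketch.
        self.state = self.transitions.get((self.state, symbol), self.state)
        return self.risk[self.state]


def first_alarm(monitor, trace, threshold=0.5):
    """Index of the first step whose risk crosses the threshold, else None."""
    for i, step in enumerate(trace):
        if monitor.observe(step) >= threshold:
            return i
    return None


# Made-up three-state automaton: repeated tool errors escalate risk.
TRANSITIONS = {
    ("s0", "tool_ok"): "s0",
    ("s0", "tool_error"): "s1",
    ("s1", "tool_ok"): "s0",
    ("s1", "tool_error"): "s2",
}
RISK = {"s0": 0.1, "s1": 0.4, "s2": 0.9}

trace = [
    {"tool": "click"},
    {"tool": "click", "error": True},
    {"tool": "type", "error": True},
    {"tool": "submit"},
]
print(first_alarm(RiskAutomaton(TRANSITIONS, RISK), trace))  # prints 2
```

The point of the toy is that the alarm fires at step index 2, one step before the agent submits, which is the window an operator would need to intervene.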

The numbers are concrete: AUPRC of 0.900 on WebArena, 0.710 on tau-squared-Bench, 0.533 on SkillsBench, and 0.557 on TerminalBench, for an average gain of 0.137 AUPRC over text-classifier baselines. The interesting finding sits in the discussion: strong ranking performance does not guarantee deployment utility. A monitor that flags the right traces but at the wrong step is useless. Early warning is the constraint, not just discrimination.
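The gap between ranking and deployment utility is easy to see in a toy calculation. The sketch below uses made-up per-step risk scores and a hypothetical `failure_step` annotation (the step at which the run becomes unrecoverable). Neither the data nor the `lead_time` definition comes from the paper; they only illustrate why two traces that rank identically can differ in usefulness.

```python
def first_alarm_step(scores, threshold):
    """Index of the first score at or above the threshold, else None."""
    for i, score in enumerate(scores):
        if score >= threshold:
            return i
    return None


def lead_time(scores, failure_step, threshold=0.5):
    """Steps of warning before failure; None if the monitor never fires."""
    alarm = first_alarm_step(scores, threshold)
    return None if alarm is None else failure_step - alarm


# Two failing traces with invented risk scores; failure happens at step 4.
early = [0.1, 0.2, 0.7, 0.8, 0.9]
late = [0.1, 0.1, 0.1, 0.1, 0.9]
failure_step = 4

# A trace-level ranker that scores by the maximum risk cannot tell them apart.
print(max(early) == max(late))        # prints True

# The early-warning view can: one leaves room to intervene, the other does not.
print(lead_time(early, failure_step))  # prints 2
print(lead_time(late, failure_step))   # prints 0
```

Both traces would count as correctly flagged in a trace-level precision-recall computation, but only the first gives two steps of notice.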

PrefixGuard sits in the cluster forming with re_gent (audit-and-rollback for Claude Code tool calls, last week) and AgentTrust (runtime safety, May 7), with PrefixGuard itself covering failure prediction. That makes three structurally distinct slots in the harness-safety stack: record, constrain, predict. It pairs especially well with the Instrumental Choices benchmark dropped the same week: that paper measures whether the failures happen, this paper learns to catch them mid-trace.

arxiv.org/abs/2605.06455. No code released yet. Worth watching this group's GitHub page over the next two weeks: the four benchmarks they ran on are all open, and a working PrefixGuard reference impl would slot directly into the production-monitoring layer that re_gent and AgentTrust are building underneath.