AgentDoG 1.5 Is a 1B Watchdog That Beats GPT-5.4 on Safety
AgentDoG 1.5 dropped on Hugging Face as today's #1 paper with 81 upvotes. It is a family of agent-safety models (0.8B, 2B, 4B, and 8B) that judge whether an agent's trajectory is safe, diagnose why when it is not, and can run inline as a guardrail that blocks unsafe outputs before they ship. The models and datasets are openly released.
The benchmark line is the part that matters. The 4B variant hits 92.2% on R-Judge and matches GPT-5.4 and Gemini-3.1-Pro on trajectory-level judgment. The 0.8B variant still beats every closed-source baseline on fine-grained risk diagnosis. And the whole family was trained on roughly one thousand examples, not millions, because the authors used influence functions to keep only the highest-signal samples from a 32k synthesized pool.
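To make the data-selection idea concrete, here is a minimal sketch of influence-style filtering. It is not the paper's method: true influence functions involve an inverse-Hessian term, and this toy swaps in a simpler first-order proxy (the dot product between a training example's gradient and a validation gradient). The function names and the tiny gradient vectors are made up for illustration.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_top_k(
    train_grads: List[Sequence[float]],
    val_grad: Sequence[float],
    k: int,
) -> List[int]:
    """Return indices of the k training examples with the highest score.

    The score is a first-order proxy for influence (gradient alignment
    with a validation gradient), NOT a full influence-function estimate.
    """
    scores = [dot(g, val_grad) for g in train_grads]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy pool of four "examples", each represented by a hypothetical 2-D gradient.
train_grads = [[1.0, 0.0], [0.5, 0.25], [-1.0, 0.0], [0.0, 2.0]]
val_grad = [1.0, 1.0]

# Keep the two examples whose gradients align best with the validation signal.
selected = select_top_k(train_grads, val_grad, k=2)
print(selected)
```

The same shape scales to the setting described above: score every sample in the synthesized pool, sort, and keep only the top slice for training.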
The structure beneath those numbers is what makes this a real category and not a bolt-on filter. Risk is decomposed into source, failure mode, and real-world harm, with a separate reward channel for each. ATBench-Codex covers repository, shell, and MCP scenarios. ATBench-Claw covers multi-session and approval flows. This is what an agent safety layer actually looks like at scale: a small dedicated model, not a system prompt, running in front of every reply.
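As a rough picture of what "running in front of every reply" could look like, here is a small sketch of a guardrail wrapper that returns a verdict along the three axes named above. Everything in it is an assumption for illustration: `Verdict`, `toy_judge`, `guarded_reply`, and the label strings are hypothetical, and the keyword check merely stands in for a call to an actual safety model. The training-side reward channels are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Judgment over a trajectory, split along the three risk axes."""
    safe: bool
    risk_source: Optional[str] = None
    failure_mode: Optional[str] = None
    real_world_harm: Optional[str] = None


def toy_judge(trajectory: List[str]) -> Verdict:
    """Hypothetical stand-in for a safety model: flags one destructive command."""
    for step in trajectory:
        if "rm -rf /" in step:
            return Verdict(False, "tool_call", "destructive_command", "data_loss")
    return Verdict(True)


def guarded_reply(
    trajectory: List[str],
    draft: str,
    judge: Callable[[List[str]], Verdict],
) -> str:
    """Judge the trajectory plus the draft; pass the draft through or block it."""
    verdict = judge(trajectory + [draft])
    if verdict.safe:
        return draft
    return (
        "Blocked by safety guardrail: "
        f"{verdict.risk_source}/{verdict.failure_mode}/{verdict.real_world_harm}"
    )


history = ["user: clean up the temp directory"]
ok = guarded_reply(history, "shell: ls /tmp", toy_judge)
blocked = guarded_reply(history, "shell: rm -rf /", toy_judge)
print(ok)
print(blocked)
```

The point of the design is that the judge is a separate callable, so a small dedicated model can be swapped in without touching the agent or its system prompt.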
https://arxiv.org/abs/2605.29801