Polarity Closes the 95-to-60 Gap Between Agent Evals and Production
Polarity launched on Product Hunt yesterday. Rank #12, 100 upvotes. The wedge is one of the most quoted numbers in agent ops right now: most teams hit 95 percent on eval suites but only 60 percent in production. Polarity sits in production traffic, watches every agent decision, surfaces failure patterns before users hit them, and feeds the failures back into evals to close the loop.
What it actually does. Three SDKs (Go, Python, TypeScript) drop into the agent runtime. The platform watches tool calls, guardrail behavior, latency, and decision points. When something looks wrong, real-time Slack alerts go out. Wrong tool called. Guardrail skipped. Tail-latency spike. The dashboard also lets you craft expected behaviors and tracks deviations against them, which is how the production data becomes new eval cases. Co-founders Alex U and Jay Chopra positioned it as the self-improvement stack: production traffic is the training set for the next eval suite.
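To make the loop concrete, here is a minimal sketch of checking a recorded agent trace against an expected behavior and turning any deviation into a new eval case. It is not Polarity's SDK or API, which this article does not document. The names `Step`, `Expectation`, `find_deviations`, and `to_eval_case`, along with the field layout, are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Step:
    """One recorded agent decision (hypothetical schema, not Polarity's)."""
    tool: str
    latency_ms: float
    guardrail_ran: bool = True


@dataclass
class Expectation:
    """Expected behavior for one kind of task (hypothetical schema)."""
    allowed_tools: set[str]
    max_latency_ms: float
    require_guardrail: bool = True


def find_deviations(trace: list[Step], expected: Expectation) -> list[str]:
    """Return one message per expectation that a step in the trace breaks."""
    deviations = []
    for i, step in enumerate(trace):
        if step.tool not in expected.allowed_tools:
            deviations.append(f"step {i}: wrong tool called ({step.tool})")
        if expected.require_guardrail and not step.guardrail_ran:
            deviations.append(f"step {i}: guardrail skipped")
        if step.latency_ms > expected.max_latency_ms:
            deviations.append(f"step {i}: latency {step.latency_ms:.0f} ms over budget")
    return deviations


def to_eval_case(task_input: str, trace: list[Step], deviations: list[str]) -> dict:
    """Package a failing production trace as a regression case for the eval suite."""
    return {
        "input": task_input,
        "observed_tools": [step.tool for step in trace],
        "failures": deviations,
    }


if __name__ == "__main__":
    # Illustrative data only: a refund agent that should look things up, not pay out.
    expected = Expectation(
        allowed_tools={"search_orders", "lookup_customer"},
        max_latency_ms=1000.0,
    )
    trace = [
        Step("search_orders", 120.0),
        Step("issue_refund", 2400.0, guardrail_ran=False),
    ]
    deviations = find_deviations(trace, expected)
    if deviations:
        # In a real deployment this is where an alert would be sent.
        print(to_eval_case("Where is my order?", trace, deviations))
```

A per-step latency ceiling is a simplification here. A real tail-latency alert would look at percentiles across many traces rather than a single threshold on one step.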
Why this slot is filling up. Judgment Labs raised seed plus Series A last week, Galileo Agent Control has been around for two quarters, AgentRail shipped earlier in May, and Plurai launched at the end of April. Agent observability is now its own category. The reason is brutal: production failure rates are not going down as model quality goes up, because the failure mode has shifted from model output to agent decision-making. Better models still call the wrong tool, skip a guardrail, or hand off at the wrong moment. The companies cleaning this up are the ones who own the production data.
Built on GitHub, Supabase, and OpenAI. Paid product, pricing not public yet. The 95-to-60 framing is repeatable enough that you should expect to hear it again in agent-ops conversations this quarter.
https://polarity.so