hnup.date Tracks Which Coding Model HN Actually Likes
Someone on Hacker News built a tool that scrapes 200 daily HN posts, filters them down to the 50 about LLMs and coding, runs Gemini over the comments to pull model mentions and sentiment, then publishes the result to a Google Sheet. The output is a ten-day trailing rank of which coding model the HN crowd actually loves and hates. It hit the front page today at 109 points.
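The author's code isn't reproduced here, but the described pipeline is simple enough to sketch. A minimal version might look like the following, assuming the Algolia HN search API for posts and comments and the google-generativeai client for classification; the keyword filter, prompt, model name, and CSV output (standing in for the Google Sheet) are illustrative assumptions, not the site's actual implementation.

import csv
import datetime as dt
import json

import requests
import google.generativeai as genai

# Sketch of the described pipeline: fetch HN front-page stories, keep the
# ones about LLMs/coding, have Gemini tag model mentions plus sentiment.
genai.configure(api_key="YOUR_API_KEY")          # assumption: Gemini API access
model = genai.GenerativeModel("gemini-1.5-flash")  # assumption: any Gemini model works

KEYWORDS = ("llm", "gpt", "claude", "gemini", "coding", "copilot")  # crude topic filter

def fetch_front_page(limit=200):
    """Fetch recent front-page stories via the Algolia HN API."""
    r = requests.get(
        "https://hn.algolia.com/api/v1/search_by_date",
        params={"tags": "front_page", "hitsPerPage": limit},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["hits"]

def fetch_comments(story_id, limit=100):
    """Pull (comment_id, text) pairs for one story."""
    r = requests.get(f"https://hn.algolia.com/api/v1/items/{story_id}", timeout=30)
    r.raise_for_status()
    kids = r.json().get("children", [])[:limit]
    return [(c["id"], c.get("text") or "") for c in kids]

def classify(comment):
    """Ask Gemini for (model, sentiment) pairs found in a comment, as JSON."""
    prompt = (
        "List every coding/LLM model named in this HN comment with a sentiment "
        'of "positive", "negative", or "neutral". Reply as JSON like '
        '[{"model": "...", "sentiment": "..."}]. Comment:\n' + comment
    )
    resp = model.generate_content(prompt)
    try:
        return json.loads(resp.text.strip().strip("`").removeprefix("json"))
    except (json.JSONDecodeError, AttributeError):
        return []

def run():
    today = dt.date.today().isoformat()
    with open(f"mentions_{today}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "comment_id", "model", "sentiment"])
        stories = [s for s in fetch_front_page()
                   if any(k in (s.get("title") or "").lower() for k in KEYWORDS)]
        for story in stories[:50]:
            for comment_id, text in fetch_comments(story["objectID"]):
                for hit in classify(text):
                    writer.writerow([today, comment_id,
                                     hit.get("model"), hit.get("sentiment")])

if __name__ == "__main__":
    run()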
This is the kind of thing that shouldn't have to exist, but it does, because no academic benchmark survives contact with reality. SWE-Bench Verified got deprecated by OpenAI in April. The Tool Attention paper showed accuracy collapses at long horizons. WindowsWorld showed cross-app reasoning is broken. The agent eval crisis has produced sixteen products and papers in the past month, and the field is no closer to a benchmark anyone trusts.
So now we have HN comment sentiment as a benchmark. Auditable through a public Google Sheet that lists every comment ID, every model mentioned, every sentiment classification. Faster than building a real harness. Probably more honest than a leaderboard. The fact that this is the form an agent eval lands in tells you where the field actually is.
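Given rows like those, the ten-day trailing rank is a small aggregation. The sketch below scores each model by net positive-minus-negative mentions, normalized by mention count, over a trailing window; the actual scoring formula hnup.date uses isn't stated, so treat this as one plausible choice.

import csv
import datetime as dt
from collections import defaultdict

def trailing_rank(csv_path, days=10):
    """Rank models by net sentiment over a trailing window.
    Assumes rows with date / model / sentiment columns as above."""
    cutoff = dt.date.today() - dt.timedelta(days=days)
    score, mentions = defaultdict(int), defaultdict(int)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if dt.date.fromisoformat(row["date"]) < cutoff:
                continue
            mentions[row["model"]] += 1
            if row["sentiment"] == "positive":
                score[row["model"]] += 1
            elif row["sentiment"] == "negative":
                score[row["model"]] -= 1
    # Normalize by mention count so a heavily discussed model isn't
    # rewarded for volume alone.
    return sorted(((score[m] / mentions[m], m) for m in mentions), reverse=True)

for net, name in trailing_rank("mentions.csv"):
    print(f"{name}: {net:+.2f}")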
The author publishes the methodology openly. The Gemini sentiment classification is the layer that could go wrong, but the raw mentions are auditable. If Anthropic or OpenAI ship a top-of-leaderboard model that scores low on hnup.date, that gap is the actual signal worth caring about.
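That auditability is cheap to exercise: sample a handful of rows, follow the comment IDs back to the threads, and check the labels by hand. A sketch, assuming the same comment_id / model / sentiment columns the Sheet is described as having:

import csv
import random

def spot_check(csv_path, n=25, seed=0):
    """Print HN links for a random sample of labeled rows so the Gemini
    sentiment calls can be re-checked manually against the source comments."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    random.seed(seed)
    for row in random.sample(rows, min(n, len(rows))):
        print(f"https://news.ycombinator.com/item?id={row['comment_id']}"
              f"  {row['model']}: {row['sentiment']}")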
Site: https://hnup.date/hn-sota