April 25, 2026 · Benchmark · Research · Coding

LamBench: 120 Lambda Calculus Problems That Sort Models By a Cliff

Victor Taelin shipped LamBench and HN voted it onto the front page within 10 hours. The benchmark is brutally simple: 120 lambda calculus programming problems. Models must produce working .lam programs that pass test cases. Score is pass rate. Twelve categories, ten problems each, running from arithmetic to Sudoku to FFT.
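To pin down what "score is pass rate" means here, a minimal sketch of that scoring rule in Haskell. This is not LamBench's harness; the names (Problem, solves, scoreModel) are hypothetical, and it assumes a submission counts as solved only if every test case for the problem matches.

```haskell
data Problem = Problem
  { pname :: String
  , tests :: [(String, String)]  -- (input, expected output) pairs
  }

-- 'eval' stands in for actually running a candidate .lam program.
solves :: (String -> String -> String) -> String -> Problem -> Bool
solves eval program p =
  all (\(input, want) -> eval program input == want) (tests p)

-- Headline score: fraction of problems solved, e.g. 108/120 = 0.9.
scoreModel :: (String -> String -> String) -> [(String, Problem)] -> Double
scoreModel eval submissions =
  fromIntegral (length (filter (uncurry (solves eval)) submissions))
    / fromIntegral (length submissions)

main :: IO ()
main = print (scoreModel toyEval [("any program", echo)])
  where
    echo = Problem "echo" [("x", "x"), ("y", "y")]
    toyEval _ input = input  -- toy evaluator that echoes its input
```

Under that all-or-nothing assumption, a nearly correct encoding still contributes nothing to a problem's score, which fits the cliff-shaped results below.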

The results page reads like a separation of eras. GPT-5.3 Codex and Opus 4.6 both hit 108/120, a 90 percent pass rate. Opus 4.7 and Gemini 3.1 Pro sit right behind at 88.3 percent. Then the cliff: GPT-5.1, Opus 4.5, and Sonnet 4.5 all score zero. Not a low number. Zero. In 2026, lambda calculus reasoning is binary.

This is the kind of benchmark that stays useful because it cannot be brute-forced. Lambda calculus has no library imports, no Stack Overflow corpus, no crutches. Either you understand the encoding or you don't produce a single passing program. That makes it one of the cleaner signals separating raw reasoning from pattern matching.
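To make "understanding the encoding" concrete, here is the textbook Church-numeral construction, written in Haskell rather than LamBench's actual .lam syntax: numbers, addition, and multiplication built from nothing but functions. There is no library to reach for; if the encoding is wrong, nothing downstream can rescue it.

```haskell
{-# LANGUAGE RankNTypes #-}

-- A Church numeral encodes n as "apply f to x, n times".
type Church = forall a. (a -> a) -> a -> a

zero :: Church
zero = \_ x -> x

suc :: Church -> Church
suc n = \f x -> f (n f x)

-- Addition: apply f n times, then m more times.
add :: Church -> Church -> Church
add m n = \f x -> m f (n f x)

-- Multiplication: iterate "apply f n times" a total of m times.
mul :: Church -> Church -> Church
mul m n = \f -> m (n f)

one, two :: Church
one = suc zero
two = suc one

-- Decode to an Int so results can be checked by ordinary tests.
toInt :: Church -> Int
toInt n = n (+ 1) 0

main :: IO ()
main = print (toInt (add one (mul two two)))  -- 1 + 2 * 2 = 5
```

Arithmetic is the easy end of LamBench's twelve categories; the point is that even this much only works if the model gets the encoding exactly right.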

The takeaway for agent builders: when you pick a sub-model for code-heavy agent work, the gap between Opus 4.6 and Opus 4.5 is not a minor delta. It is a wall. LamBench prices route-the-easy-stuff-to-a-cheaper-model plans more honestly than SWE-Bench does.

Live results: https://victortaelin.github.io/lambench/
