Mistral Built a Proof Machine That Beats Opus for Cheap
Mistral just shipped Leanstral 1.5, and the number that should stop you is 587 of 672. That's PutnamBench, a benchmark of competition math problems that stump most models, and Leanstral solves 587 of its 672. It saturates miniF2F at 100 percent, hits state of the art on FATE-H and FATE-X, and on real proof-engineering work it beats Claude Opus 4.6 at a fraction of the cost. All open, Apache-2.0, on Hugging Face with a free API.
What it actually is: a 119B mixture-of-experts model that only fires 6B parameters at a time, tuned for one thing: writing formal proofs in Lean 4. Not chatty math, but real machine-checked proofs that the compiler either accepts or rejects. No hand-waving, no hallucinated lemmas that look right. Either the proof type-checks or it fails.
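To make "type-checks or fails" concrete, here is a minimal Lean 4 sketch. It is not taken from Leanstral's output; the theorem names are made up for illustration, and it uses only `Nat.add_comm` from Lean's core library.

```lean
-- A statement and its proof. Lean accepts the file only if the
-- proof term really has the stated type.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A false statement has no proof. Uncommenting the line below makes
-- compilation fail, because `rfl` cannot show that 2 + 2 equals 5.
-- theorem bogus : 2 + 2 = 5 := rfl
```

There is no partial credit here: a model that emits a plausible-looking but wrong proof gets a compiler error, not a pass.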
The part that matters beyond math: it found five previously unreported bugs in real open-source repos and proved the time-complexity guarantees of an AVL tree. Formal verification used to be a PhD's full-time job. Point this at a codebase and it starts proving your code correct, or showing you where it isn't.
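As a toy picture of what "proving your code correct" means, the Lean 4 sketch below defines a small function and proves a property of it for every input. It is far simpler than an AVL-tree complexity proof and is not drawn from Mistral's announcement; `double` and `double_eq_two_mul` are hypothetical names chosen for illustration.

```lean
-- An ordinary function, the kind of code you might want verified.
def double (n : Nat) : Nat := n + n

-- A specification, proved for all inputs rather than tested on a few.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double  -- expose the definition: the goal becomes n + n = 2 * n
  omega          -- linear-arithmetic decision procedure closes the goal
```

If `double` had a bug, say `n + n + 1`, the same proof attempt would fail, which is how a stuck proof can point at a defect in the code.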
And the test-time scaling curve is the tell for where this goes. At 50k tokens it solves 44 Putnam problems. Give it 4M tokens and it solves 587. More compute, more proofs, almost linearly. That's the shape of a tool you throw an overnight GPU budget at and wake up to a verified codebase.
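The two data points quoted above can be turned into ratios with a few lines of Python. This is only arithmetic on the figures in this article (50k and 4M tokens, 44 and 587 solves, 672 problems); the variable names are mine, and nothing here reproduces Mistral's actual curve.

```python
# Figures quoted in the article.
tokens_low, tokens_high = 50_000, 4_000_000
solved_low, solved_high = 44, 587
total_problems = 672

# How much more compute, and how many more proofs it bought.
token_ratio = tokens_high / tokens_low
solve_ratio = solved_high / solved_low

# Share of PutnamBench solved at the larger budget.
solve_rate = solved_high / total_problems

print(f"token budget grew {token_ratio:.0f}x")
print(f"problems solved grew {solve_ratio:.2f}x")
print(f"solve rate at 4M tokens: {solve_rate:.1%}")
```

On these two points alone, 80 times the tokens yields roughly 13 times the solves, so the growth is steady but not one-for-one in raw token count.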
Link: mistral.ai/news/leanstral-1-5