LongCat-Flash-Prover: Meituan's 560B Agentic Model Sets New Standard for Formal Reasoning
Meituan has open-sourced LongCat-Flash-Prover, a 560-billion-parameter Mixture-of-Experts (MoE) model that advances formal mathematical reasoning through agentic tool-integrated reinforcement learning (RL). The model sets a new state of the art among open-weight models in both auto-formalization and theorem proving in Lean4.
The model decomposes formal reasoning into three independent capabilities (auto-formalization, sketching, and proving) and is trained with a new algorithm, Hierarchical Importance Sampling Policy Optimization (HisPO), designed to stabilize MoE training on long-horizon tasks. A gradient masking strategy, applied at both the sequence level and the token level, accounts for policy staleness and for discrepancies between the training engine and the inference engine.
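To make the two-level masking idea concrete, here is a minimal sketch in Python. It is not the HisPO algorithm from the paper: the function name, the tolerance values, and the exact masking rule are assumptions chosen for illustration. It only shows the general shape of the idea, which is to compare the log-probabilities recorded by the inference engine with those recomputed by the training engine, drop a whole sequence when it has drifted too far, and otherwise drop only the individual tokens that disagree.

```python
from typing import List


def hierarchical_mask(
    train_logprobs: List[float],
    rollout_logprobs: List[float],
    token_band: float = 0.5,  # hypothetical tolerance on a single token's |log ratio|
    seq_band: float = 0.2,    # hypothetical tolerance on the mean |log ratio| of a sequence
) -> List[int]:
    """Return a 0/1 gradient mask for one sampled sequence.

    train_logprobs:   per-token log-probs recomputed by the training engine.
    rollout_logprobs: per-token log-probs recorded by the inference engine
                      when the sequence was sampled.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob lists must have the same length")
    if not train_logprobs:
        return []

    # Per-token log importance ratios between the two engines.
    log_ratios = [t - r for t, r in zip(train_logprobs, rollout_logprobs)]

    # Sequence level: a length-normalised log ratio. If the whole sequence
    # has drifted too far, no token in it contributes a gradient.
    seq_log_ratio = sum(log_ratios) / len(log_ratios)
    if abs(seq_log_ratio) > seq_band:
        return [0] * len(log_ratios)

    # Token level: within an accepted sequence, mask only outlier tokens.
    return [1 if abs(lr) <= token_band else 0 for lr in log_ratios]
```

In this sketch the mask would be multiplied into the per-token policy-gradient loss, so masked tokens are still part of the sampled trajectory but do not push the weights.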
The system uses a Hybrid-Experts Iteration Framework to expand its pool of high-quality task trajectories across three tasks: generating formal statements from informal problems, producing whole proofs directly, and creating lemma-style sketches. Theorem consistency and legality detection mechanisms are designed to eliminate reward hacking, where the model earns reward without actually proving the intended theorem.
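As an illustration of what consistency and legality checks can look like, the sketch below filters a submitted Lean proof before it would be rewarded. It is a purely textual approximation under stated assumptions, not the detection mechanism used by LongCat-Flash-Prover: the forbidden-keyword list, the helper names, and the substring-based statement comparison are all hypothetical, and a real pipeline would rely on the Lean compiler itself.

```python
import re

# Hypothetical list of constructs that let a "proof" compile without proving anything.
FORBIDDEN = ("sorry", "admit", "axiom")


def strip_comments(lean_source: str) -> str:
    """Remove Lean block comments and line comments (nested blocks are not handled)."""
    without_blocks = re.sub(r"/-.*?-/", " ", lean_source, flags=re.DOTALL)
    return re.sub(r"--.*", " ", without_blocks)


def normalize(text: str) -> str:
    """Collapse all whitespace runs so formatting differences do not matter."""
    return " ".join(text.split())


def passes_checks(target_statement: str, submission: str) -> bool:
    """Return True only if the submission is legal and proves the target statement."""
    code = strip_comments(submission)

    # Legality: reject proofs that use escape hatches such as `sorry`.
    if any(re.search(rf"\b{word}\b", code) for word in FORBIDDEN):
        return False

    # Consistency: the exact target statement must still be the one being proved,
    # so the model cannot swap in an easier theorem.
    return normalize(target_statement) + " :=" in normalize(code)
```

For example, with the target `theorem two_add_two : 2 + 2 = 4`, a submission that ends in `sorry` fails the legality check, and a submission that quietly proves `2 + 2 = 2 + 2` instead fails the consistency check.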
For the agentic ecosystem, LongCat-Flash-Prover shows how agentic RL training can push specialized reasoning beyond what standard fine-tuning achieves. In its tool-integrated approach, the model learns during RL to use the Lean4 proof assistant as an external tool. That pattern should carry over to other agents that must learn to use external tools effectively.
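The tool-integrated pattern can be sketched as a simple propose-and-verify loop. Everything here is a stand-in: `tool_loop`, `toy_propose`, and `toy_check` are hypothetical names, the toy policy and toy verifier replace the real model and the real Lean4 toolchain, and the article does not describe the actual rollout code. The point is only that the verifier's feedback is fed back to the policy, and that success is decided by the tool rather than by the model.

```python
from typing import Callable, List, Optional, Tuple


def tool_loop(
    propose: Callable[[str, List[str]], str],
    check: Callable[[str], Tuple[bool, str]],
    statement: str,
    max_turns: int = 4,
) -> Tuple[Optional[str], List[str]]:
    """Alternate between a policy and a verifier until a proof is accepted.

    Returns the accepted proof (or None) and the verifier feedback seen so far.
    """
    feedback: List[str] = []
    for _ in range(max_turns):
        attempt = propose(statement, feedback)
        ok, message = check(attempt)
        if ok:
            return attempt, feedback
        # Failed attempts are not discarded: the error text becomes
        # extra context for the next proposal.
        feedback.append(message)
    return None, feedback


def toy_propose(statement: str, feedback: List[str]) -> str:
    # Stand-in policy: try `rfl` first, switch to `decide` after any error.
    tactic = "decide" if feedback else "rfl"
    return f"{statement} := by {tactic}"


def toy_check(attempt: str) -> Tuple[bool, str]:
    # Stand-in verifier: pretends that only `decide` closes the goal.
    if attempt.endswith("decide"):
        return True, ""
    return False, "error: tactic failed"
```

In an RL setting, the reward for a trajectory would typically depend on whether the loop ends with an accepted proof, which is why the verifier has to be trustworthy.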
GitHub: https://github.com/meituan-longcat/LongCat-Flash-Prover
Paper: https://arxiv.org/abs/2603.21065