Poster in Workshop: 2nd AI for Math Workshop @ ICML 2025
VeriBench: End-to-End Formal Verification Benchmark for AI Code Generation in Lean 4
Brando Miranda · Zhanke Zhou · Allen Nie · Elyas Obbad · Leni Aniva · Kai Fronsdal · Weston Kirk · Dilara Soylu · Andrea Yu · Ying Li · Sanmi Koyejo
Formal verification of software is a promising and potentially transformative application of generative AI. Provably correct code would eliminate entire classes of vulnerabilities, mitigate critical system failures, and could transform software engineering practice through inherently trustworthy implementations. To advance this domain, we present \textbf{VeriBench}, a carefully curated benchmark for evaluating language-model capabilities in end-to-end code verification: each task requires generating a complete Lean 4 program (implementation, unit tests, correctness theorems, and formal proofs) from a reference Python function or its docstring. Our evaluation on the 113-task suite (51 HumanEval problems, 42 easy exercises, 10 classical algorithms, and 11 computer security challenges) shows that current frontier models compile only a small fraction of programs. Claude 3.7 Sonnet achieves compilation on only 12.5\%, while LLaMA-70B fails to compile any program in the Lean 4 HumanEval subset, even with 50 feedback-guided attempts. Notably, among the evaluated approaches, only a self-optimizing agent architecture achieves a meaningful compilation rate, approaching 90\%. VeriBench establishes a rigorous foundation for developing AI systems that synthesize provably correct, bug-free code, advancing the trajectory toward more secure and dependable software infrastructure.
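For illustration only, the sketch below shows the kind of end-to-end Lean 4 artifact each task asks for: an implementation, a unit test, a correctness theorem, and its proof. The Python reference `def double(x): return x + x` is a hypothetical toy example, not a VeriBench task, and the proof assumes a recent Lean 4 toolchain with the `omega` tactic available.

```lean
-- Hypothetical example (not drawn from VeriBench): a Lean 4 port of the
-- Python reference `def double(x): return x + x`.

-- Implementation
def double (n : Nat) : Nat := n + n

-- Unit test, analogous to asserting `double(3) == 6` against the Python reference
example : double 3 = 6 := rfl

-- Correctness theorem: `double` always returns twice its input
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

Actual benchmark tasks pair nontrivial reference programs (HumanEval problems, classical algorithms, security challenges) with specifications whose proofs are far less mechanical than this toy case.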