Diagnosing and Repairing LLM Proof Failures: An Error Taxonomy and APOLLO-Guided Corrections on MiniF2F

Authors

  • Max Levin, Hunter College High School, New York, NY
  • Mihai Boicu, Department of Information Sciences and Technology, George Mason University, Fairfax, VA

Abstract

An important challenge in artificial intelligence is enabling models to perform rigorous mathematical reasoning, which Automated Theorem Proving (ATP) addresses by constructing machine-verifiable proofs. Recent systems pair Large Language Models (LLMs) with proof assistants such as Lean4 to automate formal reasoning. Despite strong performance on benchmarks like MiniF2F, current theorem-proving LLMs often rely on best-of-n sampling, which is computationally expensive. Error reflection systems such as Lyra, DREAM, MA-LoT, and APOLLO use formal verifier error messages to guide subsequent proof attempts; APOLLO introduces a modular pipeline that reprompts an LLM on failed sub-problems and achieves state-of-the-art accuracy on MiniF2F-test with a significantly reduced sampling budget. This work analyzes DeepSeek-Prover-V2-7B’s Pass@1 performance on MiniF2F-test, classifying its errors and determining which of them APOLLO mitigates. Using Chain-of-Thought prompting, the model attempted each of the 244 problems once. The resulting proofs were verified by the Lean4 compiler, and failures were categorized through semantic analysis of the proofs and error messages. DeepSeek-Prover-V2-7B achieved a 49.2% success rate. Among failures, 47.6% were due to natural language reasoning errors (M-type), 37.1% of which belonged to a repetitive-generation breakdown subtype (L-type); 25.8% to logic translation errors (IP-type); 20.2% to tactic failures (U-type); 4.8% to inefficient strategies (T-type); and 1.6% to syntax errors (S-type). APOLLO, given a single pass per LLM call and limited recursion, corrected a subset of the M/L, IP, and U errors, raising the success rate by 9.4%. A variant that uses o3-mini for the initial reasoning step to mitigate M/L errors (APOLLO-SP) was built and tested; it achieved a 6.6% gain but was limited by o3-mini’s frequent failure to produce syntactically valid Lean4 proof sketches. The high rate of M/L errors suggests that reasoning gaps are the primary barrier to improving Pass@1 performance. Future work should evaluate Pass@k performance, test other LLMs and repair frameworks, and develop targeted error-correction strategies for the identified error types. Addressing higher-level reasoning and translation failures may require methods such as prompt engineering, fine-tuning NL-to-Lean translation, or agentic repair strategies that go beyond low-level tactic correction. This error classification and frequency data provides a clearer picture of current ATP limitations, enabling further improvement of LLM performance in ATP through error reflection.
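The evaluation loop described above (one Chain-of-Thought sample per problem, verification by the Lean4 compiler, then semantic bucketing of failures) can be summarized in pseudocode. The sketch below is a minimal illustration only, not the authors’ implementation: the helpers `generate_proof`, `run_lean_compiler`, and `classify_failure`, and the `ok`/`errors` shape of the compiler result, are hypothetical placeholders.

```python
# Minimal sketch of the Pass@1 evaluation and error-classification loop.
# All helper names and the result object's ok/errors attributes are
# assumed placeholders, not APOLLO's actual API.
from collections import Counter

ERROR_TYPES = ["M/L", "IP", "U", "T", "S"]  # taxonomy used in the study

def evaluate_pass_at_1(problems, generate_proof, run_lean_compiler, classify_failure):
    """Attempt each problem once and tally verifier outcomes."""
    successes = 0
    failure_counts = Counter()
    for theorem_statement in problems:
        proof = generate_proof(theorem_statement)        # one CoT-prompted sample
        result = run_lean_compiler(theorem_statement, proof)
        if result.ok:                                    # Lean4 accepted the proof
            successes += 1
        else:
            # Bucket the failure by semantic analysis of the proof text
            # together with the compiler's error messages.
            failure_counts[classify_failure(proof, result.errors)] += 1
    return successes / len(problems), failure_counts
```

An error-reflection system such as APOLLO would extend this loop by feeding `result.errors` back into a repair prompt for the failed sub-problems rather than stopping at classification.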
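For readers unfamiliar with verifier feedback, the Lean4 fragment below illustrates the shape of a low-level tactic (U-type) failure and its repair. The theorem is an invented example, not a MiniF2F problem, and it assumes a recent Lean 4 toolchain in which the `omega` tactic is available.

```lean
-- Illustrative only (not from MiniF2F): a U-type tactic failure and repair.

-- Failed attempt: `rfl` is rejected because the two sides are not
-- definitionally equal, so the compiler emits a tactic error that an
-- error-reflection loop can react to.
-- theorem two_mul_example (n : Nat) : n + n = 2 * n := by rfl

-- Repaired attempt: a linear-arithmetic decision procedure closes the goal.
theorem two_mul_example (n : Nat) : n + n = 2 * n := by omega
```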

Published

2025-09-25

Section

College of Engineering and Computing: Department of Information Sciences and Technology