Diagnosing and Repairing LLM Proof Failures: An Error Taxonomy and APOLLO-Guided Corrections on MiniF2F
Abstract
An important challenge in artificial intelligence is enabling models to perform rigorous mathematical reasoning, which Automated Theorem Proving (ATP) addresses by constructing machine-verifiable proofs. Recent systems pair Large Language Models (LLMs) with proof assistants such as Lean4 to automate formal reasoning. Despite strong performance on benchmarks like MiniF2F, current theorem-proving LLMs often rely on best-of-n sampling, which is computationally expensive. Error reflection systems such as Lyra, DREAM, MA-LoT, and APOLLO use formal verifier error messages to guide subsequent proof attempts; APOLLO in particular introduces a modular pipeline that reprompts an LLM on failed sub-problems and achieves state-of-the-art accuracy on MiniF2F-test with a significantly reduced sampling budget. This work analyzes DeepSeek-Prover-V2-7B’s Pass@1 performance on MiniF2F-test, classifying its errors and determining which of them APOLLO mitigates. Using Chain-of-Thought prompting, the model attempted each of the 244 problems once. The resulting proofs were verified by the Lean4 compiler, and failures were categorized through a semantic analysis of the proofs and error messages. DeepSeek-Prover-V2-7B achieved a 49.2% success rate. Among failures, 47.6% were natural language reasoning errors (M-type), of which 37.1% were a repetitive generation breakdown subtype (L-type); 25.8% were logic translation errors (IP-type); 20.2% were tactic failures (U-type); 4.8% were inefficient strategies (T-type); and 1.6% were syntax errors (S-type). Given a single pass per LLM call and limited recursion, APOLLO corrected a subset of the M/L-, IP-, and U-type errors, raising the success rate by 9.4%. A variant, APOLLO-SP, that uses o3-mini for the initial reasoning step to mitigate M/L-type errors was also created and tested; it achieved a 6.6% gain but was limited by o3-mini’s failures to produce syntactically valid Lean4 proof sketches. The high rate of M/L-type errors suggests that reasoning gaps are the primary barrier to improving Pass@1 performance. Future work should evaluate Pass@k performance, test other LLMs and repair frameworks, and develop targeted error correction strategies for the identified error types. Addressing higher-level reasoning and translation failures may require methods such as prompt engineering, fine-tuning NL-to-Lean translation, or agentic repair strategies beyond low-level tactic correction. Together, this error classification and its frequency data provide a clearer picture of current ATP limitations and a basis for further improving LLM performance in ATP through error reflection.
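For orientation, the sketch below outlines the kind of Pass@1 checking loop the evaluation describes: each candidate proof is written to a file, type-checked with the Lean4 compiler, and, on failure, given a crude first-pass bucket from the compiler output. The helper names, the `lake env lean` invocation, and the keyword heuristic are illustrative assumptions rather than the study's implementation; in the study, the M/L, IP, U, T, and S categories were assigned by semantic analysis of the proofs and error messages, not by keyword matching.

```python
"""Minimal sketch (not the paper's code) of a Pass@1 check-and-bucket loop."""
import subprocess
import tempfile
from pathlib import Path


def check_with_lean(proof_source: str, timeout_s: int = 300) -> tuple[bool, str]:
    """Write a candidate proof to a temporary file and type-check it with Lean4.

    Assumes a Lake project with the needed dependencies in the working
    directory, so that `lake env lean` resolves imports; adjust to your setup.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof_source)
        path = Path(f.name)
    try:
        result = subprocess.run(
            ["lake", "env", "lean", str(path)],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.returncode == 0, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        path.unlink(missing_ok=True)


def rough_bucket(error_log: str) -> str:
    """Crude first-pass bucketing from compiler output alone (illustrative only).

    Compiler messages can flag low-level symptoms, but the paper's taxonomy
    required reading the proof itself, so most failures land in manual review.
    """
    if "timeout" in error_log:
        return "timeout: candidate for manual T-type review"
    if "unexpected token" in error_log:
        return "parse error: candidate for manual S-type review"
    if "unknown identifier" in error_log or "tactic" in error_log:
        return "tactic/identifier error: candidate for manual U-type review"
    return "needs manual semantic review (M/L, IP, or other)"
```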
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.