Analyzing Prompt Engineering for Code Generation Accuracy with o4-Mini on Codeforces Problems

Authors

  • Sunmay Padiyar, Broad Run High School, Ashburn, VA
  • Aaroosh Kurchania, Basis Chandler, Chandler, AZ
  • Daivya Singh, University of Maryland, College Park, MD
  • Mihai Boicu, Department of Information Sciences and Technology, George Mason University, Fairfax, VA

Abstract

Generative AI tools such as OpenAI’s o4-mini are increasingly used to aid programmers in solving coding tasks, but their effectiveness often depends on the quality of the prompts, which can be improved through a process known as prompt engineering. While previous studies have explored prompt engineering in software development contexts, its impact on competitive programming tasks, which require more precise reasoning and problem-solving, remains underexplored. In this study, we evaluate the effect of four prompting strategies—No Prompt, Zero-Shot Chain-of-Thought (CoT), Prompt Chaining, and Advanced CoT—on o4-mini’s performance on 90 advanced Codeforces problems rated 1600–2400. These problems were released after May 31st, 2024, the official knowledge cutoff date for o4-mini, to reduce overlap with the model’s training data. AI-generated code was submitted to the official Codeforces grader, and results were measured by acceptance rate. Among all methods, Advanced CoT achieved the highest acceptance rate at 56.67%, significantly outperforming No Prompt (44.44%) and Zero-Shot CoT (42.22%), and marginally outperforming Prompt Chaining (55.56%). McNemar’s test (significance level of 0.0166 with a Bonferroni correction) confirmed that the improvement of Advanced CoT over No Prompt was statistically significant (p = 0.0139), while the comparison between Prompt Chaining and No Prompt approached significance (p = 0.0249). Together, these results strongly support the hypothesis that carefully structured prompts can improve LLM performance on high-difficulty, reasoning-intensive tasks. Specifically, achieving over 56% accuracy on recently released and likely unseen problems demonstrates o4-mini's impressive reasoning capabilities when guided by effective prompting, especially given its affordability at just $1.10 per million tokens. Future work could expand this experiment to a broader, more diverse range of problems and to other LLMs to compare their performance with that of o4-mini.
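For readers unfamiliar with the statistical comparison, a minimal sketch is given below. McNemar's test compares paired accept/reject outcomes for the same problems under two prompting strategies, and the 0.0166 threshold is consistent with a Bonferroni correction of α = 0.05 spread across three pairwise comparisons against No Prompt. The verdict vectors and the use of statsmodels' mcnemar function here are illustrative assumptions, not the authors' actual analysis code.

    # Minimal sketch (not the study's code): McNemar's test on paired per-problem
    # verdicts, with a Bonferroni-corrected threshold of 0.05 / 3 ≈ 0.0166.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    def mcnemar_pvalue(accepted_a, accepted_b):
        """Exact McNemar p-value from two boolean arrays of per-problem verdicts."""
        a = np.asarray(accepted_a, dtype=bool)
        b = np.asarray(accepted_b, dtype=bool)
        # 2x2 table of paired outcomes: rows = strategy A accepted/rejected,
        # columns = strategy B accepted/rejected.
        table = [[np.sum(a & b),  np.sum(a & ~b)],
                 [np.sum(~a & b), np.sum(~a & ~b)]]
        return mcnemar(table, exact=True).pvalue

    # Placeholder verdict vectors for 90 problems (True = accepted by the grader);
    # the real inputs would be the per-problem submission results for each strategy.
    rng = np.random.default_rng(0)
    advanced_cot = rng.random(90) < 0.5667
    no_prompt = rng.random(90) < 0.4444

    alpha = 0.05 / 3  # Bonferroni correction for three comparisons against No Prompt
    p = mcnemar_pvalue(advanced_cot, no_prompt)
    print(f"p = {p:.4f}; significant at alpha = {alpha:.4f}: {p < alpha}")

The exact (binomial) form of the test is appropriate here because only the discordant pairs (problems solved under one strategy but not the other) carry information, and their count can be small.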

Published

2025-09-25

Issue

Section

College of Engineering and Computing: Department of Information Sciences and Technology