Analyzing Prompt Engineering for Code Generation Accuracy with o4-Mini on Codeforces Problems

Authors

  • Sunmay Padiyar, Broad Run High School, Ashburn, VA
  • Aaroosh Kurchania, Basis Chandler, Chandler, AZ
  • Daivya Singh, University of Maryland, College Park, MD
  • Mihai Boicu, Department of Information Sciences and Technology, George Mason University, Fairfax, VA

Abstract

Generative AI tools such as OpenAI’s o4-mini are increasingly used to aid programmers in solving coding tasks, but their effectiveness often depends on the quality of the prompts, which can be improved through a process known as prompt engineering. While previous studies have explored prompt engineering in software development contexts, its impact on competitive programming tasks, which require more precise reasoning and problem-solving, remains underexplored. In this study, we evaluate the effect of four prompting strategies—No Prompt, Zero-Shot Chain-of-Thought (CoT), Prompt Chaining, and Advanced CoT—on o4-mini’s performance on 90 advanced Codeforces problems rated 1600–2400. These problems were released after May 31st, 2024, the official knowledge cutoff date for o4-mini, to reduce overlap with the model’s training data. AI-generated code was submitted to the official Codeforces grader, and results were measured by acceptance rate. Among all methods, Advanced CoT achieved the highest acceptance rate at 56.67%, significantly outperforming No Prompt (44.44%) and Zero-Shot CoT (42.22%), and marginally outperforming Prompt Chaining (55.56%). McNemar’s test (significance level of 0.0166 with a Bonferroni correction) confirmed that the improvement of Advanced CoT over No Prompt was statistically significant (p = 0.0139), while the comparison between Prompt Chaining and No Prompt approached significance (p = 0.0249). Together, these results strongly support the hypothesis that carefully structured prompts can improve LLM performance on high-difficulty, reasoning-intensive tasks. Specifically, achieving over 56% accuracy on recently released and likely unseen problems demonstrates o4-mini's impressive reasoning capabilities when guided by effective prompting, especially given its affordability at just $1.10 per million tokens. Future work could expand this experiment to a broader, more diverse range of problems and to other LLMs to compare their performance with that of o4-mini.
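For readers unfamiliar with the statistical comparison, a minimal sketch is given below. McNemar's test compares paired accept/reject outcomes for the same problems under two prompting strategies, and the 0.0166 threshold is consistent with a Bonferroni correction of α = 0.05 spread across three pairwise comparisons against No Prompt. The verdict vectors and the use of statsmodels' mcnemar function here are illustrative assumptions, not the authors' actual analysis code.

    # Minimal sketch (not the study's code): McNemar's test on paired per-problem
    # verdicts, with a Bonferroni-corrected threshold of 0.05 / 3 ≈ 0.0166.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    def mcnemar_pvalue(accepted_a, accepted_b):
        """Exact McNemar p-value from two boolean arrays of per-problem verdicts."""
        a = np.asarray(accepted_a, dtype=bool)
        b = np.asarray(accepted_b, dtype=bool)
        # 2x2 table of paired outcomes: rows = strategy A accepted/rejected,
        # columns = strategy B accepted/rejected.
        table = [[np.sum(a & b),  np.sum(a & ~b)],
                 [np.sum(~a & b), np.sum(~a & ~b)]]
        return mcnemar(table, exact=True).pvalue

    # Placeholder verdict vectors for 90 problems (True = accepted by the grader);
    # the real inputs would be the per-problem submission results for each strategy.
    rng = np.random.default_rng(0)
    advanced_cot = rng.random(90) < 0.5667
    no_prompt = rng.random(90) < 0.4444

    alpha = 0.05 / 3  # Bonferroni correction for three comparisons against No Prompt
    p = mcnemar_pvalue(advanced_cot, no_prompt)
    print(f"p = {p:.4f}; significant at alpha = {alpha:.4f}: {p < alpha}")

The exact (binomial) form of the test is appropriate here because only the discordant pairs (problems solved under one strategy but not the other) carry information, and their count can be small.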

Published

2025-09-25

Issue

Section

College of Engineering and Computing: Department of Information Sciences and Technology