Quantitative analysis of feedback received from Claude 3.5 Sonnet on mathematical programming problems using a multi-dimensional rubric framework

Authors

  • Joel Raj, John F. Kennedy Memorial High School, Iselin, NJ
  • Ashwath Muppa, Thomas Jefferson High School for Science and Technology, Alexandria, VA
  • Achyut Dipukumar, Chantilly High School, Chantilly, VA
  • Rhea Nirmal, Academies of Loudoun, Leesburg, VA and Freedom High School, South Riding, VA
  • Aarush Laddha, Mission San Jose High School, Fremont, CA
  • Teo Kamath, C.G. Woodson High School, Fairfax, VA
  • Sophie Hong, Beverly Hills High School, Beverly Hills, CA
  • Meghana Potla, Chantilly High School, Chantilly, VA
  • Mihai Boicu, Department of Information Sciences and Technology, College of Engineering and Computing, George Mason University, Fairfax, VA

Abstract

Large Language Model-based auto-graders are emerging as a prominent educational technology tool. However, their feedback must be accurate, beneficial, and consistent before they can be integrated into education. Previous studies have demonstrated their auto-grading efficiency in mathematics and science and the importance of prompt engineering, but have paid less attention to mathematical programming problems. A pre-experiment revealed that Claude 3.5 Sonnet produced the most consistent and accurate feedback for such problems compared to other models, such as Microsoft Copilot and Meta Llama 3. In this experiment, four researchers solved five Project Euler problems of increasing difficulty (5-25%) in 40 minutes each, requesting graded feedback from Claude in a revision cycle. Claude graded the solutions using a 22-criterion rubric and a prompt-engineered template designed for consistency. The researchers compared pre- and post-revision scores across rubric categories and problem difficulties. The results showed an average improvement of 17.5 points, and an ANOVA test on the increase in points yielded a p-value of 7.36×10⁻³³. The categories with the greatest average improvement were time complexity (+25.45%), efficiency (+22.59%), and edge-case handling (+22%); the least improved was naming conventions (+0.83%). Claude's feedback was most effective for problems of 10-15% difficulty. Our findings show that LLMs have the potential to substantially improve code grades by providing feedback. Further research should compare the effectiveness of LLM and human-grader feedback with students in an undergraduate-level course, or use field-specific models with a more diverse dataset.
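
As an illustration of the statistical comparison described above, the following Python sketch shows how a one-way ANOVA on pre- and post-revision rubric totals could be computed with SciPy's f_oneway. The score values and group size are hypothetical placeholders, not the study's data.

    # Hypothetical sketch: comparing pre- and post-revision rubric totals.
    # The score values below are illustrative placeholders, not the study's data.
    from scipy import stats

    pre_scores = [62, 70, 55, 68, 73, 60]   # rubric totals before Claude's feedback
    post_scores = [80, 85, 74, 83, 88, 79]  # rubric totals after one revision cycle

    # One-way ANOVA on the two groups of scores, analogous to the test in the abstract
    f_stat, p_value = stats.f_oneway(pre_scores, post_scores)
    print(f"F = {f_stat:.2f}, p = {p_value:.3g}")

    # Average improvement in points across solutions
    mean_gain = sum(post - pre for pre, post in zip(pre_scores, post_scores)) / len(pre_scores)
    print(f"Mean improvement: {mean_gain:.1f} points")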

Published

2024-10-13

Section

College of Engineering and Computing: Department of Information Sciences and Technology