Quantitative analysis of feedback received from Claude 3.5 Sonnet on mathematical programming problems using a multi-dimensional rubric framework
Abstract
Large Language Model (LLM)-based auto-graders are emerging as a prominent educational technology, but their feedback must be accurate, beneficial, and consistent before they can be integrated into education. Previous studies have demonstrated their auto-grading efficiency in mathematics and science and the importance of prompt engineering, but they have paid less attention to mathematical programming problems. A pre-experiment revealed that Claude 3.5 Sonnet produced the most consistent and accurate feedback for such problems compared to other models, such as Microsoft Copilot and Meta Llama 3. In this experiment, four researchers solved five Project Euler problems of increasing difficulty (5-25%) in 40 minutes each, requesting graded feedback from Claude in a revision cycle. Claude graded the solutions using a 22-criteria rubric and a prompt-engineered template designed for consistency. The researchers compared pre- and post-revision scores across rubric categories and problem difficulties. The results showed an average improvement of 17.5 points, and an ANOVA test on the score increase yielded a p-value of 7.36 × 10⁻³³. The categories with the greatest average improvement were time complexity (+25.45%), efficiency (+22.59%), and edge case handling (+22%); the least improved was naming conventions (+0.83%). Claude's feedback was most effective for problems of 10-15% difficulty. Our findings show that LLMs have the potential to substantially improve code grades by providing feedback. Further research should compare the effectiveness of LLM and human grader feedback with students in an undergraduate-level course, or use field-specific models with a more diverse dataset.
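The pre/post-revision comparison described above can be reproduced with a one-way ANOVA on the two score groups. The sketch below is illustrative only: the scores are hypothetical placeholder values, not the study's data, and the F-statistic is computed from first principles (between-group versus within-group mean squares) using only the standard library.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: ratio of between-group to
    within-group mean squares across two or more score groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total observations
    # Between-group sum of squares (variation of group means)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (variation inside each group)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical pre/post-revision rubric totals (0-100), for illustration only
pre  = [62, 70, 55, 68, 74, 60]
post = [80, 88, 72, 85, 90, 78]
f_stat = one_way_anova_f(pre, post)
```

With only two groups, the F statistic equals the square of the two-sample t statistic; the corresponding p-value would be obtained from the F distribution with (1, n − 2) degrees of freedom, e.g. via `scipy.stats.f_oneway`.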
License
Copyright (c) 2024 Joel Raj, Ashwath Muppa, Achyut Dipukumar, Rhea Nirmal, Aarush Laddha, Teo Kamath, Sophie Hong, Meghana Potla, Mihai Boicu
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.