Assessing the Consistency of Open-Source Large Language Models for Algorithm Evaluation
Abstract
The grading of open-ended questions in education is labor-intensive and subject to human error, making it an attractive target for automation through AI. Manual scoring by professionals, although thorough, is time-consuming, often inconsistent, and prone to bias across algorithms and evaluators. Recent advances in AI, particularly large language models (LLMs) such as GPT-4, have shown significant promise in this domain. However, automated scoring methods require large amounts of training data to generalize, are costly to users, and are not yet widely used for grading complex assignments. We tested several LLM prompting strategies, including Chain-of-Thought and comparative grading, and found that rubric-based grading offers transparent grading, thorough feedback, and scores that align closely with human grading. This work examines the consistency of four LLM systems (Anthropic's Claude, OpenAI's ChatGPT, Microsoft Copilot, and Google Gemini) under rubric-based grading. Custom rubrics were developed covering four criteria: algorithm design, completeness, clarity/readability, and logic. Each LLM was asked to grade and provide feedback on a single program that intentionally contained various errors. For each LLM, four prompts, one per rubric category, were submitted 42 times, yielding 672 data points in total along with the corresponding feedback. Statistical methods, including analysis of standard deviations and the Intraclass Correlation Coefficient (ICC), were used to evaluate the consistency of LLM grading and feedback per rubric category. ICC values were used to assess the reliability of LLM grading across repeated trials, with higher values indicating greater consistency.
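To make the consistency analysis concrete, the sketch below shows one way the per-category standard deviations and ICC described in the abstract could be computed. It is a minimal illustration, not the authors' code: the file name (claude_scores.csv) and column names (category, trial, score) are hypothetical, and it assumes scores are stored long-form with one row per rubric category and trial, using the pingouin library for the ICC.

```python
# Minimal sketch of the consistency analysis; file and column names
# are hypothetical stand-ins for the study's actual data layout.
import pandas as pd
import pingouin as pg

# Example layout for one LLM: 4 rubric categories x 42 repeated trials,
# long-form with columns: category, trial, score.
df = pd.read_csv("claude_scores.csv")

# Spread of scores across the 42 trials, per rubric category.
spread = df.groupby("category")["score"].agg(["mean", "std"])
print(spread)

# ICC treating rubric categories as targets and repeated trials as raters;
# higher ICC indicates more consistent grading across trials.
icc = pg.intraclass_corr(data=df, targets="category",
                         raters="trial", ratings="score")
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```

Repeating this per LLM yields one ICC and four per-category standard deviations per model, which can then be compared across the four systems.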
License
Copyright (c) 2024 Ava Moazzez, Aditya Barman, Achyut Nuli, Kashi Kamat, Vibhav Katikaneni, Sarah Liang, Vineel Kandala, Mihai Boicu
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.