Assessing the Consistency of Open-Source Large Language Models for Algorithm Evaluation

Authors

  • Ava Moazzez, The Potomac School, McLean, VA
  • Aditya Barman, University of Illinois Urbana-Champaign, Champaign, IL
  • Achyut Nuli, Bridgewater-Raritan High School, Bridgewater, NJ
  • Kashi Kamat, Thomas Jefferson High School for Science and Technology, Alexandria, VA
  • Vibhav Katikaneni, Independence High School, Ashburn, VA
  • Sarah Liang, University of Southern California, Los Angeles, CA
  • Vineel Kandala, Thomas Jefferson High School for Science and Technology, Alexandria, VA
  • Mihai Boicu, Department of Information Sciences and Technology, College of Engineering and Computing, George Mason University, Fairfax, VA

Abstract

The grading of open-ended questions in education is labor-intensive and prone to human error, making it an attractive target for automation with AI. Manual scoring by professionals, although thorough, is time-consuming and often inconsistent, with bias that varies across algorithms and evaluators. Recent advances in AI, particularly large language models (LLMs) such as GPT-4, have shown significant promise in this domain. Existing automated scoring methods require large amounts of training data to generalize, are costly to users, and are not yet widely used for grading complex assignments. We tested several LLM prompting strategies, such as Chain-of-Thought and comparative grading, and found that rubric-based grading offers transparent grading, thorough feedback, and accurate scores relative to human grading. This work examines the consistency of four LLM systems (Anthropic’s Claude, OpenAI’s ChatGPT, Microsoft Copilot, and Google Gemini) under rubric-based grading. Custom rubrics were developed covering four criteria: algorithm design, completeness, clarity/readability, and logic. Each LLM was asked to grade and provide feedback on one program that intentionally contained various errors. For each LLM, four prompts, one per rubric category, were submitted 42 times, yielding a total of 672 data points along with the corresponding feedback. Statistical methods, including analysis of standard deviations and the Intraclass Correlation Coefficient (ICC), were employed to evaluate the consistency of LLM grading and feedback per rubric category. The ICC values were used to assess the reliability of LLM grading across repeated trials, with higher values indicating greater consistency.
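
The abstract does not state which ICC form was used, so the sketch below is only an illustrative possibility: it assumes a two-way random-effects, absolute-agreement, single-rater model, ICC(2,1), applied to a hypothetical score matrix with the four rubric categories as rows and the 42 repeated trials as columns. The scores in the example are randomly generated placeholders, not data from the study.

    import numpy as np

    def icc2_1(ratings: np.ndarray) -> float:
        """ICC(2,1): two-way random effects, absolute agreement, single rating.

        ratings: (n_targets, k_raters) matrix of scores.
        """
        n, k = ratings.shape
        grand_mean = ratings.mean()
        row_means = ratings.mean(axis=1)
        col_means = ratings.mean(axis=0)

        # Mean squares from a two-way ANOVA without replication.
        ss_rows = k * np.sum((row_means - grand_mean) ** 2)
        ss_cols = n * np.sum((col_means - grand_mean) ** 2)
        ss_total = np.sum((ratings - grand_mean) ** 2)
        ss_err = ss_total - ss_rows - ss_cols

        ms_rows = ss_rows / (n - 1)
        ms_cols = ss_cols / (k - 1)
        ms_err = ss_err / ((n - 1) * (k - 1))

        # Shrout & Fleiss ICC(2,1).
        return (ms_rows - ms_err) / (
            ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
        )

    # Placeholder scores for one LLM: 4 rubric categories x 42 repeated trials.
    rng = np.random.default_rng(0)
    scores = rng.integers(7, 11, size=(4, 42)).astype(float)
    print(f"ICC(2,1) = {icc2_1(scores):.3f}")

Treating rubric categories as targets and repeated trials as raters is one possible arrangement of the study's data; other arrangements (or other ICC forms) would yield different values.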

Published

2024-10-13

Section

College of Engineering and Computing: Department of Information Sciences and Technology