A Comparative Study of Gemini 2.5 Pro and Human Evaluative Metrics in Assessing Student Comprehension Across STEM and Humanities Topics through Conversation-based Assessments
Abstract
To build the Argumentation and Graph-based Open Student Modeling Software (ARGOS), an Intelligent Tutoring System (ITS) with a dynamic Open Student Model (OSM), the project used Gemini 2.5 Pro as its underlying Large Language Model (LLM). To validate the OSM, an experiment was conducted to evaluate whether Gemini could accurately assess a student's mastery through rubric-based grading. The 100-point rubric covered the following areas of student performance: accuracy, clear reasoning, conciseness and clarity, relevance, and effort. Eight simulated student conversations, four in algebra and four in history, were submitted line by line to Gemini along with the system prompt and rubric. The system prompt instructed Gemini to return the student's mastery score for each turn at the very end of the conversation. To establish a baseline for comparison, four human researchers independently scored each conversation using the same rubric. The analysis revealed a strong Pearson correlation coefficient (r = 0.837) between Gemini's overall mastery scores and the human averages, with the direction of the difference between the two remaining consistent across turns. However, Gemini scored students about 10 percent lower on average than the human raters and showed greater variability in its scoring (SD = 14.66). While Gemini performed well on math problems with clearly defined steps, it struggled with open-ended humanities responses, often failing to recognize when a student partially understood a topic. Future versions of ARGOS will explore graph-based representations of student thinking to capture a more human-like and accurate analysis of a student's complete understanding of a topic.
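
As an illustration of the agreement analysis described above, the following minimal Python sketch shows how a Pearson correlation coefficient, the average human-minus-Gemini gap, and the standard deviation of Gemini's scores could be computed from per-conversation mastery scores. The score arrays are hypothetical placeholders, not the study's data, and the exact analysis pipeline used in the paper may differ.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-conversation mastery scores on the 100-point rubric;
# these placeholder values are illustrative only, not the study's results.
gemini_scores = np.array([62, 78, 55, 81, 70, 48, 66, 74], dtype=float)
human_scores = np.array([71, 85, 68, 88, 79, 60, 75, 82], dtype=float)  # mean of four raters

r, p_value = pearsonr(gemini_scores, human_scores)    # agreement between Gemini and human averages
mean_gap = np.mean(human_scores - gemini_scores)      # how much lower Gemini scores on average
gemini_sd = np.std(gemini_scores, ddof=1)             # spread (sample SD) of Gemini's scores

print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
print(f"Mean human-minus-Gemini gap = {mean_gap:.1f} points")
print(f"Gemini score SD = {gemini_sd:.2f}")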
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.