A Comparative Study of Gemini 2.5 Pro and Human Evaluative Metrics in Assessing Student Comprehension Across STEM and Humanities Topics through Conversation-based Assessments
Abstract
To build the Argumentation and Graph-based Open Student Modeling Software (ARGOS), an Intelligent Tutoring System (ITS) with a dynamic Open Student Model (OSM), the project used Gemini 2.5 Pro as its underlying Large Language Model (LLM). To validate the OSM, an experiment was conducted to evaluate whether Gemini could accurately assess a student's mastery through rubric-based grading. The 100-point rubric covered the following areas of student performance: accuracy, clear reasoning, conciseness and clarity, relevance, and effort. Eight simulated student conversations, four in algebra and four in history, were submitted line by line to Gemini along with the system prompt and rubric. The system prompt instructed Gemini to return the student's mastery score for each turn at the very end of the conversation. To establish a baseline for comparison, four human researchers independently scored each conversation using the same rubric. The analysis revealed a strong Pearson correlation coefficient (r = 0.837) between Gemini's overall mastery scores and the human averages, with the direction of the difference between the two remaining consistent across turns. However, Gemini scored students about 10 percent lower on average than the human raters and showed greater variability in its scoring (SD = 14.66). While Gemini performed well on math problems with clearly defined steps, it struggled with open-ended humanities responses, often failing to recognize when a student partially understood a topic. Future versions of ARGOS will explore graph-based representations of student thinking to capture a more human-like and accurate analysis of a student's complete understanding of a topic.
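
As an illustration of the agreement analysis described above, the following minimal Python sketch shows how a Pearson correlation coefficient, the average human-minus-Gemini gap, and the standard deviation of Gemini's scores could be computed from per-conversation mastery scores. The score arrays are hypothetical placeholders, not the study's data, and the exact analysis pipeline used in the paper may differ.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-conversation mastery scores on the 100-point rubric;
# these placeholder values are illustrative only, not the study's results.
gemini_scores = np.array([62, 78, 55, 81, 70, 48, 66, 74], dtype=float)
human_scores = np.array([71, 85, 68, 88, 79, 60, 75, 82], dtype=float)  # mean of four raters

r, p_value = pearsonr(gemini_scores, human_scores)    # agreement between Gemini and human averages
mean_gap = np.mean(human_scores - gemini_scores)      # how much lower Gemini scores on average
gemini_sd = np.std(gemini_scores, ddof=1)             # spread (sample SD) of Gemini's scores

print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
print(f"Mean human-minus-Gemini gap = {mean_gap:.1f} points")
print(f"Gemini score SD = {gemini_sd:.2f}")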
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.