A Comparative Evaluation of Large Language Models for Integration within Argumentation and Graph-based Open Student Modeling Software (ARGOS)

Joel Raj; Ashwath Muppa; Raj Vora; Ansh Saini; Mihai Boicu

doi:10.13021/jssr2025.5164

Authors

Joel Raj, John F. Kennedy Memorial High School, Iselin, NJ
Ashwath Muppa, Thomas Jefferson High School for Science and Technology, Alexandria, VA
Raj Vora, Monroe Township High School, Monroe Township, NJ
Ansh Saini, South Brunswick High School, South Brunswick, NJ
Mihai Boicu, Department of Information Sciences and Technology, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2025.5164

Abstract

This project aims to develop Argumentation and Graph-based Open Student Modeling Software (ARGOS), an Intelligent Tutoring System (ITS) that displays a dynamic Open Student Model (OSM) based on real-time conversation. Researchers have explored conversational agents, assessment-based knowledge displays, and Bayesian Knowledge Tracing (BKT) as ways to model students’ understanding. However, the field still lacks a system that uses live dialogue to track and update a student’s cognitive progress and understanding in real time. To determine which Large Language Model (LLM) best suits ARGOS, this study evaluates how well ChatGPT 4o, Gemini 2.5 Pro, Grok 3, Meta AI, and Claude Sonnet 4 guide student understanding. First, each LLM was given a system prompt with detailed instructions to act as a Socratic tutor by asking guiding questions and staying on task. Each model was then prompted, one dialogue at a time, with four standardized scenarios involving an imaginary student attempting to factor 12x² + 17x + 6. The scenarios simulated the student (1) providing a correct solution, (2) making a standard conceptual error, (3) making a simple calculation mistake, and (4) going off-topic. Four researchers then scored the resulting conversations using a detailed 30-point rubric that evaluated each model on educational quality, factual accuracy, and instruction-following ability. Gemini 2.5 Pro scored the highest, with a mean score of 26.625 out of 30 and a standard deviation of 2.55. A one-way ANOVA yielded a p-value of 3.05 × 10^-7, indicating that the differences in the models’ scores were statistically significant. These results identify Gemini 2.5 Pro as the most effective of the five LLMs tested and support building ARGOS on it as a capable and reliable tutor for real-time conversation. Further research will test Gemini 2.5 Pro’s ability to accurately quantify students’ mastery scores.
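For reference, the factoring task used in all four scenarios has the integer solution (3x + 2)(4x + 3). The sketch below reproduces that result with the standard "ac method" (find two integers whose product is a·c and whose sum is b, then factor by grouping). It is an illustration only and not part of ARGOS or the study's materials; the function name `factor_quadratic` is hypothetical, and the code assumes integer coefficients with a > 0 and c ≠ 0.

```python
from math import gcd


def factor_quadratic(a, b, c):
    """Factor a*x^2 + b*x + c over the integers with the ac method.

    Returns ((p, q), (r, s)), meaning (p*x + q)(r*x + s), or None when
    no integer factorization exists. Assumes a > 0 and c != 0.
    """
    product = a * c
    # Look for integers m and n with m * n == a*c and m + n == b.
    for m in range(-abs(product), abs(product) + 1):
        if m == 0 or product % m != 0:
            continue
        n = product // m
        if m + n != b:
            continue
        # Split the middle term: a*x^2 + m*x + n*x + c, then group.
        g = gcd(a, m)            # common factor of the first group
        p, q = a // g, m // g    # a*x^2 + m*x == g*x * (p*x + q)
        k = n // p               # n*x + c == k * (p*x + q)
        return (p, q), (g, k)
    return None


# The expression from the study's scenarios: 12x^2 + 17x + 6.
(p, q), (r, s) = factor_quadratic(12, 17, 6)
print(f"({p}x + {q})({r}x + {s})")  # (3x + 2)(4x + 3)
```

Expanding the result confirms it: 3x·4x = 12x², 3x·3 + 2·4x = 17x, and 2·3 = 6.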
