A Comparative Evaluation of Large Language Models for Integration within Argumentation and Graph-based Open Student Modeling Software (ARGOS)
DOI:
https://doi.org/10.13021/jssr2025.5164
Abstract
This project aims to develop Argumentation and Graph-based Open Student Modeling Software (ARGOS), an Intelligent Tutoring System (ITS) that displays a dynamic Open Student Model (OSM) based on real-time conversation. Researchers have explored various conversational agents, assessment-based knowledge displays, and Bayesian Knowledge Tracing (BKT) to model students’ understanding. However, the literature lacks a system that uses live dialogue to track and update a student’s cognitive progress and understanding in real time. To determine which Large Language Model (LLM) best suits ARGOS, this study evaluates the abilities of ChatGPT 4o, Gemini 2.5 Pro, Grok 3, Meta AI, and Claude Sonnet 4 to guide student understanding. First, the LLMs were given a system prompt with detailed instructions on how to act as Socratic tutors by asking guiding questions and staying on task. Each model was then prompted, one dialogue at a time, with four standardized scenarios involving an imaginary student attempting to factor 12x² + 17x + 6. The scenarios were designed to simulate the student (1) providing a correct solution, (2) making a standard conceptual error, (3) making a simple calculation mistake, and (4) going off-topic. Four researchers then scored the conversations produced by the LLMs using a detailed 30-point rubric that evaluated each model on educational quality, factual accuracy, and instruction-following ability. Gemini 2.5 Pro scored the highest, with a mean score of 26.625 and a standard deviation of 2.55. Additionally, a one-way ANOVA test yielded a p-value of 3.05 × 10⁻⁷, indicating that the differences in scores among the models were statistically significant. Thus, the experiment identified Gemini 2.5 Pro as the most effective of the LLMs tested, supporting its use in ARGOS as a capable and reliable tutor for real-time conversation. Further research will focus on testing Gemini 2.5 Pro’s ability to accurately quantify students’ mastery scores.
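For reference, the correct solution that scenario (1) expects can be checked mechanically. The sketch below is illustrative only and is not part of the study's materials: the function name `factor_quadratic` and the choice of the AC (grouping) method are assumptions made here, and the function assumes integer coefficients with nonzero a and c.

```python
from math import gcd


def factor_quadratic(a, b, c):
    """Factor a*x^2 + b*x + c over the integers using the AC method.

    Hypothetical helper for illustration. Returns ((p, q), (r, s)) meaning
    (p*x + q)(r*x + s), or None if no integer factorization exists.
    Assumes a and c are nonzero integers.
    """
    ac = a * c
    # Step 1: find integers m and n with m * n == a*c and m + n == b.
    for m in range(-abs(ac), abs(ac) + 1):
        if m == 0 or ac % m != 0:
            continue
        n = ac // m
        if m + n != b:
            continue
        # Step 2: split the middle term and factor by grouping:
        # a*x^2 + m*x + n*x + c = r*x*(p*x + q) + s*(p*x + q).
        r = gcd(a, m)
        p, q = a // r, m // r
        s = n // p
        # Step 3: confirm the result by expanding (p*x + q)(r*x + s).
        assert (p * r, p * s + q * r, q * s) == (a, b, c)
        return (p, q), (r, s)
    return None


# The polynomial used in the four tutoring scenarios.
print(factor_quadratic(12, 17, 6))  # ((3, 2), (4, 3)), i.e. (3x + 2)(4x + 3)
```

Here the pair 8 and 9 multiplies to 72 and sums to 17, so 12x² + 17x + 6 = 12x² + 8x + 9x + 6 = 4x(3x + 2) + 3(3x + 2) = (3x + 2)(4x + 3).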
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.