Determining the Optimal LLM for a Hybrid-Model, Conversational Feedback Tool in Educational Dialogue Systems

Advaitha Taire; Rakshith Jangity; Mihai Boicu

doi:10.13021/jssr2025.5158

Authors

Advaitha Taire, Neuqua Valley High School, Naperville, IL
Rakshith Jangity, John Champe High School, Aldie, VA
Mihai Boicu, Department of Information Sciences and Technology, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2025.5158

Abstract

With the growing demand for accessible, personalized education, Intelligent Tutoring Systems (ITSs) have sought to address individual learners’ needs through adaptive feedback. However, evaluating free-response answers and generating effective, conversational feedback for them remains a major challenge. Specialized, single-purpose models have achieved great success in their respective aims (e.g., response grading, student confidence analysis, conversational responses), yet new ITSs struggle to capitalize on these gains and instead rely on outdated all-in-one systems. This research studies a hybrid-model approach, in which multiple modern, specialized models are combined to analyze an open-ended answer, adapt to the student, and deliver conversational feedback effectively. GPT-4o, Gemini 2.0 Flash, DeepSeek R1, TinyLlama, and Claude Sonnet 4 each received 10 manually curated prompts covering different school subjects (STEM, Humanities, and the Arts), problems, and student responses, corresponding to the planned hybrid-model design. Each LLM’s response was graded independently by two researchers using a rubric with five categories (Conversationality, Relevance, Factual Accuracy, Ease of Understanding, and Helpfulness) worth five points each, for a maximum of 25 points. Claude Sonnet 4 produced the best results, with an average response score of 23.325 (SD = 1.700; t-test: p = 0.0135 versus GPT-4o, p = 0.0005 versus Gemini 2.0 Flash). GPT-4o (22.275) and Gemini 2.0 Flash (21.575) followed with similar performance, while DeepSeek R1 (19.400) and TinyLlama (11.700) scored significantly lower. Claude produces extremely digestible, insightful feedback and hints while still allowing the student to learn firsthand and develop the final answer, making it the most effective of the LLMs tested. Going forward, Claude will be the center of the hybrid-model ITS, with non-LLM models contributing information such as response grades and student confidence.
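
The rubric arithmetic described in the abstract can be illustrated with a minimal Python sketch. It only shows how five category ratings add up to one response score and how a set of such scores is summarized by a mean and standard deviation. The function names, the 0 to 5 rating range, and the example ratings are assumptions made for illustration; they are not the authors' code or data, and the abstract does not state how scores were pooled across prompts and graders or which t-test variant was used.

```python
from statistics import mean, stdev

# Rubric categories named in the abstract; each is worth five points.
CATEGORIES = (
    "Conversationality",
    "Relevance",
    "Factual Accuracy",
    "Ease of Understanding",
    "Helpfulness",
)
MAX_POINTS = 5  # per category, so one response can earn at most 25


def response_score(ratings):
    """Sum one grader's five category ratings for a single response.

    Assumption: ratings run from 0 to MAX_POINTS; the abstract gives only
    the maximum per category, not the minimum.
    """
    missing = set(CATEGORIES) - set(ratings)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    for name in CATEGORIES:
        if not 0 <= ratings[name] <= MAX_POINTS:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS}")
    return sum(ratings[name] for name in CATEGORIES)


def summarize(scores):
    """Return the mean and sample standard deviation of response scores."""
    return mean(scores), stdev(scores)


# Hypothetical ratings, NOT data from the study: two graders, one response.
grader_a = dict(zip(CATEGORIES, (5, 4, 5, 4, 5)))
grader_b = dict(zip(CATEGORIES, (4, 4, 5, 5, 4)))
scores = [response_score(grader_a), response_score(grader_b)]
print(scores, summarize(scores))
```

In this sketch each grader's total is kept as a separate score before summarizing, so disagreement between the two graders shows up in the standard deviation rather than being averaged away.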
