Comparative Evaluation of AI Assistants for SQL Education through Prompt Engineering Techniques
Abstract
As large language models become increasingly integrated into education, their role in supporting students’ understanding of Structured Query Language (SQL) has gained importance. However, the effectiveness of these AI assistants depends heavily on prompt design, particularly in domains like database instruction where precision and structure are essential. This project explores how prompt engineering influences the ability of LLMs to generate accurate, pedagogically valuable SQL responses. To select which models to evaluate, an initial literature review was conducted across multiple benchmark studies comparing LLM performance on SQL-related tasks. Based on metrics such as Execution Accuracy, Exact Match, F1 Score, and Response Quality Score, GPT-4 and Gemini 2.5 were consistently identified as top-performing models across independent evaluations. These findings guided their selection for experimental testing in this study.
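To make the benchmark metrics concrete, the sketch below illustrates how Exact Match and Execution Accuracy are commonly computed for text-to-SQL evaluation. It is a minimal, hypothetical example assuming an in-memory SQLite database; the schema, query strings, and normalization step are illustrative and do not come from the benchmark studies themselves.

```python
# Hypothetical illustration of two text-to-SQL metrics mentioned above.
# Assumes an in-memory SQLite database; schema and queries are made up.
import sqlite3


def exact_match(predicted_sql: str, gold_sql: str) -> bool:
    """Exact Match: predicted query text equals the reference query
    after trivial whitespace/case normalization."""
    normalize = lambda q: " ".join(q.lower().split())
    return normalize(predicted_sql) == normalize(gold_sql)


def execution_accuracy(conn: sqlite3.Connection,
                       predicted_sql: str, gold_sql: str) -> bool:
    """Execution Accuracy: both queries run on the same database and
    return the same result set (order-insensitive here)."""
    try:
        pred_rows = sorted(conn.execute(predicted_sql).fetchall())
        gold_rows = sorted(conn.execute(gold_sql).fetchall())
    except sqlite3.Error:
        return False  # a query that fails to execute cannot be correct
    return pred_rows == gold_rows


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, gpa REAL);
        INSERT INTO students VALUES (1, 'Ada', 3.9), (2, 'Lin', 3.4);
    """)
    gold = "SELECT name FROM students WHERE gpa > 3.5;"
    pred = "SELECT name FROM students WHERE gpa > 3.5 ORDER BY name;"
    print(exact_match(pred, gold))               # False: texts differ
    print(execution_accuracy(conn, pred, gold))  # True: same result set
```

As the example suggests, Execution Accuracy tolerates surface differences between queries, which is one reason it is reported alongside the stricter Exact Match metric.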
The experiment tested six prompt templates of varying complexity, using a standardized set of SQL tasks and consistent database schemas. Two researchers independently scored each model’s outputs using a six-criteria rubric: correctness, schema understanding, query logic, pedagogical clarity, assumption transparency, and reproducibility. Each criterion was scored from 0 to 5, and final scores were averaged across all tasks (see the sketch below). GPT-4 achieved a composite score of 29/30, demonstrating consistently high accuracy, clarity, and reusable query patterns. Gemini 2.5 scored 28/30, closely matching GPT-4 but occasionally producing more complex outputs that could pose challenges for novice learners. Both models performed reliably across prompt formats, though small differences emerged in clarity and formatting consistency.
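The following is a minimal sketch of how per-criterion scores can be aggregated into a composite out of 30, assuming scores are recorded per task and per rater. The criterion names follow the rubric above; the data layout and example values are hypothetical.

```python
# Minimal sketch of the rubric aggregation: average each 0-5 criterion over
# all task/rater entries, then sum the six criterion means (max 30).
# Data layout and example values are illustrative assumptions.
from statistics import mean

CRITERIA = ["correctness", "schema_understanding", "query_logic",
            "pedagogical_clarity", "assumption_transparency", "reproducibility"]


def composite_score(entries: list[dict[str, int]]) -> float:
    """Each entry holds one rater's 0-5 scores for one task."""
    return sum(mean(entry[c] for entry in entries) for c in CRITERIA)


# Example: two rater entries for one model (values are made up).
example = [
    {c: 5 for c in CRITERIA},
    {**{c: 5 for c in CRITERIA}, "pedagogical_clarity": 4},
]
print(round(composite_score(example), 1))  # 29.5 out of 30
```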
While this study did not evaluate performance in the absence of prompt engineering, this remains an area of interest for future research. Both GPT-4 and Gemini 2.5 performed consistently well across all tested prompt structures, reflecting their status as leading AI assistants with strong benchmark results. The results highlight the importance of prompt clarity and specificity in guiding AI responses, but ultimately the two models showed only minor differences in output quality. Their scores indicate that either model can effectively support SQL learning, with no significant performance gap between them. In future work, we would like to expand the research by scoring a broader range of AI assistants to gain a more comprehensive understanding of model performance in SQL education. Additionally, involving more researchers in the scoring process would strengthen the inter-rater reliability of the evaluation.
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.