The Effect of Rubric Language on Essay Scoring Accuracy in LLMs
DOI:
https://doi.org/10.13021/jssr2025.5344
Abstract
Students and educators alike are increasingly turning to large language models (LLMs), a type of artificial intelligence, to assist with essay scoring and feedback. Unlike human scorers, however, LLMs are sensitive to how rubrics are worded: variations in rubric design and phrasing can influence grading outcomes even when the intended meaning remains unchanged. Despite a growing body of research on AI-assisted and automated essay scoring, no existing studies have experimentally tested whether changes in rubric phrasing alone, without altering semantic meaning, can affect LLM grading outcomes. To address this gap, we rescored ten expert-graded AP U.S. History essays, obtained from publicly released College Board samples with their accompanying scoring rubrics, using two widely used generative AI tools (Gemini 2.5 Flash and GPT-4o). Each model evaluated every essay under three rubric conditions: the original College Board rubric (verbatim, neutral) and two reworded variations, one positively framed (affirming) and one negatively framed (deficit-oriented). Rubric edits were screened with SBERT (cosine similarity ≥ 0.95) to confirm that the rewording changed tone while preserving semantic content. Neutral rubric framing produced the highest overall exact-match accuracy with human scores (≈74%) and the most stable scoring across both models. Relative to human scores, positive framing slightly inflated average scores (+0.2 on a 0–6 scale), while negative framing lowered them (−1.2 on a 0–6 scale). Grading variability also depended on the model: under the negatively framed rubric, Gemini 2.5 Flash lowered scores on mid-range essays (originally scored 3–4) 299% more often than GPT-4o. Our experiments indicate that rubric rewording alone can produce consistent, directional scoring shifts in LLMs.
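The two headline metrics in the abstract, exact-match accuracy and average score shift relative to human scores, can be illustrated with a minimal Python sketch. The function names and the score lists below are hypothetical and chosen only for illustration; they are not the study's data or code, and the authors' actual analysis pipeline is not described here.

```python
from statistics import mean


def exact_match_accuracy(human, model):
    """Fraction of essays where the model score equals the human score."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def mean_score_shift(human, model):
    """Average signed difference (model minus human); negative means deflation."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(m - h for h, m in zip(human, model))


# Hypothetical scores on a 0-6 scale (illustration only, not the study's data).
human_scores = [3, 4, 5, 2, 6]
negative_rubric_scores = [2, 4, 3, 1, 5]

accuracy = exact_match_accuracy(human_scores, negative_rubric_scores)
shift = mean_score_shift(human_scores, negative_rubric_scores)
print(f"exact-match accuracy: {accuracy:.0%}, mean shift: {shift:+.1f}")
```

A signed mean is used rather than an absolute one because the abstract reports directional effects (inflation under positive framing, deflation under negative framing), which an absolute difference would hide.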
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.