The Effect of Rubric Language on Essay Scoring Accuracy in LLMs

Authors

  • Vihaan Pol, Department of Information Systems and Operations Management, Costello College of Business, George Mason University, Fairfax, VA
  • Si Xie, Department of Information Systems and Operations Management, Costello College of Business, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2025.5344

Abstract

Students and educators alike are increasingly turning to large language models (LLMs), a type of artificial intelligence, to assist with essay scoring and feedback. Unlike human scorers, however, LLMs are sensitive to how rubrics are worded: variations in rubric design and phrasing can influence grading outcomes even when the intended meaning remains unchanged. Despite a growing body of research on AI-assisted and automated essay scoring, no existing studies have experimentally tested whether changes in rubric phrasing alone, without altering semantic meaning, can affect LLM grading outcomes. To address this gap, we rescored ten expert-graded AP U.S. History essays, obtained from publicly released College Board samples with their scoring rubrics, using two widely used generative AI tools (Gemini 2.5 Flash and GPT-4o). Each model evaluated every essay with the original College Board rubric (verbatim neutral) and two reworded variations: a positively framed (affirming) version and a negatively framed (deficit-oriented) version. Rubric edits were screened with SBERT (cosine similarity ≥ 0.95) to confirm that the wording shifts changed tone while preserving semantic content. Neutral rubric framing produced the highest overall exact-match accuracy with human scores (≈74%) and the most stable scoring across both models. Positive framing slightly inflated average scores (+0.2 on a 0–6 scale), while negative framing depressed them relative to human scorers (−1.2 on the same scale). Grading variability also depended on the model: under the negatively framed rubric, Gemini 2.5 Flash lowered scores on mid-range essays (originally scored 3–4) 299% more often than GPT-4o. We conclude that LLMs are vulnerable to consistent, directional scoring shifts driven by rubric wording alone.
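
As a concrete illustration of the screening step described in the abstract, the sketch below shows one way an SBERT similarity check (cosine similarity ≥ 0.95) could be implemented with the sentence-transformers library. The checkpoint name (all-MiniLM-L6-v2), the helper function, and the example rubric wording are illustrative assumptions, not the authors' exact setup.

    # Minimal sketch of an SBERT semantic-similarity screen for rubric rewordings.
    # Assumes the sentence-transformers package; the checkpoint and example
    # sentences are illustrative, not the study's actual configuration.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def passes_semantic_screen(original: str, reworded: str, threshold: float = 0.95) -> bool:
        """Return True if the reworded rubric item stays semantically close
        enough (cosine similarity >= threshold) to the original wording."""
        embeddings = model.encode([original, reworded], convert_to_tensor=True)
        similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
        return similarity >= threshold

    # Example: a neutral rubric criterion vs. a positively framed rewording.
    original = "Provides a historically defensible thesis that responds to the prompt."
    reworded = "Offers a strong, historically defensible thesis that clearly responds to the prompt."
    print(passes_semantic_screen(original, reworded))

In this kind of check, rewordings that fall below the threshold would be revised and rescreened before being used in the grading experiment.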

Published

2025-09-25

Issue

Section

Costello College of Business: Department of Information Systems and Operations Management