Comparing Machine Learning Models Using Part-of-Speech Features for Essay Plagiarism Detection

Authors

  • Daniel Healey
  • Meghna Sharma
  • Aydin Gocke
  • Mihai Boicu

DOI:

https://doi.org/10.13021/jssr2020.3164

Abstract

As classes move online due to the COVID-19 pandemic, computer-assisted learning has become much more prevalent. However, this reliance on technology has created more opportunities for cheating and plagiarism, especially in essays. A student can submit another student’s essay as their own with only slight alterations, yet current technology cannot accurately determine whether a piece of writing belongs to a given student or to a ghostwriter. To address this issue, we evaluated how effectively three machine learning models verify authorship from part-of-speech density: Support Vector Machines (SVM), Random Forest Classifier (RFC), and Multilayer Perceptron (MLP). The data came from an online volunteer form and a public essay-sharing database, with one to seven papers from each of seventy-five authors. We kept one paper for testing and trained on the remaining papers. For testing, we reassigned half of the test papers to the wrong authors and kept the other half correctly assigned. RFC and MLP each reached an accuracy of around 52%, while SVM reached 56%. We determined that, in general, stylometric part-of-speech features alone are not effective for authorship verification of high school and university-level essays, because only a small amount of text is available from each student.
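To make the feature concrete, here is a minimal sketch of how a part-of-speech density vector could be computed for one essay. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name a tagset or tagger, so the tag list `TAGS`, the function `pos_density`, and the toy pre-tagged sentence are all hypothetical.

```python
from collections import Counter

# Hypothetical tag inventory (a subset of the Universal POS tags).
# The abstract does not say which tagset the study used.
TAGS = ("NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP")


def pos_density(tagged_tokens, tags=TAGS):
    """Return the fraction of tokens carrying each tag, in `tags` order."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = len(tagged_tokens)
    if total == 0:
        # An empty essay has no tokens, so every density is zero.
        return [0.0] * len(tags)
    return [counts[tag] / total for tag in tags]


# Toy pre-tagged text; in practice the tags would come from a POS tagger.
essay = [
    ("The", "DET"), ("student", "NOUN"), ("writes", "VERB"), ("a", "DET"),
    ("long", "ADJ"), ("essay", "NOUN"), ("quickly", "ADV"), ("for", "ADP"),
]

vec = pos_density(essay)
print(dict(zip(TAGS, vec)))
```

Each essay becomes one fixed-length vector of this kind, which a classifier such as an SVM, random forest, or multilayer perceptron can take as input. With only a few short papers per author, these densities are estimated from very little text, which is consistent with the near-chance accuracy the abstract reports.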

Published

2022-12-13

Section

College of Engineering and Computing: Department of Information Sciences and Technology
