Identifying Question-Worthy Entities from Text Paragraphs Using BERT

Authors

  • Varun Vejalla
  • Maanya Shanker
  • Tanya Malik
  • Mihai Boicu

DOI:

https://doi.org/10.13021/jssr2020.3166

Abstract

Generating sufficiently difficult questions to test students' knowledge in a classroom setting can be time-consuming, so automatic question generation (QG) is needed. Most QG models require a context passage and an answer span as inputs. Our research focuses on automatically identifying question-worthy answer spans. To do this, we used a k-Nearest Neighbors (KNN) classifier. We used Google's BERT (Bidirectional Encoder Representations from Transformers), a pre-trained NLP model, to create embeddings for each document. We created a new dataset from the Stanford Question Answering Dataset (SQuAD). Using Stanford's Stanza package and its pipeline interface to the CoreNLP package, we identified named entities, which we then split into two groups based on whether they appeared in the SQuAD answers. Stanza was also used to produce new versions of the paragraphs with coreferences resolved. The classifier with coreference resolution reached a maximal accuracy of 66.8% at 45 neighbors, classifying 9072 of 13577 entities correctly. Without coreference resolution, the maximal accuracy was 66.6% at 47 neighbors, with 9045 of 13577 entities classified correctly. For any fixed number of neighbors, the classifier with coreference resolution consistently outperformed the one without. Although the results do not support the hypothesis that question-worthy entities can be reliably selected from passages using our approach, they suggest that coreference resolution is promising for future answer identification and QG models.
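The classification step described above can be sketched as follows. This is a minimal illustration, not the authors' code: synthetic random vectors stand in for the real 768-dimensional BERT entity embeddings, and the labels (whether an entity appears in a SQuAD answer) are simulated. It shows the neighbor-count sweep the abstract reports, where held-out accuracy is compared across values of k (e.g., 45 vs. 47 neighbors).

```python
# Hypothetical sketch of the KNN answer-span classification step.
# Real BERT embeddings and SQuAD-derived labels are replaced here with
# synthetic data; only the k-sweep methodology mirrors the paper.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-in for BERT entity embeddings (768-dim in the real model;
# 16-dim here for speed). Label 1 = entity appears in a SQuAD answer.
X = rng.normal(size=(600, 16))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Sweep the number of neighbors and keep the best held-out accuracy,
# as the paper does when comparing classifiers at 45 vs. 47 neighbors.
best_k, best_acc = None, 0.0
for k in range(1, 50, 2):  # odd k avoids voting ties in binary labels
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_test, y_test)
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"best k = {best_k}, held-out accuracy = {best_acc:.3f}")
```

In the paper's setting, the same sweep would be run twice, once on embeddings computed from the original paragraphs and once on paragraphs with coreferences resolved by Stanza, and the accuracy curves compared at each k.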

Published

2022-12-13

Section

College of Engineering and Computing: Department of Information Sciences and Technology