Identifying Question-Worthy Entities from Text Paragraphs Using BERT
Generating sufficiently difficult questions to test students' knowledge in a classroom setting can be time-consuming, motivating automatic question generation (QG). Most QG models require a context passage and an answer span as inputs. Our research focuses on automatically identifying question-worthy answer spans. To do this, we use a k-Nearest Neighbors (KNN) classifier. We used Google's BERT (Bidirectional Encoder Representations from Transformers), a pre-trained NLP model, to create embeddings for each document. We created a new dataset from the Stanford Question Answering Dataset (SQuAD). Using Stanford's Stanza package and its pipeline interface to the CoreNLP package, we identified named entities, which were then split into two groups based on whether they appeared in the SQuAD answers. Stanza was also used to produce new versions of the paragraphs with coreferences resolved. With coreference resolution, the classifier reached a maximal accuracy of 66.8% at 45 neighbors (9072 of 13577 entities classified correctly); without coreference resolution, the maximal accuracy was 66.6% at 47 neighbors (9045 of 13577 entities classified correctly). For a fixed number of neighbors, the classifier with coreference resolution consistently outperformed the one without. Although the results do not support the hypothesis that question-worthy entities can be reliably selected from passages using our approach, they suggest that coreference resolution is promising for future answer-identification models and QG models.
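The classification step described above can be sketched in miniature. This is an illustrative example, not the authors' code: the entity vectors below are hypothetical stand-ins for BERT embeddings, and the labels (1 = question-worthy, 0 = not) are invented for the sketch; the KNN majority-vote logic is the standard algorithm the abstract names.

```python
import math

def knn_predict(train_vecs, labels, query, k):
    """Classify `query` by majority vote among its k nearest
    training vectors under Euclidean distance."""
    dists = sorted(
        (math.dist(vec, query), lab)
        for vec, lab in zip(train_vecs, labels)
    )
    votes = [lab for _, lab in dists[:k]]
    return max(set(votes), key=votes.count)

# Hypothetical stand-ins for BERT entity embeddings: question-worthy
# entities (label 1) cluster near 1.0, the rest (label 0) near 0.0.
train_vecs = [
    [1.1, 0.9, 1.0], [0.9, 1.2, 0.8], [1.0, 1.0, 1.1],
    [0.1, -0.1, 0.0], [0.0, 0.2, -0.2], [-0.1, 0.0, 0.1],
]
labels = [1, 1, 1, 0, 0, 0]

# A candidate entity's (made-up) embedding, classified with k = 3,
# mirroring the paper's sweep over neighbor counts.
candidate = [0.95, 1.05, 0.9]
print(knn_predict(train_vecs, labels, candidate, k=3))
```

In the paper's setting, `train_vecs` would hold BERT embeddings for named entities found by Stanza, the labels would come from whether each entity appears in a SQuAD answer span, and `k` would be swept (here 45 and 47 were the best values) to find the maximal accuracy.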
Copyright (c) 2022 Varun Vejalla, Maanya Shanker, Tanya Malik, Mihai Boicu
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.