Towards Empirically-derived Efficacy Metrics for Neural Code-to-Comment Translation


  • Anish Pothireddy, Aspiring Scientists' Summer Internship Program Intern
  • Kevin Moran, Aspiring Scientists' Summer Internship Program Mentor



With the growing demand for natural language models capable of automated code summarization, the need for robust model evaluation metrics has increased. Current evaluation metrics such as BLEU and ROUGE-L focus on specific aspects of model output, such as semantic equivalence with a reference caption, and fail to account for other errors a model can make. The purpose of this study is to define new quantitative metrics that developers can employ when training their models to improve performance across all error types, rather than optimizing only for generalized equivalence with ground truth captions. As a first step, we conducted a qualitative comparison of four recently proposed code summarization models, manually and rigorously classifying the errors each model made relative to ground truth captions. Based on the resulting taxonomy of model errors, we then derived metrics that address the most prevalent error categories. These new metrics can be used as loss functions during model training to optimize performance on code summarization tasks. We hope that future research will build upon this work, with the ultimate goal of driving the creation of more accurate and reproducible automated code summarization models.





College of Engineering and Computing: Department of Computer Science