Towards Empirically-derived Efficacy Metrics for Neural Code-to-Comment Translation
DOI:
https://doi.org/10.13021/jssr2021.3227

Abstract
With the growing demand for natural language models capable of automated code summarization, the need for robust model evaluation metrics has increased. Current evaluation metrics such as BLEU and ROUGE-L focus on specific aspects of model output, such as semantic equivalence with ground-truth captions, and fail to account for other potential errors the model can make. The purpose of this study is to define new quantitative metrics that developers can employ when training their models in order to improve performance across all error types, as opposed to optimizing for generalized equivalence with ground-truth captions. As a first step in this process, we performed a qualitative comparison of four recently proposed code summarization models, involving a rigorous manual classification of the errors each model made when compared to ground-truth captions. Based on the resulting taxonomy of model errors, we then derived metrics that address the most prevalent error categories. These new metrics can be used as loss functions during model training in order to optimize performance on code summarization tasks. We hope that future research efforts will build upon this work, with the ultimate goal of driving the creation of more accurate and reproducible automated code summarization models.
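To make the role of these overlap-based metrics concrete, the sketch below (illustrative only, and not taken from the study) shows how a ROUGE-L-style score can be computed between a model-generated comment and a ground-truth caption using the longest common subsequence of their tokens. The function names, the F1 weighting, and the example strings are assumptions for demonstration purposes.

# Illustrative sketch (not from the paper): a ROUGE-L-style score between a
# model-generated comment and a ground-truth caption. Whitespace tokenization
# and the F1 weighting are simplifying assumptions.

def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L-style F1: harmonic mean of LCS-based precision and recall."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Hypothetical ground-truth caption and model-generated comment.
    ground_truth = "returns the index of the first matching element in the list"
    generated = "return index of first element that matches"
    print(f"ROUGE-L F1: {rouge_l_f1(ground_truth, generated):.3f}")

A score of this kind rewards token overlap with the reference caption, which is precisely the generalized equivalence that the proposed error-driven metrics aim to move beyond.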
License
Copyright (c) 2022 Anish Pothireddy, Kevin Moran
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.