A Simple Multi-Modality Transfer Learning System for End-to-End Sign Language Summarization
Sign languages are fully-fledged visual languages used by over 466 million deaf and hard-of-hearing people worldwide. With their own grammar and lexicon conveyed through manual and non-manual markers, sign languages are not understood by most hearing people and are not supported by communication technologies. Recently, promising progress in sign language recognition and translation has helped reduce this communication barrier. However, little work has been done on downstream sign language processing. Current systems perform downstream tasks through a cascade of models: summarizing the meaning of a long sign language video, for example, would be achieved by chaining a sign language recognition model with a text summarization model. Such cascades allow errors to propagate from one task to the next and are computationally inefficient. Instead, we propose an end-to-end model that directly generates a summary from a sign language video. The model will be built using the How2 and How2Sign datasets. With its simplicity, this model can serve as a solid baseline for future research in downstream sign language processing.
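The contrast between the cascaded and end-to-end designs can be sketched in a few lines. This is a minimal, hypothetical illustration with stub functions standing in for the actual recognition, summarization, and video-to-summary models; none of the function names below come from the paper.

```python
# Hypothetical sketch: cascaded vs. end-to-end sign language summarization.
# All functions are placeholder stubs, not the paper's actual models.

def recognize_signs(video_frames):
    """Cascade stage 1: map video frames to text (stub).

    A real recognizer would run a vision model over the frames;
    here each frame is assumed to carry a word label.
    """
    return " ".join(video_frames)

def summarize_text(text):
    """Cascade stage 2: compress text into a summary (stub).

    Any recognition error in the input text is carried into the
    summary -- this is the error propagation the cascade suffers from.
    """
    words = text.split()
    return " ".join(words[: max(1, len(words) // 2)])

def cascade_summarize(video_frames):
    """Cascade: two separately trained stages chained together."""
    return summarize_text(recognize_signs(video_frames))

def end_to_end_summarize(video_frames):
    """End-to-end: a single model maps video directly to a summary (stub)."""
    return " ".join(video_frames[: max(1, len(video_frames) // 2)])

frames = ["first", "slice", "the", "bread", "thinly"]
print(cascade_summarize(frames))
print(end_to_end_summarize(frames))
```

The end-to-end function has a single trainable component, so there is no intermediate transcript whose errors can compound downstream; this is the design motivation stated in the abstract.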
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.