Name: Nikhil Saini
Dept: Computer Sci. & Eng.
Program: M.Tech (3rd Year)
Name of supervisor: Prof. Preethi Jyothi and Prof. Pushpak Bhattacharyya
Name: Drumil Trivedi
Dept: Computer Sci. & Eng.
Program: B.Tech (4th Year)
Name of supervisor: Prof. Preethi Jyothi and Prof. Pushpak Bhattacharyya
Names: Shreya Khare, Tejas I. Dhamecha and Samarth Bharadwaj
Spoken language differs from written language in style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. Thus, a disfluency correction system that converts disfluent text to fluent text is of great value. This paper introduces a disfluency correction model that translates disfluent to fluent text by drawing inspiration from recent encoder-decoder unsupervised style-transfer models for text. We also show considerable performance gains from using a small sample of 500 parallel disfluent-fluent sentences in a semi-supervised way. Our unsupervised approach achieves a BLEU score of 79.39 on the Switchboard corpus test set, which improves further to 85.28 with the semi-supervised approach. Both scores are comparable to those of two competitive fully supervised models.
Importance & Applications of the Research
Disfluencies are disruptions to the regular flow of speech, typically occurring in conversational speech, i.e., when someone is speaking spontaneously rather than reading. They include filled pauses such as uh and um, word repetitions, irregular elongations, discourse markers, conjunctions, and restarts. For example, the disfluent sentence “well we’re actually uh we’re getting ready” has the fluent form “we’re getting ready”. Here, “well” and “actually” are discourse disfluencies, “uh” is a filler, and the first “we’re” is a restart.
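The example above can be written down as token-level labels, which makes the relation between the disfluent and fluent forms explicit. The following is a minimal sketch only: the label names and the (token, label) layout are illustrative assumptions, not the annotation format of the Switchboard corpus, and the proposed model does not require such labels.

```python
from collections import Counter

# Hypothetical label for tokens that belong to the fluent sentence.
FLUENT = "fluent"

# The example sentence, tagged by hand following the description in the text.
tagged = [
    ("well", "discourse"),
    ("we're", "restart"),
    ("actually", "discourse"),
    ("uh", "filler"),
    ("we're", FLUENT),
    ("getting", FLUENT),
    ("ready", FLUENT),
]


def fluent_form(tagged_tokens):
    """Drop every token that carries a disfluency label."""
    return " ".join(tok for tok, label in tagged_tokens if label == FLUENT)


def disfluency_counts(tagged_tokens):
    """Count how many tokens fall under each disfluency type."""
    return Counter(label for _, label in tagged_tokens if label != FLUENT)


print(fluent_form(tagged))
print(dict(disfluency_counts(tagged)))
```

With labels like these, correction reduces to deleting the tagged tokens. The unsupervised setting is harder precisely because neither such labels nor parallel sentence pairs are assumed to be available.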
Disfluencies affect around 6 percent of the words and occur in both human-human and human-computer interactions. In voice assistant systems such as Siri, Google Assistant, and Amazon Alexa, an automatic speech recognition (ASR) system transcribes speech as is, i.e., it retains the disfluencies. Downstream tasks benefit when the transcriptions are fluent. In cascaded speech-to-text machine translation systems, the machine translation (MT) component would likewise benefit if the disfluencies were removed from the ASR transcriptions.
Novelty & Methodologies
Our main contributions are as follows:
Results and Analysis
Table 1 shows BLEU and METEOR scores between the gold fluent text and the disfluency-corrected output of five different models. Using the original disfluent text as the hypothesis gives a BLEU score of 71.53. The two supervised skylines use 55K pairs of parallel disfluent-fluent sentences during training and yield BLEU scores of up to 90. In comparison, the unsupervised approach yields BLEU scores of up to 80 without any parallel data. Fine-tuning the unsupervised model with a small parallel corpus of only 554 pairs (i.e., two orders of magnitude smaller than the complete set of 55K pairs) substantially narrows this gap and yields a BLEU score of up to 85.
In terms of METEOR, the score using the original disfluent text as the hypothesis is 57.19. On this metric, the difference between the unsupervised and supervised approaches is much smaller, indicating that they perform on par with respect to adequacy, or content preservation. These results also show that the last few additional BLEU points (i.e., the difference between BART and SS) come at the high cost of creating a large parallel corpus.
Using binary embeddings, we obtain BLEU scores of 77.34 and 77.97 on the dev and test sets, respectively, whereas the disfluency-type classifier embedding yields 78.72 and 76.90 on the dev and test sets. Thus, the classifier embeddings do marginally improve performance. However, the BLEU scores obtained using the binary embeddings are almost comparable, which shows that our proposed model can effectively use non-parallel text without any disfluency type labels.
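To see why uncorrected disfluent text already scores as high as 71.53 BLEU against the fluent reference, it helps to recall that BLEU rewards n-gram overlap, and a disfluent sentence shares most of its n-grams with its fluent form. The sketch below is a plain corpus-level BLEU with clipped n-gram precision and a brevity penalty. It assumes whitespace tokenization, one reference per hypothesis, and no smoothing, and it is not the evaluation script used for Table 1. The toy sentence pair is hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, single reference, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as it occurs
            # in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Hypothetical toy pair: the hypothesis is the reference plus one filler word.
reference = ["i think we're getting ready to go"]
disfluent = ["uh i think we're getting ready to go"]

print(round(corpus_bleu(disfluent, reference), 2))
print(round(corpus_bleu(reference, reference), 2))
```

A single inserted filler breaks only the n-grams that span it, so the score stays high even though the sentence is not fluent. This is why the disfluent-text baseline, rather than zero, is the meaningful lower bound when reading the BLEU scores.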
Table 2 shows example outputs from five different models alongside the corresponding disfluent and fluent sentences. All five models can remove simple disfluencies (e.g., fillers and discourse markers) in shorter sentences. In addition, conjunctions and repetitions are removed by all models except the unsupervised BiLSTM model. The third example shows that the transformer model is much better than the BiLSTM model in terms of content retention and adequacy. It also highlights the better fluency of the semi-supervised model compared to the unsupervised model.
We propose an unsupervised disfluency correction model drawing motivation from prior work on unsupervised machine translation and style transfer. We investigate two kinds of domain embeddings for our model. We also present a semi-supervised disfluency correction approach: we fine-tune our model using only about 500 parallel sentences, which brings its performance (based on BLEU scores) very close to that of a state-of-the-art, fully supervised system. In future work, we intend to explore how these techniques can be integrated more closely with spoken language translation.
Recognitions of the work:
Nikhil Saini, Drumil Trivedi, Shreya Khare, Tejas Dhamecha, Preethi Jyothi, Samarth Bharadwaj, and Pushpak Bhattacharyya, 2021. Disfluency Correction using Unsupervised and Semi-supervised Learning. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3421–3427