Keywords: Voice, Deep Learning, Speech Corpus
The need for large-scale speech corpora is growing with the development of end-to-end speech synthesis systems. To create a speech corpus from conversational speech, the recording must be segmented by speaker and annotated, which is a very burdensome task. This study is therefore a preliminary step toward supporting speech corpus creation. We propose classifying dialogue speech into multiple-speaker and single-speaker sections as a form of corpus-creation support. In particular, we compared section classification based on phase as well as amplitude as speech features, and we discussed the difference between CNN and RNN classification results. As a result, multi-speaker and single-speaker sections can be classified using the phase information of speech. In addition, we found that classification accuracy improved when a CNN was used.
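As a minimal sketch of the feature extraction the abstract describes, the fragment below computes both amplitude and phase spectrograms from a short-time Fourier transform; the frame length, hop size, and function name are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def stft_features(signal, frame_len=512, hop=128):
    """Split a 1-D signal's complex spectrogram into amplitude and phase.

    Illustrative sketch (hypothetical parameters): frame the signal,
    apply a Hann window, take the FFT, and separate the complex
    spectrum into amplitude (magnitude) and phase, the two feature
    types compared for section classification.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)  # complex spectrogram
    amplitude = np.abs(spectrum)            # amplitude features
    phase = np.angle(spectrum)              # phase features (radians)
    return amplitude, phase

# Example: a 1-second 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
amp, ph = stft_features(x)
```

Either spectrogram (or both stacked) could then be fed to a CNN or RNN classifier that labels each section as single-speaker or multi-speaker.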