Comparative Analysis of Learning Data Augmentation Techniques for Speech Emotion Recognition

Kazuya Mera

9:40 AM - 10:00 AM

[2L1-OS-9a-03] Comparative Analysis of Learning Data Augmentation Techniques for Speech Emotion Recognition

〇Kazuya Mera¹, Tsuyoshi Sakane¹, Yoshiaki Kurosawa¹, Toshiyuki Takezawa¹ (1. Hiroshima City University)

Keywords: Speech Emotion Recognition, Data Augmentation, Mel-Spectrogram

Machine learning-based Speech Emotion Recognition (SER) and Emotional Speech Synthesis have become increasingly popular in recent years. However, it is difficult to prepare enough training data that closely matches the intended use. One way to increase the volume of data is data augmentation, and various augmentation methods have been proposed in the fields of Automatic Speech Recognition (ASR) and Image Recognition (IR). This paper proposes enlarging the training data for SER with augmentation methods drawn from the ASR and IR fields. Six data augmentation techniques (Time Stretch, Frequency Masking, Time Masking, Frequency Warping, Low-latency Low-resource Voice Conversion (LLVC), and CopyPaste) are applied to the training data for SER, and their effectiveness is compared. The experimental results indicate that applying multiple data augmentation methods improves SER performance. In particular, the combination of LLVC and CopyPaste improved SER accuracy by 0.24 points over the baseline.
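As an illustration of two of the techniques named in the abstract, the following is a minimal sketch of Frequency Masking and Time Masking applied to a mel-spectrogram. It is not the authors' implementation: the function name `mask_spectrogram`, the parameters `max_freq_width`, `max_time_width`, and `fill_value`, and the choice of one mask per axis are all assumptions made for this example.

```python
import numpy as np


def mask_spectrogram(mel, max_freq_width=8, max_time_width=20,
                     fill_value=0.0, rng=None):
    """Return a copy of `mel` (n_mels x n_frames) with one random
    frequency band and one random time span overwritten by `fill_value`.

    Hypothetical helper for illustration; widths and fill value are
    assumed defaults, not values taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(mel, dtype=float, copy=True)
    n_mels, n_frames = out.shape

    # Frequency masking: blank out `f` consecutive mel bins.
    f = min(int(rng.integers(0, max_freq_width + 1)), n_mels)
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = fill_value

    # Time masking: blank out `t` consecutive frames.
    t = min(int(rng.integers(0, max_time_width + 1)), n_frames)
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = fill_value

    return out


if __name__ == "__main__":
    # Stand-in for a real mel-spectrogram: 40 mel bins, 100 frames.
    mel = np.ones((40, 100))
    augmented = mask_spectrogram(mel, rng=np.random.default_rng(0))
    print(augmented.shape, int((augmented == 0).sum()), "cells masked")
```

Each call produces a differently masked copy of the same utterance, so the original emotion label can be reused for the augmented sample.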

Presentation information

[2L1-OS-9a] OS-9