Presentation information

Organized Session

[2L1-OS-9a] OS-9

Wed. May 29, 2024 9:00 AM - 10:40 AM Room L (Room 52)

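Organizers: Shiro Kumano (NTT Communication Science Laboratories), Chie Hieida (Nara Institute of Science and Technology), Junya Morita (Shizuoka University), Midori Sugaya (Shibaura Institute of Technology), Kenji Suzuki (University of Tsukuba)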

9:40 AM - 10:00 AM

[2L1-OS-9a-03] Comparative Analysis of Learning Data Augmentation Techniques for Speech Emotion Recognition

〇Kazuya Mera (1), Tsuyoshi Sakane (1), Yoshiaki Kurosawa (1), Toshiyuki Takezawa (1) (1. Hiroshima City University)

Keywords: Speech Emotion Recognition, Data Augmentation, Mel-Spectrogram

Machine learning-based Speech Emotion Recognition (SER) and Emotional Speech Synthesis have become increasingly popular in recent years. However, preparing enough training data that matches the intended use is difficult. One way to increase data volume is data augmentation, and various augmentation methods have been proposed in the fields of Automatic Speech Recognition (ASR) and Image Recognition (IR). This paper proposes increasing SER training data by applying augmentation methods from the ASR and IR fields. Six data augmentation techniques (Time Stretch, Frequency Masking, Time Masking, Frequency Warping, Low-latency Low-resource Voice Conversion (LLVC), and CopyPaste) are applied to SER training data, and their effectiveness is compared. The experimental results indicate that applying multiple data augmentation methods improves SER performance. In particular, the combination of LLVC and CopyPaste improved SER accuracy by 0.24 points over the baseline.
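
For readers unfamiliar with the masking-based techniques named in the abstract, the following is a minimal sketch of Frequency Masking and Time Masking applied to a mel-spectrogram. It is not the authors' implementation: the function names, the mask widths, the zero fill value, and the random stand-in spectrogram are all assumptions made for illustration.

```python
import numpy as np


def frequency_mask(spec, max_width, rng):
    """Zero out one random band of consecutive mel bins (rows).

    Hypothetical helper; filling with 0.0 is an assumption.
    """
    out = spec.copy()
    n_mels = out.shape[0]
    width = int(rng.integers(0, min(max_width, n_mels) + 1))
    start = int(rng.integers(0, n_mels - width + 1))
    out[start:start + width, :] = 0.0
    return out


def time_mask(spec, max_width, rng):
    """Zero out one random run of consecutive time frames (columns).

    Hypothetical helper; filling with 0.0 is an assumption.
    """
    out = spec.copy()
    n_frames = out.shape[1]
    width = int(rng.integers(0, min(max_width, n_frames) + 1))
    start = int(rng.integers(0, n_frames - width + 1))
    out[:, start:start + width] = 0.0
    return out


rng = np.random.default_rng(0)

# Stand-in for a real mel-spectrogram: 80 mel bins x 200 frames,
# strictly positive so that zeros can only come from masking.
mel = rng.random((80, 200)) + 1.0

# Combine both masks, in the spirit of applying multiple augmentations.
augmented = time_mask(frequency_mask(mel, 10, rng), 20, rng)
```

Each call returns a new array and leaves the input untouched, so the original and augmented spectrograms can both be kept as training examples.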
