Emotion Recognition based on Multimodal Transformer Encoder using Audio and Text

Masahiko Nihira; Masayasu Atsumi

[4Xin1-23] Emotion Recognition based on Multimodal Transformer Encoder using Audio and Text

〇Masahiko Nihira¹, Masayasu Atsumi¹ (1. Soka University)

Keywords: Emotion recognition, Neural network, Multimodal processing

Emotional expression is important for recognizing and generating emotional speech and for visualizing emotional and psychological states, and its recognition has been applied in various fields. In this study, we focus on three issues: effective multimodal ensemble methods, improved feature extraction for the speech modality, and emotion recognition for speech that is divided into segments. We propose an emotion expression recognition model based on a transformer with cross-modal attention between speech and text features. In this model, speech audio is encoded by wav2vec 2.0, and the text transcribed from the speech by Whisper is encoded by RoBERTa. Long speech is divided into segments, and its emotional expression is determined by consensus over the recognition results of the segments. The experimental results show the effect of different ensemble methods, the effectiveness of wav2vec 2.0 for speech audio encoding, the usefulness of the text transcribed by Whisper, and the feasibility of recognizing emotional expression by consensus.
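
The segment-and-consensus step can be illustrated with a minimal sketch. The abstract does not state how the consensus is computed, so the majority vote used here is an assumption, and the function names `split_into_segments` and `consensus_label`, the fixed segment length, and the tie-breaking rule are all hypothetical choices made for illustration only.

```python
from collections import Counter


def split_into_segments(samples, segment_len):
    """Split a sequence of audio samples into consecutive segments.

    Assumption: fixed-length, non-overlapping segments; the last
    segment may be shorter than segment_len.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [samples[i:i + segment_len]
            for i in range(0, len(samples), segment_len)]


def consensus_label(segment_labels):
    """Return the most frequent segment-level label.

    Assumption: consensus is a plain majority vote; ties are resolved
    in favor of the tied label that appears first in the list.
    """
    if not segment_labels:
        raise ValueError("no segment predictions given")
    counts = Counter(segment_labels)
    best = max(counts.values())
    for label in segment_labels:
        if counts[label] == best:
            return label
```

In such a setup, each segment would be classified separately by the recognition model, and `consensus_label` would then be applied to the list of per-segment predictions to obtain one label for the whole utterance.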

Presentation information

[4Xin1] Poster session 2
