5:00 PM - 5:20 PM
[3G5-GS-6-05] Proposal of a Multimodal Emotion Recognition Model Based on the Fusion of Audio and Text
Keywords: Multimodal Emotion Recognition, Transformer, Self-Attention
Multimodal emotion recognition integrates multiple modalities, such as audio, text, and images, to identify and analyze human emotions more comprehensively and accurately. In AI-driven dialogue systems, it has become an indispensable technology for enabling smooth interaction. By fusing data from different modalities, such as audio and text, it can account for inter-modal interactions and correlations that single-modal emotion analysis cannot capture, improving both the generalizability and the accuracy of emotion recognition.

In this study, we constructed a Transformer-based multimodal emotion analysis model that takes audio and text as inputs. The model concatenates the outputs of the modality-specific Transformer encoders for audio and text and applies a self-attention mechanism to the concatenated representation, fusing the two modalities while preserving their cross-modal relationships. In this paper, we validate the performance of the proposed model through comparative experiments against several existing methods on CMU-MOSEI, a standard dataset for emotion recognition tasks, and confirm the advantage of multimodal fusion for emotion recognition.
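The abstract describes the architecture only at a high level, so the following PyTorch sketch is an illustrative reconstruction rather than the authors' implementation: the class name, feature dimensions (74-dim acoustic and 300-dim word features, as in common CMU-MOSEI feature sets), layer counts, mean pooling, and the classification head are all assumptions.

import torch
import torch.nn as nn

class AudioTextFusionModel(nn.Module):
    """Hypothetical sketch: per-modality Transformer encoders, concatenation,
    then self-attention over the joint sequence, as outlined in the abstract."""
    def __init__(self, audio_dim=74, text_dim=300, d_model=128,
                 n_heads=4, n_layers=2, n_emotions=6):
        super().__init__()
        # Project each modality's features into a shared model dimension.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        # Separate Transformer encoders for audio and text
        # (nn.TransformerEncoder deep-copies the layer, so no weights are shared).
        self.audio_enc = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.text_enc = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Self-attention over the concatenated sequence lets audio positions
        # attend to text positions and vice versa: the cross-modal fusion step.
        self.fusion_attn = nn.MultiheadAttention(d_model, n_heads,
                                                 batch_first=True)
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, audio, text):
        # audio: (batch, T_a, audio_dim); text: (batch, T_t, text_dim)
        a = self.audio_enc(self.audio_proj(audio))
        t = self.text_enc(self.text_proj(text))
        fused = torch.cat([a, t], dim=1)          # (batch, T_a + T_t, d_model)
        fused, _ = self.fusion_attn(fused, fused, fused)
        pooled = fused.mean(dim=1)                # mean pooling (assumed)
        return self.classifier(pooled)

model = AudioTextFusionModel()
audio = torch.randn(8, 50, 74)    # e.g. frame-level acoustic features
text = torch.randn(8, 20, 300)    # e.g. word embeddings
logits = model(audio, text)       # (8, 6): one logit per CMU-MOSEI emotion

Concatenating along the sequence axis (rather than the feature axis) is what allows the subsequent self-attention layer to model pairwise interactions between individual audio frames and words; the pooling and output head would depend on whether the task is framed as multi-label classification or regression, which the abstract does not specify.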