5:00 PM - 5:20 PM
[3G5-GS-6-05] Proposal of a Multimodal Emotion Recognition Model Based on the Fusion of Audio and Text
Keywords: Multimodal Emotion Recognition, Transformer, Self-Attention
Multimodal emotion recognition integrates multiple modalities, such as audio, text, and images, to identify and analyze human emotions more comprehensively and accurately. In AI-driven dialogue systems, it has become an indispensable technology for enabling smooth interaction. By fusing data from different modalities, such as audio and text, it can account for inter-modal interactions and correlations that single-modal emotion analysis cannot capture, improving both the generalizability and the accuracy of emotion recognition.

In this study, we constructed a Transformer-based multimodal emotion analysis model that takes audio and text as inputs. The model concatenates the outputs of the modality-specific Transformer encoders for audio and text and applies a self-attention mechanism to the concatenated representation, fusing the two modalities while preserving their cross-modal relationships. In this paper, we validate the performance of the proposed model through comparative experiments against several existing methods on CMU-MOSEI, a standard dataset for emotion recognition tasks, and confirm the advantage of multimodal fusion for emotion recognition.
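The abstract describes the architecture only at a high level, so the following PyTorch sketch is an illustrative reconstruction rather than the authors' implementation: the class name, feature dimensions (74-dim acoustic and 300-dim word features, as in common CMU-MOSEI feature sets), layer counts, mean pooling, and the classification head are all assumptions.

import torch
import torch.nn as nn

class AudioTextFusionModel(nn.Module):
    """Hypothetical sketch: per-modality Transformer encoders, concatenation,
    then self-attention over the joint sequence, as outlined in the abstract."""
    def __init__(self, audio_dim=74, text_dim=300, d_model=128,
                 n_heads=4, n_layers=2, n_emotions=6):
        super().__init__()
        # Project each modality's features into a shared model dimension.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        # Separate Transformer encoders for audio and text
        # (nn.TransformerEncoder deep-copies the layer, so no weights are shared).
        self.audio_enc = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.text_enc = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Self-attention over the concatenated sequence lets audio positions
        # attend to text positions and vice versa: the cross-modal fusion step.
        self.fusion_attn = nn.MultiheadAttention(d_model, n_heads,
                                                 batch_first=True)
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, audio, text):
        # audio: (batch, T_a, audio_dim); text: (batch, T_t, text_dim)
        a = self.audio_enc(self.audio_proj(audio))
        t = self.text_enc(self.text_proj(text))
        fused = torch.cat([a, t], dim=1)          # (batch, T_a + T_t, d_model)
        fused, _ = self.fusion_attn(fused, fused, fused)
        pooled = fused.mean(dim=1)                # mean pooling (assumed)
        return self.classifier(pooled)

model = AudioTextFusionModel()
audio = torch.randn(8, 50, 74)    # e.g. frame-level acoustic features
text = torch.randn(8, 20, 300)    # e.g. word embeddings
logits = model(audio, text)       # (8, 6): one logit per CMU-MOSEI emotion

Concatenating along the sequence axis (rather than the feature axis) is what allows the subsequent self-attention layer to model pairwise interactions between individual audio frames and words; the pooling and output head would depend on whether the task is framed as multi-label classification or regression, which the abstract does not specify.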