Keywords: multimodal, emotional response, dialogue
It is important for Conversational Agents to display emotional and empathetic responses through verbal and nonverbal behaviors. Toward multimodal generation in Conversational Agents, this study proposes a multimodal deep learning model that simultaneously predicts a category of verbal acknowledgement and facial expressions. First, unimodal encoders for audio, language, and face movement were trained; the outputs of these encoders were then fused to train a multimodal decoder that predicts verbal acknowledgement and facial expressions. The model performs substantially better than a baseline model.
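The abstract's encode-fuse-decode pipeline can be sketched as a late-fusion model: each modality is embedded by its own encoder, the embeddings are concatenated, and a shared decoder with two heads predicts the acknowledgement category and the facial expression jointly. The sketch below is a minimal illustration under assumed dimensions (feature sizes, embedding width, and class counts are all hypothetical, as are the random linear encoders); the paper's actual architecture is not specified here.

```python
import numpy as np

# Illustrative late-fusion sketch (all dimensions and weights are
# assumptions, not the paper's actual model).
rng = np.random.default_rng(0)

def make_encoder(in_dim, out_dim):
    """Return a toy linear unimodal encoder: x -> tanh(W @ x)."""
    W = rng.standard_normal((out_dim, in_dim)) * 0.1
    return lambda x: np.tanh(W @ x)

# One encoder per modality, each mapping to a 32-d embedding.
audio_enc = make_encoder(40, 32)    # e.g. 40 acoustic features per frame
text_enc = make_encoder(300, 32)    # e.g. a 300-d utterance embedding
face_enc = make_encoder(17, 32)     # e.g. 17 facial-movement features

# Shared decoder with two output heads over the fused embedding.
N_ACK, N_FACE = 5, 7                # assumed numbers of output classes
W_ack = rng.standard_normal((N_ACK, 96)) * 0.1
W_face = rng.standard_normal((N_FACE, 96)) * 0.1

def predict(audio, text, face):
    """Fuse the three unimodal embeddings and predict both outputs."""
    fused = np.concatenate([audio_enc(audio), text_enc(text), face_enc(face)])
    return int(np.argmax(W_ack @ fused)), int(np.argmax(W_face @ fused))

ack, expr = predict(rng.standard_normal(40),
                    rng.standard_normal(300),
                    rng.standard_normal(17))
print(ack, expr)
```

In a trained version the encoders and heads would be optimized jointly, so each head's loss shapes the fused representation; the concatenation here simply stands in for whatever fusion mechanism the paper uses.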