11:00 AM - 11:20 AM
[3E2-OS-5b-01] Estimating Feedback Responses and the Intensity of Facial Expressions based on Multimodal Information
Keywords: facial expression, action unit, feedback response, multiparty communication, neural networks
Providing feedback to a speaker is an essential communication signal for maintaining a conversation. In addition to verbal feedback responses, facial expressions are an effective modality for conveying the listener's reaction to the speaker's utterances. Moreover, not only the type of facial expression but also its intensity may influence the meaning of a specific feedback response.
In this study, we propose a multimodal deep neural network model that predicts the intensity of facial expressions co-occurring with feedback responses. We collected 33 video-mediated conversations among groups of three people and obtained language, facial, and audio data for each participant. We also annotated feedback responses and clustered their BERT embeddings to classify them into feedback categories. In the proposed method, a decoder with an attention mechanism over the audio, visual, and language modalities produces intensities for 17 AUs frame by frame, and a classifier of feedback labels is trained jointly through multi-task learning.
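The following is a minimal PyTorch sketch of the multi-task setup described above, written to illustrate the idea rather than reproduce the authors' implementation: the encoder types, feature dimensions, number of feedback classes, and the loss weighting are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a multi-task model that regresses
# frame-level intensities for 17 AUs and classifies feedback labels from
# audio, visual, and language features. All layer sizes, feature dimensions,
# and the loss weighting are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalFeedbackModel(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=35, lang_dim=768,
                 hidden_dim=128, num_aus=17, num_labels=5):
        super().__init__()
        # One encoder per modality, projecting frame-level features to a shared size.
        self.audio_enc = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        self.visual_enc = nn.GRU(visual_dim, hidden_dim, batch_first=True)
        self.lang_enc = nn.GRU(lang_dim, hidden_dim, batch_first=True)
        # Attention across the three modality streams at each frame.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        # Task heads: per-frame AU intensity regression and feedback-label classification.
        self.au_head = nn.Linear(hidden_dim, num_aus)
        self.label_head = nn.Linear(hidden_dim, num_labels)

    def forward(self, audio, visual, lang):
        # audio/visual/lang: (batch, frames, feature_dim), assumed time-aligned.
        a, _ = self.audio_enc(audio)
        v, _ = self.visual_enc(visual)
        l, _ = self.lang_enc(lang)
        # Stack modalities and attend across them, frame by frame.
        fused = torch.stack([a, v, l], dim=2)                # (batch, frames, 3, hidden)
        b, t, m, h = fused.shape
        fused = fused.reshape(b * t, m, h)
        attended, _ = self.attn(fused, fused, fused)
        frame_repr = attended.mean(dim=1).reshape(b, t, h)   # (batch, frames, hidden)
        au_intensity = self.au_head(frame_repr)               # per-frame AU intensities
        label_logits = self.label_head(frame_repr.mean(dim=1))  # one label per segment
        return au_intensity, label_logits


# Multi-task loss: AU intensity regression plus feedback classification,
# optimized jointly as in the multi-task learning setting described above.
def multitask_loss(au_pred, au_true, label_logits, label_true, alpha=0.5):
    reg = nn.functional.mse_loss(au_pred, au_true)
    clf = nn.functional.cross_entropy(label_logits, label_true)
    return alpha * reg + (1 - alpha) * clf
```

In this kind of setup, the shared fused representation lets the AU regression and label classification tasks regularize each other, which is one plausible reason the multi-task model reported below attains a lower loss than a single-task baseline.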
In the evaluation of feedback label prediction, performance was biased toward certain categories. For AU intensity prediction, the multi-task model achieved a lower loss than the single-task model, indicating that joint training yields a better model.