11:00 AM - 11:20 AM
[1P1-GS-10-04] Quantifying a Multi-person Meeting based on Multi-modal Micro-behavior Analysis
[[Online]]
Keywords: Online Meeting Quantifying, Multi-modal, Active Speaker Detection
In this paper, we present an end-to-end online meeting quantifying system that accurately detects and quantifies three micro-behavior indicators, speaking, nodding, and smiling, for online meeting evaluation. For active speaker detection (ASD), we build a multi-modal neural network framework consisting of audio and video temporal encoders, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism that captures long-term speaking evidence. For nodding detection, we use the WHENet framework from the field of head pose estimation (HPE) to estimate head pitch angles as the nodding feature, and then build a gated recurrent unit (GRU) network with a squeeze-and-excitation (SE) module to recognize nodding movements in video. Finally, we use a Haar cascade classifier for smile detection. Experimental results under K-fold cross-validation show that the three detection modules achieve F1-scores of 94.9%, 79.67%, and 71.19%, respectively.
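As an illustration of the nodding-detection stage described above, the following is a minimal PyTorch sketch of a GRU classifier with a squeeze-and-excitation (SE) module operating on a sequence of head pitch angles (as produced by a head-pose estimator such as WHENet). All hyperparameters (hidden size, SE reduction ratio, sequence length) and the class names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a GRU + squeeze-and-excitation (SE) nodding classifier.
# Hidden size, SE reduction, and sequence length are illustrative assumptions.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Channel-wise squeeze-and-excitation over GRU feature channels."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels); squeeze over time, excite per channel
        weights = self.fc(x.mean(dim=1))            # (batch, channels)
        return x * weights.unsqueeze(1)             # rescale every time step


class NoddingGRU(nn.Module):
    """Binary nodding detector over a sequence of head pitch angles."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.se = SEBlock(hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, pitch: torch.Tensor) -> torch.Tensor:
        # pitch: (batch, time) head pitch angles per frame, e.g. from WHENet
        feats, _ = self.gru(pitch.unsqueeze(-1))    # (batch, time, hidden)
        feats = self.se(feats)
        return self.head(feats[:, -1])              # logit for "nodding"


if __name__ == "__main__":
    model = NoddingGRU()
    dummy_pitch = torch.randn(8, 30)                # 8 clips, 30 frames each
    print(torch.sigmoid(model(dummy_pitch)).shape)  # torch.Size([8, 1])
```

The SE block here reweights GRU feature channels from a temporal average, which is one common way to combine SE with recurrent features; the paper's exact placement of the SE module may differ.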