Multimodal conversational response generation based on video understanding module with space-time attention

Yoshihiro Yamazaki

5:40 PM - 6:00 PM

[1O5-GS-7-05] Multimodal conversational response generation based on video understanding module with space-time attention

〇Yoshihiro Yamazaki¹, Shota Orihashi¹, Ryo Masumura¹, Mihiro Uchida¹, Akihiko Takashima¹ (1. NTT Computer and Data Science Laboratories, NTT Corporation)

[Online]

Keywords: Multimodal dialog, Video understanding, End-to-End response generation

Many attempts have been made to build multimodal dialog systems that can answer questions about given audio-visual information; a representative task for such systems is Audio Visual Scene-Aware Dialog (AVSD). To understand visual information, most conventional AVSD models adopt a Convolutional Neural Network (CNN)-based video feature extractor. Although CNNs tend to extract spatially and temporally local information, global information is also crucial for video understanding, since AVSD requires long-term temporal visual dependencies and information about the visual scene as a whole. In this study, we apply a video understanding module with space-time attention that can capture spatially and temporally global representations more efficiently than a CNN-based module. We observed that our AVSD model achieved high objective and subjective performance scores for answer generation.
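
To illustrate what "spatially and temporally global" can mean in practice, the following is a minimal sketch of joint space-time self-attention over video patch tokens. It is not the authors' module: the abstract does not specify the architecture, so the single-head formulation, the function name `joint_space_time_attention`, the tensor sizes, and the random projection matrices are all assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_space_time_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over all patches of all frames.

    tokens: array of shape (T, N, D), i.e. T frames, N patches per frame,
            D-dimensional patch embeddings (hypothetical sizes).
    Returns the attended tokens (T, N, D_out) and the attention weights
    of shape (T*N, T*N).
    """
    t, n, d = tokens.shape
    # Flatten space and time into one sequence so that every patch can
    # attend to every other patch in every frame.
    x = tokens.reshape(t * n, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    out = weights @ v
    return out.reshape(t, n, -1), weights


# Toy input: 4 frames, 9 patches per frame, 16-dim embeddings (assumed values).
rng = np.random.default_rng(0)
T, N, D = 4, 9, 16
tokens = rng.normal(size=(T, N, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

out, weights = joint_space_time_attention(tokens, w_q, w_k, w_v)
```

In this sketch each row of `weights` spans all `T * N` positions, so a single layer already mixes information across the whole clip, whereas a convolution only mixes information within its kernel's local neighborhood.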

Presentation information

[1O5-GS-7] Vision, speech media processing: clustering / generation
