JSAI2022

Presentation information

General Session

General Session » GS-7 Vision, speech media processing

[1O5-GS-7] Vision, speech media processing: clustering / generation

Tue. Jun 14, 2022 4:20 PM - 6:00 PM Room O (Room 510)

Chair: Shuhei Yoshida (NEC) [Remote]

5:40 PM - 6:00 PM

[1O5-GS-7-05] Multimodal conversational response generation based on video understanding module with space-time attention

〇Yoshihiro Yamazaki1, Shota Orihashi1, Ryo Masumura1, Mihiro Uchida1, Akihiko Takashima1 (1. NTT Computer and Data Science Laboratories, NTT Corporation)

[Online]

Keywords:Multimodal dialog, Video understanding, End-to-End response generation

Many attempts have been made to build multimodal dialog systems that can answer questions about given audio-visual information; a representative task for such systems is Audio Visual Scene-Aware Dialog (AVSD). For understanding visual information, most conventional AVSD models adopt a Convolutional Neural Network (CNN)-based video feature extractor. Although a CNN tends to extract spatially and temporally local information, global information is also crucial for video understanding, since AVSD requires modeling long-term temporal dependencies and whole-scene visual information. In this study, we apply a video understanding module with space-time attention that can capture spatially and temporally global representations more efficiently than a CNN-based module. We observed that our AVSD model achieved high objective and subjective performance scores for answer generation.
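The abstract does not give implementation details, but the space-time attention it describes is commonly realized as divided attention: each patch first attends across frames (temporal), then across the patches of its own frame (spatial), so every position can aggregate globally along both axes. Below is a minimal NumPy sketch of that idea under assumed conventions (single head, no learned Q/K/V projections, input shaped frames x patches x dims); it is an illustration, not the authors' actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the second-to-last axis
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def divided_space_time_attention(x):
    """x: (T, N, D) video features -- T frames, N patches, D dims.
    Sketch only: self-attention with identity projections, applied
    first along time, then along space, each with a residual add."""
    # temporal attention: each patch position attends across all frames
    xt = x.transpose(1, 0, 2)        # (N, T, D): attention runs over T
    xt = xt + attention(xt, xt, xt)  # residual temporal attention
    # spatial attention: each frame attends across its own patches
    xs = xt.transpose(1, 0, 2)       # back to (T, N, D): runs over N
    xs = xs + attention(xs, xs, xs)  # residual spatial attention
    return xs

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16, 32))  # 8 frames, 16 patches, 32-dim features
out = divided_space_time_attention(video)
print(out.shape)  # (8, 16, 32): same shape, globally mixed features
```

Compared with attending over all T x N tokens at once, the divided scheme costs O(T^2 N + N^2 T) rather than O(T^2 N^2) score computations, which is one way such a module can be "more efficient" than joint global attention while still being global along each axis.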
