10:00 AM - 10:20 AM
[2Q1-GS-10-04] Multi-Step Spatio-Temporal Reasoning for Open-Ended Video Question Answering
Keywords: Question Answering, Video
This study tackles open-ended video question answering (Video QA), which aims to generate correct text answers to questions about video content. Although a video consists of sequential frames that involve multiple objects, methods that capture both spatial and temporal structure remain underexplored. Moreover, existing methods that focus mainly on the temporal structure of video still achieve competitive performance on public Video QA datasets. For more complex and precise reasoning, however, it is essential to jointly model the spatial and temporal structures of video, guided by the content of the textual question. In this paper, we propose a two-stream spatiotemporal MAC network that performs question-aware sequential reasoning over video frames to infer correct answers. Our network computes spatial and temporal representations of a video by combining clip-wise motion features with frame-wise object-based appearance features, and then weights them with spatial and temporal attention to output answers. Moreover, it can repeatedly refine this reasoning process by attending to both the objects and the clips relevant to important words of the question. On both short- and long-form open-ended Video QA datasets, our new model significantly outperforms state-of-the-art Video QA models.
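The question-guided, multi-step attention described in the abstract can be illustrated with a minimal sketch. This is not the proposed two-stream spatiotemporal MAC network. It is a simplified stand-in that assumes plain dot-product attention, a mean-pooled question vector, and a fixed averaging rule for the memory update. The names `attend` and `reason`, the feature shapes, and the update weights are all hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, feats):
    """Scaled dot-product attention of one query vector over rows of feats.

    Returns the attention-weighted sum of the rows and the weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, row)) / scale for row in feats]
    weights = softmax(scores)
    summary = [sum(w * row[i] for w, row in zip(weights, feats))
               for i in range(len(query))]
    return summary, weights

def reason(word_feats, object_feats, clip_feats, steps=3):
    """Hypothetical multi-step reasoning loop (illustrative only).

    word_feats:   n_words x d            question word features
    object_feats: n_frames x n_objects x d  frame-wise appearance features
    clip_feats:   n_clips x d            clip-wise motion features
    """
    d = len(word_feats[0])
    # Flatten objects across frames so spatial attention ranges over all of them.
    flat_objects = [obj for frame in object_feats for obj in frame]
    # Assumed question summary: mean of the word features.
    question = [sum(w[i] for w in word_feats) / len(word_feats) for i in range(d)]
    memory = [0.0] * d
    for _ in range(steps):
        # Pick the question words that matter at this step, given the memory.
        query = [q + m for q, m in zip(question, memory)]
        control, _ = attend(query, word_feats)
        # Spatial stream: attend over objects. Temporal stream: attend over clips.
        spatial, obj_w = attend(control, flat_objects)
        temporal, clip_w = attend(control, clip_feats)
        # Assumed update rule: blend old memory with both stream summaries.
        memory = [0.5 * m + 0.25 * (s + t)
                  for m, s, t in zip(memory, spatial, temporal)]
    return memory, obj_w, clip_w

# Toy input: 2 question words, 2 frames with 2 objects each, 3 clips, d = 2.
words = [[1.0, 0.0], [0.0, 1.0]]
objects = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory, obj_w, clip_w = reason(words, objects, clips, steps=2)
```

In this sketch, each pass recomputes the word-level control vector from the current memory, so later steps can shift attention to different objects and clips. A trained model would replace the fixed blend with learned projections and would decode the final memory into an answer.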