Presentation information

General Session

[4I1-GS-7b] Vision, speech media processing: Multimodal processing

Fri. Jun 11, 2021 9:00 AM - 10:40 AM Room I (GS room 4)

Chair: Kenta Ishihara (NEC)

9:00 AM - 9:20 AM

[4I1-GS-7b-01] Building a Video-and-Language Dataset with Human Actions for Multimodal Inference

〇Mai Yokozeki1, Natsuki Murakami1, Riko Suzuki1, Hitomi Yanaka2,1, Koji Mineshima3, Daisuke Bekki1 (1. Ochanomizu University, 2. RIKEN, 3. Keio University)

Keywords: Multimodal Inference, Visual Textual Entailment, Video Dataset

This paper introduces a new video-and-language dataset with human actions for multimodal inference. The dataset consists of 200 videos, 5554 action labels, and 1942 action triplets of the form <subject, action, object>. The action labels contain various expressions, such as aspectual and intentional phrases, that are characteristic of videos but do not appear in existing video and image datasets. The dataset is expected to be useful for evaluating multimodal inference systems that relate videos to semantically complex sentences, such as those involving negation and quantity.
