JSAI2024

Presentation information

Poster Session

[4Xin2] Poster session 2

Fri. May 31, 2024 12:00 PM - 1:40 PM Room X (Event hall 1)

[4Xin2-52] Unedited Fixed-viewpoint Procedural Videos with Language Resources for Understanding Cooking Activities

〇Atsushi Hashimoto1, Koki Maeda1,3, Tosho Hirasawa1,4, Jun Harashima2, Leszek Rybicki2, Yusuke Fukasawa2, Yoshitaka Ushiku1 (1.OMRON SINIC X Corp., 2.Cookpad Inc., 3.Tokyo Institute of Technology, 4.Tokyo Metropolitan University)

Keywords: Vision and Language, Procedural Text Understanding, Video Analysis

Large-scale web videos have contributed significantly to recent progress in video analysis techniques. At the same time, the domain gap between web videos and unedited videos still limits vision-language applications to pairings of text with edited videos. Egocentric vision datasets are being actively collected to overcome this problem; as another format of unedited video, this paper provides a dataset of fixed-viewpoint unedited videos (FV videos). FV videos can be obtained effortlessly with commercial smartphones. We collected 145 videos, totaling 40 hours of footage, in which participants prepare food based on given recipes. We manually added action graphs that tie the videos to the procedural texts while identifying the workflow of the process. In addition, we propose two benchmark tasks on this dataset: online recipe retrieval (OnRR) and dense video captioning on FV videos (DVC-FV). Experimental results demonstrated that recent SoTA methods cannot solve OnRR and DVC-FV trivially.