10:20 AM - 10:40 AM
[4I1-GS-11-05] Expansion of Action Recognition Datasets Using Video Caption Generation
Keywords: Human Action Recognition, Meta Video Dataset, Action Label Integration, Video Caption Generation, Large Vision Language Model
Existing action recognition datasets typically assign a single representative action label to each video and often fail to comprehensively annotate the multiple actions occurring within the same video. Expanding these datasets to allow multiple action labels per video could enhance their utility and improve the accuracy of action recognition models. This study proposes extending action recognition datasets by leveraging video caption generation with Large Vision Language Models (LVLMs). The approach first generates captions for target videos using an LVLM. Subsequently, a Large Language Model (LLM) extracts action labels from the Meta Video Dataset (MetaVD) that correspond to the generated captions and assigns them to the target videos. Additionally, action labels with an "equal" relationship to the extracted labels within MetaVD are also assigned to the target videos. To evaluate the proposed method, we use the HMDB51 dataset and compute the recovery rate of HMDB51 action labels based on those assigned by our approach. Furthermore, a human assessment is conducted to validate the accuracy and relevance of the assigned labels, demonstrating the effectiveness of the proposed method.
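The three-stage pipeline in the abstract (caption generation, label extraction, "equal"-relation expansion) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the LVLM and LLM calls are stubbed with placeholders, and the MetaVD "equal" relation is a toy dictionary standing in for the real dataset.

```python
# Hedged sketch of the labeling pipeline described in the abstract.
# All model calls and data below are illustrative assumptions.

# Toy stand-in for MetaVD's "equal" relation between action labels.
METAVD_EQUAL = {
    "drink": {"drinking"},
    "drinking": {"drink"},
}


def generate_caption(video_path: str) -> str:
    # Placeholder for an LVLM captioning call (hypothetical).
    return "a person is drinking water from a bottle"


def extract_labels(caption: str, vocabulary: set[str]) -> set[str]:
    # Placeholder for LLM-based extraction; a naive substring match
    # over the MetaVD label vocabulary stands in for prompting.
    return {label for label in vocabulary if label in caption}


def expand_with_equal(labels: set[str], equal_relation: dict) -> set[str]:
    # Also assign every label linked by an "equal" edge in MetaVD.
    expanded = set(labels)
    for label in labels:
        expanded |= equal_relation.get(label, set())
    return expanded


vocabulary = {"drink", "drinking", "run", "smile"}
caption = generate_caption("video_001.mp4")
labels = extract_labels(caption, vocabulary)
final_labels = expand_with_equal(labels, METAVD_EQUAL)
```

In the paper's setting, the resulting multi-label set would then be compared against the original HMDB51 label of each video to compute the recovery rate.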