10:20 AM - 10:40 AM
[4I1-GS-11-05] Expansion of Action Recognition Datasets Using Video Caption Generation
Keywords: Human Action Recognition, Meta Video Dataset, Action Label Integration, Video Caption Generation, Large Vision Language Model
Existing action recognition datasets typically assign a single representative action label to each video and often fail to comprehensively annotate the multiple actions occurring within the same video. Expanding these datasets to allow multiple action labels per video could enhance their utility and improve the accuracy of action recognition models. This study proposes extending action recognition datasets by leveraging video caption generation with Large Vision Language Models (LVLMs). The approach first generates captions for target videos using an LVLM. Subsequently, a Large Language Model (LLM) extracts action labels from the Meta Video Dataset (MetaVD) that correspond to the generated captions and assigns them to the target videos. Additionally, action labels with an "equal" relationship to the extracted labels within MetaVD are also assigned to the target videos. To evaluate the proposed method, we use the HMDB51 dataset and compute the recovery rate of HMDB51 action labels based on those assigned by our approach. Furthermore, a human assessment is conducted to validate the accuracy and relevance of the assigned labels, demonstrating the effectiveness of the proposed method.
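The three-stage pipeline in the abstract (caption generation, label extraction, "equal"-relation expansion) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the LVLM and LLM calls are stubbed with placeholders, and the MetaVD "equal" relation is a toy dictionary standing in for the real dataset.

```python
# Hedged sketch of the labeling pipeline described in the abstract.
# All model calls and data below are illustrative assumptions.

# Toy stand-in for MetaVD's "equal" relation between action labels.
METAVD_EQUAL = {
    "drink": {"drinking"},
    "drinking": {"drink"},
}


def generate_caption(video_path: str) -> str:
    # Placeholder for an LVLM captioning call (hypothetical).
    return "a person is drinking water from a bottle"


def extract_labels(caption: str, vocabulary: set[str]) -> set[str]:
    # Placeholder for LLM-based extraction; a naive substring match
    # over the MetaVD label vocabulary stands in for prompting.
    return {label for label in vocabulary if label in caption}


def expand_with_equal(labels: set[str], equal_relation: dict) -> set[str]:
    # Also assign every label linked by an "equal" edge in MetaVD.
    expanded = set(labels)
    for label in labels:
        expanded |= equal_relation.get(label, set())
    return expanded


vocabulary = {"drink", "drinking", "run", "smile"}
caption = generate_caption("video_001.mp4")
labels = extract_labels(caption, vocabulary)
final_labels = expand_with_equal(labels, METAVD_EQUAL)
```

In the paper's setting, the resulting multi-label set would then be compared against the original HMDB51 label of each video to compute the recovery rate.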