6:00 PM - 6:20 PM
[3N6-GS-7-02] A Preliminary Study on Behavioral Analysis Using Vision and Language Foundation Model for Automobile Assembly Work Videos
[Online]
Keywords: Multimodal Foundation Model, Behavioral Analysis, Temporal Action Segmentation, Natural Language Processing, Video Image Processing
There is a growing demand for behavior analysis of workers in automobile manufacturing to automate the monitoring of compliance with work procedures and the measurement of each task's duration. Previous deep-neural-network methods for behavior analysis require frame-by-frame video labels for supervised training, so the shortage of labeled data has become a significant challenge. In recent years, Vision and Language Models (VLMs), which acquire shared embeddings between images and text through large-scale pretraining, have attracted attention as a type of foundation model. By leveraging VLMs, models can be built more efficiently even in domains that traditionally required large amounts of labeled training data. Therefore, this study proposes a method that utilizes the language modality by applying CLIP (Contrastive Language-Image Pre-training), a representative VLM, to behavior analysis in automobile assembly videos. In particular, the study verifies whether leveraging the language modality enables a model to be constructed with only a small amount of labeled training data.
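As background for the approach described above, the following is a minimal sketch of how CLIP's shared image-text embedding space can score individual video frames against textual descriptions of assembly actions. It is not the authors' implementation: the checkpoint name, the action prompts, and the frame-level zero-shot classification setup are illustrative assumptions.

```python
# Sketch: zero-shot frame labeling with CLIP via Hugging Face transformers.
# The action prompts below are hypothetical examples, not the paper's label set.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical textual descriptions of assembly tasks (one prompt per action class).
action_prompts = [
    "a worker tightening a bolt on a car door",
    "a worker attaching a wiring harness",
    "a worker inspecting the assembled part",
]

def classify_frame(frame: Image.Image) -> int:
    """Return the index of the action prompt most similar to the given frame."""
    inputs = processor(text=action_prompts, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the scaled image-text similarities for each prompt.
    probs = outputs.logits_per_image.softmax(dim=-1)
    return int(probs.argmax(dim=-1).item())
```

Running such a classifier over consecutive frames yields a per-frame action sequence that could then be smoothed or segmented temporally; whether this matches the paper's actual pipeline is not stated in the abstract.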