JSAI2024

Presentation information

General Session

General Session » GS-7 Vision, speech media processing

[2C6-GS-7] Language media processing

Wed. May 29, 2024 5:30 PM - 7:10 PM Room C (Temporary room 1)

Chair: Naoyuki Terashita (Hitachi, Ltd.)

6:10 PM - 6:30 PM

[2C6-GS-7-03] Action Recognition of Public Spaces Using Multi-Modal Model

〇Masahiro Okano1, Ryuto Yoshida1, Junichiro Fujii1, Shuji Takamori1, Masazumi Amakata1 (1. Yachiyo Engineering Co., Ltd.)

Keywords: Multi-Modal Model, VQA, Action Recognition

In promoting smart cities, there is demand for evaluating both the quantity and the quality of activities in public spaces. Research on labor-saving AI for assessing the quantity of activities is progressing, but research on labor-saving quality assessment has only just begun. Previous AI models for labor-saving qualitative evaluation of public spaces faced issues such as 1) high model creation costs and 2) low model versatility, and therefore did not achieve sufficient labor savings. To address this problem, this study proposes a method for recognizing actions in public spaces using a multimodal model. A multimodal model integrates multiple data sources, and offers strengths such as 1) zero model creation cost and 2) high model versatility. By quantitatively evaluating the performance of a multimodal model for qualitative evaluation on small-scale video data, this study demonstrates the potential of multimodal models for labor saving in public space evaluation.
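As a rough illustration of the approach the abstract describes, the sketch below applies an off-the-shelf visual question answering model to frames sampled from public-space video, with no task-specific training (hence "zero model creation cost"). The specific model (Salesforce/blip-vqa-base via Hugging Face transformers), the question wording, and the frames/ directory are assumptions for illustration only; the abstract does not name the multimodal model the authors used.

```python
# Minimal sketch: zero-shot, VQA-based action recognition on video frames.
# The model choice and question phrasing are assumptions, not the paper's setup.
from pathlib import Path

from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

MODEL_ID = "Salesforce/blip-vqa-base"  # assumed off-the-shelf VQA model
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForQuestionAnswering.from_pretrained(MODEL_ID)

def recognize_action(frame_path: str,
                     question: str = "What activity are the people doing?") -> str:
    """Ask the VQA model a free-form question about one video frame."""
    image = Image.open(frame_path).convert("RGB")
    inputs = processor(image, question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical usage: label actions in frames pre-extracted from video.
for frame in sorted(Path("frames").glob("*.jpg")):
    print(frame.name, "->", recognize_action(str(frame)))
```

Because the model is used as-is, evaluating it then reduces to comparing its free-form answers against human quality annotations on a small video dataset, which matches the evaluation strategy the abstract outlines.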
