State Recognition based on Large-Scale Vision-Language Model and Evolutionary Computation for Daily Assistive Robots

Kento Kawaharazuka

10:00 AM - 10:20 AM

[3G1-OS-24a-04] State Recognition based on Large-Scale Vision-Language Model and Evolutionary Computation for Daily Assistive Robots

〇Kento Kawaharazuka¹, Yoshiki Obinata¹, Naoaki Kanazawa¹, Kei Okada¹, Masayuki Inaba¹ (1. The University of Tokyo)

Keywords: Large-Scale Vision-Language Model, Robotics, Daily Life

In this study, we perform environmental state recognition for daily assistive robots using Visual Question Answering (VQA) with a Pre-Trained Vision-Language Model (PTVLM).
In VQA, we integrate results from multiple randomized images and various questions with different forms, articles, state expressions, and wording.
Since the states that can be recognized correctly differ from question to question, we select an appropriate combination of questions by optimizing it with evolutionary computation.
This makes it possible to recognize the states of transparent doors and water, which have so far been difficult to recognize.
We believe that this idea will revolutionize the recognition strategies of robots, since it does not require retraining of the network or programming, and a complex recognizer can be easily constructed by simply providing a set of appropriate questions to a single model.
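
The idea of selecting a combination of questions with evolutionary computation can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`vote`, `accuracy`, `evolve`), the majority-vote integration rule, the simple genetic algorithm, and the toy yes/no answers below are all assumptions made for illustration. In practice, each answer would come from a VQA query to the vision-language model.

```python
import random

def vote(answers, mask):
    """Majority vote over the yes/no answers of the selected questions."""
    selected = [a for a, m in zip(answers, mask) if m]
    if not selected:
        return False  # hypothetical default when no question is selected
    return 2 * sum(selected) > len(selected)

def accuracy(mask, samples):
    """Fraction of labelled samples the selected question subset gets right."""
    return sum(vote(ans, mask) == label for ans, label in samples) / len(samples)

def evolve(samples, n_questions, pop_size=20, generations=30,
           mutation_rate=0.1, seed=0):
    """Simple genetic algorithm over boolean masks (one bit per question)."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_questions)]
                  for _ in range(pop_size)]
    population[0] = [True] * n_questions  # baseline: use every question
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: accuracy(m, samples),
                        reverse=True)
        parents = ranked[: pop_size // 2]
        children = [ranked[0]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_questions)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability mutation_rate
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    return max(population, key=lambda m: accuracy(m, samples))

# Toy data (assumed): each sample is (answers of 5 questions, true state).
# Questions 0 and 1 are reliable, question 2 always says "yes",
# questions 3 and 4 are systematically wrong.
samples = [
    ([True, True, True, False, False], True),
    ([False, False, True, True, True], False),
    ([True, True, True, False, False], True),
    ([False, False, True, True, True], False),
]

best = evolve(samples, n_questions=5)
print(best, accuracy(best, samples))
```

Because the initial population includes the mask that uses every question and the best mask is always carried over, the selected combination is never worse on the labelled samples than asking all questions at once.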

Presentation information

[3G1-OS-24a] Daily Life Knowledge and AI
