JSAI2024

Presentation information

[3O1-OS-16b] OS-16

Thu. May 30, 2024 9:00 AM - 10:40 AM Room O (Music studio hall)

Organizers: Masahiro Suzuki (The University of Tokyo), Yusuke Iwasawa (The University of Tokyo), Makoto Kawano (The University of Tokyo), Wataru Kumagai (The University of Tokyo), Tatsuya Matsushima (The University of Tokyo), Yusuke Mori (Square Enix Co., Ltd.), Yutaka Matsuo (The University of Tokyo)

9:20 AM - 9:40 AM

[3O1-OS-16b-02] Task Success Prediction on Large-Scale Object Manipulation Datasets Based on Multimodal LLMs and Vision-Language Foundation Models

〇Daichi Saito1, Motonari Kambara1, Katsuyuki Kuyo1, Komei Sugiura1 (1. Keio University)

Keywords: Manipulator, Object Manipulation, Vision-and-Language, Multimodal LLM, Success Prediction

High-performance task success prediction mechanisms are crucial for improving model performance on object manipulation tasks. However, the performance of existing methods remains insufficient. Moreover, existing prediction mechanisms are designed for specific tasks, making it difficult to accommodate a diverse range of tasks. Therefore, our study aims to develop a task success prediction mechanism that can handle multiple object manipulation tasks. A key novelty of the proposed method is the introduction of the λ-Representation, which preserves all three types of visual features: visual characteristics such as colors and shapes, features aligned with natural language, and features structured through natural language. For our experiments, we built new datasets for task success prediction in object manipulation tasks based on the RT-1 dataset and VLMbench. The results show that the proposed method outperforms all baseline methods in accuracy.
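The abstract names three complementary feature types but does not specify how they are combined. The following is a minimal, hypothetical PyTorch sketch of how such a λ-Representation could be assembled for binary success prediction; the placeholder encoders and the concatenation-based fusion are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a lambda-Representation for task success
# prediction. All encoder choices below are illustrative placeholders,
# not the authors' implementation.
import torch
import torch.nn as nn

class LambdaRepresentation(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # (1) Visual characteristics (colors, shapes): linear placeholder
        #     standing in for a CNN/ViT image backbone.
        self.visual_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        # (2) Features aligned with natural language: placeholder standing
        #     in for a CLIP-style image encoder.
        self.aligned_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        # (3) Features structured through natural language: projection of a
        #     precomputed text embedding, e.g. of a caption produced by a
        #     multimodal LLM.
        self.structured_enc = nn.LazyLinear(dim)
        # Binary success/failure head over the concatenated representation.
        self.head = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, image: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
        # Keep the three feature types separate until the final fusion,
        # so each is preserved in the combined representation.
        z = torch.cat([
            self.visual_enc(image),
            self.aligned_enc(image),
            self.structured_enc(caption_emb),
        ], dim=-1)
        return self.head(z)  # logit for task success

model = LambdaRepresentation()
logit = model(torch.randn(2, 3, 224, 224), torch.randn(2, 768))
prob = torch.sigmoid(logit)  # predicted probability of task success
```

Keeping the three encoders separate before fusion mirrors the abstract's claim that the λ-Representation preserves all three types of visual features rather than collapsing them into a single embedding.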
