JSAI2024

Presentation information

Organized Session

[2O6-OS-16a] OS-16

Wed. May 29, 2024 5:30 PM - 6:50 PM Room O (Music studio hall)

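Organizers: Masahiro Suzuki (The University of Tokyo), Yusuke Iwasawa (The University of Tokyo), Makoto Kawano (The University of Tokyo), Wataru Kumagai (The University of Tokyo), Tatsuya Matsushima (The University of Tokyo), Yusuke Mori (Square Enix Co., Ltd.), Yutaka Matsuo (The University of Tokyo)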

5:50 PM - 6:10 PM

[2O6-OS-16a-02] Referring Expression Segmentation using Optimal Transport Polygon Matching with Multimodal Foundation Models

〇Takayuki Nishimura1, Katsuyuki Kuyo1, Motonari Kambara1, Komei Sugiura1 (1. Keio University)

Keywords: Domestic Service Task, Referring Expression Segmentation, Polygon Matching Using Optimal Transportation, Multimodal Foundation Model, 3D point cloud

In home environments, where object locations change frequently, robots need to grasp the latest positions of objects quickly and accurately. This study therefore addresses the OSMI-3D task, in which target objects are identified from instructions given by users. We propose a method for efficiently manipulating objects in a home environment using referring expression segmentation based on 3D point cloud data, utilizing both a visual foundation model and a multimodal LLM. The main novelty of this study is the introduction of a Scene Narrative Module, which combines the multimodal LLM with existing image feature extractors to extract structural features from images with language as an intermediary. In experiments, our method outperformed baseline methods in terms of mean IoU and precision at IoU thresholds of 0.5-0.9, confirming its effectiveness on the OSMI-3D task.
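For readers unfamiliar with the reported metrics, the following is a minimal sketch of how mean IoU and precision at an IoU threshold are commonly computed for segmentation masks. It is not the authors' evaluation code: the function names (`iou`, `mean_iou`, `precision_at`), the list-of-lists mask format, the use of `>=` at the threshold, and the handling of two empty masks are all assumptions made for illustration.

```python
from typing import Sequence

# Assumption: a mask is a 2D grid of 0/1 values, where 1 marks a target pixel.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Average of the per-sample IoU scores."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def precision_at(preds: Sequence[Mask], gts: Sequence[Mask], threshold: float) -> float:
    """Fraction of samples whose IoU reaches the threshold (assumed: >=)."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Toy data: the first prediction overlaps half of the union, the second is exact.
example_preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
example_gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]

print(mean_iou(example_preds, example_gts))
for t in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(t, precision_at(example_preds, example_gts, t))
```

Reporting precision at several thresholds alongside mean IoU shows how many predictions are of high quality, whereas mean IoU alone can hide a mix of very good and very poor masks.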
