〇Takayuki Nishimura1, Katsuyuki Kuyo1, Motonari Kambara1, Komei Sugiura1
(1. Keio University)
Keywords: Domestic Service Task, Referring Expression Segmentation, Polygon Matching Using Optimal Transportation, Multimodal Foundation Model, 3D Point Cloud
In home environments, where the locations of objects frequently change, it is important for robots to quickly and accurately grasp the latest positions of these objects. This study therefore addresses the OSMI-3D task, in which target objects are identified based on instructions given by users. We propose a method for efficiently manipulating objects in a home environment using referring expression segmentation on 3D point cloud data, utilizing both a visual foundation model and a multimodal LLM. The main novelty of this study is the introduction of a Scene Narrative Module, which combines the multimodal LLM with existing image feature extractors to extract structural features from images with language as an intermediary. In experiments, our method outperformed baseline methods in terms of mean IoU and precision@0.5–0.9, confirming its effectiveness on the OSMI-3D task.
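The evaluation metrics mentioned above are standard for segmentation. As a minimal sketch (not the authors' evaluation code), mean IoU and precision@K over a set of predicted and ground-truth binary masks can be computed as follows; the function names and mask representation here are illustrative assumptions:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of identical shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

def mean_iou(preds, gts) -> float:
    """Average IoU over paired (prediction, ground truth) masks."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

def precision_at(preds, gts, thresh: float) -> float:
    """Fraction of samples whose IoU meets or exceeds `thresh`
    (precision@0.5 ... precision@0.9 in the abstract)."""
    return float(np.mean([iou(p, g) >= thresh for p, g in zip(preds, gts)]))
```

For example, a prediction overlapping half of a ground-truth region yields IoU 0.5, which counts as a hit for precision@0.5 but a miss for precision@0.9.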