4:30 PM - 4:50 PM
[3O5-OS-16c-04] Large-Scale Indoor Search Engine with Multimodal Foundation Models and Relaxing Contrastive Loss
Keywords: Learning to rank
In this paper, we address the task of learning to rank physical objects, in which images of objects in large-scale indoor environments are ranked according to open-vocabulary user instructions. We introduce the GREP module, which constructs visual features at the image, target-object, relative-position, and pixel granularities. Additionally, we introduce the RCS module, which learns efficiently from the redundant images captured in indoor environments. Our method outperformed baseline methods on the newly constructed YAGAMI dataset and an extended LTRRIE-subset, showing significant improvements on the standard metrics.
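The abstract does not specify the RCS ("relaxing contrastive loss") formulation. As a rough, hypothetical illustration of the general idea (down-weighting redundant near-duplicate images as negatives rather than pushing them away at full strength), one could sketch an InfoNCE-style loss with per-image negative weights; the function name, arguments, and weighting scheme below are assumptions, not the paper's method:

```python
import math

def relaxed_contrastive_loss(sims, pos_idx, redundancy, tau=0.1):
    """InfoNCE-style loss over one query's similarity scores, where
    "redundant" gallery images are down-weighted as negatives.

    sims       : list of query-image similarity scores
    pos_idx    : index of the ground-truth (positive) image
    redundancy : per-image weight in [0, 1]; 1.0 counts an image as a full
                 negative, smaller values relax the penalty for near-duplicates
                 (the positive's own weight should be 1.0)
    tau        : softmax temperature
    """
    logits = [s / tau for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    # weighted partition function: redundant negatives contribute less
    z = sum(w * math.exp(l - m) for l, w in zip(logits, redundancy))
    return -(logits[pos_idx] - m) + math.log(z)
```

Under this sketch, lowering the redundancy weight of a hard near-duplicate negative reduces the loss for the same similarity scores, so training is not penalized for failing to separate images that are effectively interchangeable.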