JSAI2025

Presentation information

Poster Session

[1Win4] Poster session 1

Tue. May 27, 2025 3:30 PM - 5:30 PM Room W (Event hall D-E)

[1Win4-52] Scene Text Aware Multimodal Retrieval for Everyday Objects Based on Crosslingual Visual Prompts

〇Kento Tokura1, Ryosuke Korekata1, Takumi Komatsu1, Yuto Imai1, Komei Sugiura1 (1. Keio University)

Keywords: Multimodal Retrieval, Multilingual Visual Prompt, Domestic Service Robot

This study addresses a task in which a robot retrieves images containing a target object, specified by a user's natural-language query, from a large set of images captured in diverse indoor and outdoor environments.
Both images with and without scene text are considered. For example, given the query "Pass me the red container of Sun-Maid raisins on the kitchen counter," the model should rank images showing a container labeled "Sun-Maid raisins" on a kitchen counter higher. However, linking visual semantics with scene text is challenging. Moreover, retrieval over large image collections demands fast inference, making it impractical to rely solely on a multimodal large language model (MLLM).
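At its core, the task reduces to ranking an image pool by similarity to a text query. The following is a minimal sketch of that retrieval setup, not the authors' implementation: the encoders are stubbed with random vectors, and the names `encode_query` and `encode_image` are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_query(query: str) -> np.ndarray:
    # Placeholder: a real system would use a text encoder (e.g., CLIP-style).
    return rng.normal(size=512)

def encode_image(image_id: str) -> np.ndarray:
    # Placeholder: a real system would use a visual encoder that also
    # captures scene text (the role of the paper's Scene Text Visual Encoder).
    return rng.normal(size=512)

def rank_images(query: str, image_ids: list[str], top_k: int = 5) -> list[str]:
    """Rank images by cosine similarity between query and image embeddings."""
    q = encode_query(query)
    q = q / np.linalg.norm(q)
    feats = np.stack([encode_image(i) for i in image_ids])
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    scores = feats @ q  # cosine similarity per image
    order = np.argsort(-scores)[:top_k]
    return [image_ids[i] for i in order]

print(rank_images("Pass me the red container of Sun-Maid raisins "
                  "on the kitchen counter",
                  [f"img_{i:04d}" for i in range(1000)]))
```

Because only embedding comparisons run at query time, the expensive MLLM processing can be done once per image offline, which is what makes this setup scale where per-query MLLM calls do not.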
To address this, we introduce a Scene Text Visual Encoder, which integrates an Aligned Representation with a narrative representation obtained from an MLLM via Crosslingual Visual Prompting. Incorporating OCR results into the prompt further reduces hallucinations.
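As a rough illustration of how two such representations might be fused, here is a sketch under assumptions: the module name, dimensions, and prompt template below are invented for illustration and do not reflect the paper's actual architecture. One common pattern is to project each stream to a shared width and combine them with a learned layer.

```python
import torch
import torch.nn as nn

class SceneTextVisualEncoderSketch(nn.Module):
    """Hypothetical fusion of an aligned (CLIP-style) visual feature with a
    narrative feature derived from MLLM output. Not the authors' design."""

    def __init__(self, aligned_dim=512, narrative_dim=768, out_dim=512):
        super().__init__()
        self.proj_aligned = nn.Linear(aligned_dim, out_dim)
        self.proj_narrative = nn.Linear(narrative_dim, out_dim)
        self.fuse = nn.Linear(2 * out_dim, out_dim)

    def forward(self, aligned_feat, narrative_feat):
        a = self.proj_aligned(aligned_feat)
        n = self.proj_narrative(narrative_feat)
        return self.fuse(torch.cat([a, n], dim=-1))

# Grounding the MLLM's description in detected scene text is one way to
# curb hallucinated object labels (this template is an assumption).
def build_prompt(ocr_tokens: list[str]) -> str:
    return ("Describe the objects in this image. "
            f"Detected scene text: {', '.join(ocr_tokens)}.")

enc = SceneTextVisualEncoderSketch()
out = enc(torch.randn(2, 512), torch.randn(2, 768))
print(out.shape, build_prompt(["Sun-Maid", "raisins"]))
```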
Experiments show that the proposed method outperforms multimodal foundation models on standard ranking metrics across multiple benchmarks.
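For reference, standard ranking metrics for this kind of evaluation, such as Recall@K and mean reciprocal rank (MRR), can be computed as follows. This is generic evaluation code, not the paper's specific protocol.

```python
def recall_at_k(ranked_lists, ground_truth, k=10):
    """Fraction of queries whose relevant item appears in the top-k."""
    hits = sum(gt in ranked[:k]
               for ranked, gt in zip(ranked_lists, ground_truth))
    return hits / len(ranked_lists)

def mean_reciprocal_rank(ranked_lists, ground_truth):
    """Average of 1 / (rank of the relevant item); 0 if not retrieved."""
    total = 0.0
    for ranked, gt in zip(ranked_lists, ground_truth):
        if gt in ranked:
            total += 1.0 / (ranked.index(gt) + 1)
    return total / len(ranked_lists)

ranked = [["img_3", "img_1", "img_7"], ["img_2", "img_5", "img_9"]]
truth = ["img_1", "img_9"]
print(recall_at_k(ranked, truth, k=2))       # 0.5
print(mean_reciprocal_rank(ranked, truth))   # (1/2 + 1/3) / 2 ≈ 0.4167
```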
