[1Win4-52] Scene Text Aware Multimodal Retrieval for Everyday Objects Based on Crosslingual Visual Prompts
Keywords: Multimodal Retrieval, Multilingual Visual Prompt, Domestic Service Robot
This study addresses a task in which a robot, given a user's language query, retrieves images containing the target object from a large set of images captured in diverse indoor and outdoor environments.
Images both with and without scene text are considered. For example, given the query "Pass me the red container of Sun-Maid raisins on the kitchen counter," the model should rank images containing a container labeled "Sun-Maid raisins" on a kitchen counter higher. However, linking visual semantics with scene text is challenging. Additionally, multimodal search requires large-scale, high-speed inference, making it impractical to rely solely on a multimodal large language model (MLLM).
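This retrieval setting can be pictured as a dual-encoder ranking problem. The following is a minimal sketch, not the authors' implementation: `encode_text` and `encode_image` are hypothetical placeholders for unspecified encoders, and ranking is assumed to be by cosine similarity.

```python
# Minimal sketch of query-to-image ranking, assuming a CLIP-style
# dual encoder. `encode_text` / `encode_image` are hypothetical
# placeholders for whatever encoders the system actually uses.
import numpy as np

def rank_images(query_vec: np.ndarray, image_vecs: np.ndarray) -> np.ndarray:
    """Return image indices sorted by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    v = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    scores = v @ q              # one cosine score per image
    return np.argsort(-scores)  # highest-scoring image first

# Usage (shapes only; embeddings would come from the encoders):
# order = rank_images(encode_text("Pass me the red container of Sun-Maid "
#                                 "raisins on the kitchen counter"),
#                     encode_image(image_batch))
```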
To address this, we introduce a Scene Text Visual Encoder that integrates an Aligned Representation with a narrative representation obtained from an MLLM via Crosslingual Visual Prompting. Incorporating OCR results into the prompt further reduces hallucination.
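As a rough illustration of the two ideas above, the sketch below fuses an aligned (CLIP-like) image embedding with a narrative embedding derived from an MLLM description, and grounds the MLLM prompt in OCR output. The prompt wording, the `fuse` function, and the concatenate-then-project fusion are assumptions made for illustration, not the paper's exact design.

```python
# Hedged sketch: (1) ground the MLLM's narrative prompt with OCR output to
# curb hallucinated scene text, and (2) fuse the aligned and narrative
# embeddings into a single retrieval vector. All names here are illustrative.
import numpy as np

NARRATIVE_PROMPT = (
    "Describe the objects in this image. "
    "The following text was detected in the scene (OCR): {ocr}. "
    "Mention only text that is actually visible."
)

def build_prompt(ocr_tokens: list[str]) -> str:
    # Injecting OCR results constrains the MLLM to observed scene text.
    return NARRATIVE_PROMPT.format(ocr=", ".join(ocr_tokens))

def fuse(aligned: np.ndarray, narrative: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Concatenate the two representations and project them with a learned
    matrix `w` into one normalized retrieval vector."""
    z = np.concatenate([aligned, narrative])
    out = w @ z
    return out / np.linalg.norm(out)
```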
Experiments show that the proposed method outperforms multimodal foundation models across multiple benchmarks on standard evaluation metrics for ranking-based learning.
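The abstract does not name the specific metrics; mean reciprocal rank (MRR) and Recall@K below are common examples of standard ranking metrics and are shown only for reference.

```python
# Reference implementations of two common ranking metrics. The abstract does
# not specify which metrics were used; these are illustrative examples.
def mean_reciprocal_rank(rankings: list[list[int]], gold: list[int]) -> float:
    """Average of 1/rank of the gold image over all queries (rank is 1-based;
    queries whose gold image is missing from the ranking contribute 0)."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)

def recall_at_k(rankings: list[list[int]], gold: list[int], k: int) -> float:
    """Fraction of queries whose gold image appears in the top-k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)
```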