Improving Retrieval Accuracy of Multimodal RAG Systems by Generating Search Insights from Image-Containing Documents

Taku Fukui

9:00 AM - 9:20 AM

[4Q1-GS-10-01] Improving Retrieval Accuracy of Multimodal RAG Systems by Generating Search Insights from Image-Containing Documents

〇Taku Fukui¹, Satoshi Munakata¹ (1. Fujitsu Limited)

Keywords: AI, RAG, multimodal

When Retrieval-Augmented Generation (RAG) is applied to internal documents to improve business processes, the AI should ideally generate insights into the intent and purpose of a task, then retrieve relevant documents and answer from them. However, conventional RAG relies on the similarity between query and document embeddings, which makes it difficult to retrieve information from image-containing documents in which such insights are not stated explicitly. Existing Multi-Representation-Indexing methods, which convert image captions into embeddings, also lack this insight generation capability. This study proposes a method that generates insight sentences from image-containing documents to improve retrieval. Each document is decomposed page by page; for each page, an image caption is generated, followed by insight sentences and anticipated question-answer pairs. These generated texts are then converted into embeddings. Experiments on open datasets show that incorporating the generated insights improves retrieval accuracy over conventional approaches.
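The indexing flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the captions, insight sentences, and question-answer pairs in PAGES are hard-coded placeholders standing in for text a vision-language model would generate, the bag-of-words embed function is an assumed substitute for a real embedding model, and all names (Representation, build_index, retrieve) are hypothetical. The sketch only shows how several generated texts per page can be embedded separately and mapped back to their source page at query time.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Representation:
    page_id: str  # page the text was generated from
    kind: str     # "caption", "insight", or "qa"
    text: str


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; an assumed stand-in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(pages):
    # Embed every generated text separately, keeping a link to its page.
    index = []
    for page_id, generated in pages.items():
        for kind, texts in generated.items():
            for text in texts:
                index.append((Representation(page_id, kind, text), embed(text)))
    return index


def retrieve(query, index, top_k=3):
    # Score each representation, then keep the best-scoring one per page.
    query_vec = embed(query)
    best = {}
    for rep, vec in index:
        score = cosine(query_vec, vec)
        if rep.page_id not in best or score > best[rep.page_id][0]:
            best[rep.page_id] = (score, rep.kind)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(page_id, score, kind) for page_id, (score, kind) in ranked[:top_k]]


# Placeholder outputs; in practice a model would generate these per page.
PAGES = {
    "p1": {
        "caption": ["Line chart of monthly support tickets from January to June."],
        "insight": ["Ticket volume fell after the FAQ chatbot launched, "
                    "suggesting the chatbot reduces support workload."],
        "qa": ["Q: Why did support tickets decrease? "
               "A: The FAQ chatbot launched in March."],
    },
    "p2": {
        "caption": ["Flow diagram of the expense approval process."],
        "insight": ["Approval requires two managers, which may delay reimbursement."],
        "qa": ["Q: Who approves expenses? "
               "A: The team manager and the finance manager."],
    },
}

INDEX = build_index(PAGES)

if __name__ == "__main__":
    for page_id, score, kind in retrieve("does the chatbot reduce support workload", INDEX):
        print(f"{page_id}  score={score:.3f}  matched via {kind}")
```

In this toy setup the query about workload shares little vocabulary with the caption of page p1, so the page is matched through its insight sentence instead. That is the gap the abstract attributes to caption-only indexing, although a real system would rely on a learned embedding model rather than word overlap.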

Presentation information

[4Q1-GS-10] AI application: