Visual Question Answering via Cross-Modal Retrieval-Augmented Generation of Large Language Model

Liyang Zhang

09:00 〜 09:20

[2O1-GS-3-01] Visual Question Answering via Cross-Modal Retrieval-Augmented Generation of Large Language Model

Liyang Zhang¹, 〇Youyang Ng¹ (1. Kioxia Corporation)

Keywords: Visual Question Answering, Retrieval-Augmented Generation, Cross-Modal

Visual Question Answering (VQA) is the challenging task of taking an image and a natural language question about it as input and generating an answer as output. In knowledge-based VQA, the image alone is often insufficient to answer the question. To provide reliable responses, AI models must acquire and ingest relevant external knowledge. However, retrieving such knowledge effectively and integrating it rationally remains a challenge. We formulate a VQA approach that applies a cross-modal retrieval-augmented generation mechanism to a modality-aligned large language model (LLM), using a four-module pipeline of Retrieve, Generate, Augment and Select (RAGS). In this approach, we use images as queries to retrieve relevant external knowledge, which is then fed to the modality-aligned LLM to generate answer candidates. In our experiments, we compared various methods for retrieving external knowledge and assessed their effectiveness on the OK-VQA dataset. Our findings indicate that strategically applying relevant knowledge improves performance, outperforming a strong baseline.
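As a rough illustration of the retrieve-then-generate flow described above, the following Python sketch uses an image embedding as the query against a small knowledge store and then picks one answer from several candidates. The abstract does not specify how each RAGS module works, so everything here is an assumption made for illustration: the cosine-similarity retriever, the shared embedding space, the majority-vote selection, and the names `retrieve`, `rags_answer`, `toy_generate` and `KNOWLEDGE` are all hypothetical, and `toy_generate` merely stands in for a modality-aligned LLM.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(image_vec: Vector,
             knowledge: List[Tuple[str, Vector]],
             k: int = 2) -> List[str]:
    """Return the k passages whose embeddings are closest to the image.

    Assumption: image and text embeddings live in one shared space.
    """
    ranked = sorted(knowledge,
                    key=lambda item: cosine(image_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def rags_answer(image_vec: Vector,
                question: str,
                knowledge: List[Tuple[str, Vector]],
                generate: Callable[[str, Optional[str]], str],
                k: int = 2) -> str:
    """Hypothetical retrieve / generate / select loop, not the authors' method."""
    passages = retrieve(image_vec, knowledge, k)
    # One candidate without external knowledge, then one per retrieved passage.
    candidates = [generate(question, None)]
    candidates += [generate(question, p) for p in passages]
    # Assumed selection rule: the most frequent candidate wins.
    return Counter(candidates).most_common(1)[0][0]


# Toy data: hand-written 2-D "embeddings", purely illustrative.
KNOWLEDGE = [
    ("Bananas are rich in potassium.", [1.0, 0.0]),
    ("The Eiffel Tower is in Paris.", [0.0, 1.0]),
    ("Bananas grow in tropical climates.", [0.9, 0.1]),
]


def toy_generate(question: str, passage: Optional[str]) -> str:
    """Stand-in for an LLM call: answers from the passage if one is given."""
    return "banana" if passage and "Banana" in passage else "fruit"


if __name__ == "__main__":
    image_vec = [1.0, 0.05]  # pretend embedding of a banana photo
    print(retrieve(image_vec, KNOWLEDGE))
    print(rags_answer(image_vec, "What is this?", KNOWLEDGE, toy_generate))
```

In a real system the retriever would likely be a pretrained cross-modal encoder and the generator an actual modality-aligned LLM; the sketch only shows how the pieces could connect.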


Presentation information

[2O1-GS-3] Use and Sharing of Knowledge
