The 39th Annual Conference of the Japanese Society for Artificial Intelligence (JSAI 2025)

Presentation Information

Organized Session

[3P1-OS-46a] Generative AI and Knowledge Graphs

Thursday, May 29, 2025, 09:00 - 10:40, Room P (Conference Room 801-2)

Organizers: Kouji Kozaki (Osaka Electro-Communication University), Takeshi Morita (Aoyama Gakuin University), Mori Kurokawa (KDDI Research), Wataru Hirota (Stockmark)

10:00 - 10:20

[3P1-OS-46a-03] Chain-of-Thought based Object-level Multimodal Instruction Tuning Data Generation

○Julio Vizcarra, Yanan Wang, Zhi Li, Hao Niu, Mori Kurokawa (KDDI Research)

Keywords: video question answering, video scene graph, instruction data augmentation, MLLM, knowledge graph

Generating multimodal instruction tuning data with pre-trained LLMs (e.g., Llama 3, GPT-4) has become a standard approach for facilitating multimodal model training (e.g., the LLaVA series). However, recent works tend to generate compositional instruction text (e.g., “What did the boy do after he walked towards the woman with a present?”), which makes models struggle to align with the visual context at multiple levels: low-level features such as object identification and location, and high-level concepts such as spatial relations and action recognition. Inspired by the concept of chain-of-thought (CoT), i.e., step-by-step reasoning to solve complex tasks, in this paper we propose a methodology called Chain-Of-Tasks (CoTask) to construct a new multimodal instruction tuning dataset. Our work generates question-answer pairs for a video instruction tuning task (e.g., NExTQA). Our approach can expand an existing dataset more than 30-fold. Furthermore, the generated subtasks cover a diverse range of fine-grained, object-level spatiotemporal reasoning questions, shedding light on ways to improve multimodal model training.
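The abstract describes CoTask only at a high level. The following is a minimal sketch of the general idea, assuming scene-graph triples are available as grounded answers; all names (SubTask, decompose), the template questions, and the two-level decomposition are illustrative assumptions, not the authors' actual implementation, which would presumably prompt an LLM rather than fill string templates.

```python
# Illustrative sketch of CoTask-style subtask generation (hypothetical;
# a real pipeline would prompt an LLM such as Llama 3 or GPT-4 and
# ground answers in annotated video scene graphs).
from dataclasses import dataclass

@dataclass
class SubTask:
    level: str      # reasoning level this subtask probes
    question: str
    answer: str

def decompose(scene_graph: list[tuple[str, str, str]]) -> list[SubTask]:
    """Expand one compositional QA pair into a chain of object-level
    subtasks, from low-level perception (does the object appear?) up
    to higher-level spatio-temporal reasoning (what is it doing?)."""
    subtasks = []
    for subj, relation, obj in scene_graph:
        subtasks.append(SubTask("object identification",
                                f"Is there a {subj} in the video?",
                                "yes"))
        subtasks.append(SubTask("relation / action",
                                f"What does the {subj} do with respect to the {obj}?",
                                relation))
    return subtasks

if __name__ == "__main__":
    # Hypothetical triples for the example question in the abstract.
    triples = [("boy", "walks towards", "woman"),
               ("woman", "holds", "present")]
    for t in decompose(triples):
        print(f"[{t.level}] Q: {t.question} A: {t.answer}")
```

Each original QA pair fans out into several grounded subtasks, which is consistent with the reported 30-fold expansion of the source dataset.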
