JSAI2025

Presentation information

Organized Session

[3P1-OS-46a] OS-46

Thu. May 29, 2025 9:00 AM - 10:40 AM Room P (Room 801-2)

Organizers: Kouji Kozaki (Osaka Electro-Communication University), Takeshi Morita (Aoyama Gakuin University), Mori Kurokawa (KDDI Research), Wataru Hirota (Stockmark)

10:00 AM - 10:20 AM

[3P1-OS-46a-03] Chain-of-Thought based Object-level Multimodal Instruction Tuning Data Generation

〇Julio Vizcarra¹, Yanan Wang¹, Zhi Li¹, Hao Niu¹, Mori Kurokawa¹ (1. KDDI Research)

Keywords: video question answering, video scene graph, instruction data augmentation, MLLM, knowledge graph

Generating multimodal instruction tuning data with pre-trained LLMs (e.g., Llama 3, GPT-4) has become a standard approach for facilitating multimodal model training (e.g., the LLaVA series). However, recent works tend to generate compositional instruction text (e.g., “What did the boy do after he walked towards the woman with a present?”), which makes it hard for the model to align with the visual context at its various levels: low-level features such as object identification and location, and high-level concepts such as spatial relations and action recognition. Inspired by chain-of-thought (CoT), i.e., step-by-step reasoning to solve complex tasks, we propose a methodology called Chain-Of-Tasks (CoTask) to construct a new multimodal instruction tuning dataset. We generate question-answer pairs for a video instruction tuning task (e.g., NExTQA). Our approach can expand the size of the existing dataset by more than 30 times. Furthermore, the generated subtasks cover a diverse range of fine-grained, object-level spatiotemporal reasoning questions, shedding light on ways to improve multimodal model training.
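
The abstract does not describe the generation pipeline, so the following is only a minimal sketch of the general idea of decomposing a compositional question into ordered, object-level subtasks. It assumes a video scene graph is already available; the `SceneObject`, `Relation`, and `Frame` types, the bounding-box format, the question templates, and the `chain_of_tasks` function are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    name: str                        # object label, e.g. "boy"
    box: tuple[int, int, int, int]   # assumed (x1, y1, x2, y2) pixel format


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str                   # e.g. "left of" or "walk towards"
    obj: str
    kind: str                        # "spatial" or "action" (assumed labels)


@dataclass(frozen=True)
class Frame:
    index: int
    objects: tuple[SceneObject, ...]
    relations: tuple[Relation, ...]


def chain_of_tasks(frames):
    """Return (question, answer) pairs, low-level subtasks first.

    Low-level: object identification and location.
    High-level: spatial relations and actions.
    The templates are illustrative placeholders, not the paper's prompts.
    """
    low, high = [], []
    for frame in frames:
        names = ", ".join(o.name for o in frame.objects)
        low.append((f"Which objects appear in frame {frame.index}?", names))
        for o in frame.objects:
            low.append(
                (f"Where is the {o.name} in frame {frame.index}?", str(list(o.box)))
            )
        for r in frame.relations:
            if r.kind == "spatial":
                question = (
                    f"What is the spatial relation between the {r.subject} "
                    f"and the {r.obj} in frame {frame.index}?"
                )
            else:
                question = (
                    f"What action does the {r.subject} perform with respect "
                    f"to the {r.obj} in frame {frame.index}?"
                )
            high.append((question, r.predicate))
    return low + high


# Toy scene graph for a single frame (made-up values, for illustration only).
frames = [
    Frame(
        index=0,
        objects=(
            SceneObject("boy", (10, 20, 60, 120)),
            SceneObject("woman", (200, 15, 260, 130)),
        ),
        relations=(
            Relation("boy", "walk towards", "woman", "action"),
            Relation("boy", "left of", "woman", "spatial"),
        ),
    )
]

pairs = chain_of_tasks(frames)
for question, answer in pairs:
    print(question, "->", answer)
```

In this toy setup a single annotated frame already yields several subtask questions, which gives a rough intuition for how decomposing one compositional question per video into object-level subtasks can multiply the size of a dataset.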
