10:00 AM - 10:20 AM
[3P1-OS-46a-03] Chain-of-Thought based Object-level Multimodal Instruction Tuning Data Generation
Keywords: video question answering, video scene graph, instruction data augmentation, MLLM, knowledge graph
Generating multimodal instruction tuning data with pre-trained LLMs (e.g., Llama 3, GPT-4) has become a standard approach for facilitating multimodal model training (e.g., the LLaVA series). However, recent works tend to generate compositional instruction text (e.g., “What did the boy do after he walked towards the woman with a present?”), which makes it hard for the model to align with the visual context at its various levels (e.g., low-level features such as object identification and location, and high-level concepts such as spatial relations and action recognition). Inspired by chain-of-thought (CoT), i.e., step-by-step reasoning to solve complex tasks, in this paper we propose a methodology called Chain-Of-Tasks (CoTask) to construct a new multimodal instruction tuning dataset. Our work generates question-answer pairs for a video instruction tuning task (e.g., NExTQA). Our approach can expand the size of the existing dataset by more than 30 times. Furthermore, the generated subtasks cover a diverse range of fine-grained, object-level spatiotemporal reasoning questions, shedding light on ways to improve multimodal model training.
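As a rough illustration of the general idea of decomposing one compositional question into object-level subtasks, the following minimal Python sketch derives simple question-answer pairs from scene-graph triples. It is not the CoTask pipeline itself: the `Triple` structure, the `generate_subtasks` function, the question templates, and the toy data are all hypothetical and are used here only to show how low-level (object identification) and higher-level (spatial relation, action, temporal order) questions could be produced from one annotated clip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One hypothetical scene-graph edge observed in a video frame."""
    frame: int
    subject: str
    predicate: str
    obj: str
    kind: str  # "spatial" or "action" (assumed labels)


def generate_subtasks(triples):
    """Turn scene-graph triples into simple object-level QA pairs."""
    qa = []

    # Low-level subtask: which objects are visible in each frame.
    for frame in sorted({t.frame for t in triples}):
        objects = sorted(
            {name for t in triples if t.frame == frame
             for name in (t.subject, t.obj)}
        )
        qa.append((f"Which objects appear in frame {frame}?",
                   ", ".join(objects)))

    # Higher-level subtasks: one question per spatial relation or action.
    for t in sorted(triples, key=lambda t: t.frame):
        if t.kind == "spatial":
            question = (f"In frame {t.frame}, where is the {t.subject} "
                        f"relative to the {t.obj}?")
        else:
            question = (f"In frame {t.frame}, what is the {t.subject} "
                        f"doing with respect to the {t.obj}?")
        qa.append((question, t.predicate))

    # Temporal subtask: order consecutive actions of the same subject.
    actions = sorted((t for t in triples if t.kind == "action"),
                     key=lambda t: t.frame)
    for earlier, later in zip(actions, actions[1:]):
        if earlier.subject == later.subject and earlier.frame < later.frame:
            qa.append((
                f"What did the {earlier.subject} do after "
                f"{earlier.predicate} the {earlier.obj}?",
                f"{later.predicate} the {later.obj}",
            ))
    return qa


# Toy annotations (invented for this example).
demo = [
    Triple(1, "boy", "walking towards", "woman", "action"),
    Triple(1, "boy", "left of", "woman", "spatial"),
    Triple(2, "boy", "hugging", "woman", "action"),
]
qa_pairs = generate_subtasks(demo)
for question, answer in qa_pairs:
    print(question, "->", answer)
```

In this toy case three triples yield six question-answer pairs, which gives a feel for how subtask generation can multiply the size of a dataset; the actual expansion factor reported in the abstract depends on the authors' own annotations and templates.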