9:20 AM - 9:40 AM
[3E1-GS-2-02] A study of Visual Abductive Reasoning
Keywords: Vision & Language, Visual abductive reasoning, image captioning
Humans can reason abductively and hypothetically, inferring nontrivial situations in an image from only a specific part of it, based on experience and knowledge. When we see a person holding a plate full of food, for example, we can assume that the person is likely hungry, even though we do not know the person. Can computers perform this kind of visual reasoning?
This study uses the Sherlock dataset, which provides two captions per instance: (i) specific cues about regions of interest, such as objects and actions in the image, and (ii) information that can be inferred from these cues. We analyze whether nontrivial hypothetical inferences can be generated end-to-end from images using a state-of-the-art image encoder and language model.
We report that a pre-trained vision-language model can generate some hypothetical visual inferences when it is fine-tuned for abductive understanding.
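As a rough illustration of the setup described above, the sketch below fine-tunes a generic pre-trained vision-language model to map an image to an inferential caption. The specific encoder/decoder checkpoints, the training loop, and the variable names are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch (assumed setup, not the paper's exact method): fine-tune a
# pre-trained image encoder + language decoder so that an image is mapped to
# a hypothetical inference caption, as in the Sherlock clue/inference pairs.
import torch
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Illustrative choice of encoder/decoder; the paper's actual models may differ.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def training_step(image, inference_text):
    """One fine-tuning step: image -> abductive inference caption."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    labels = tokenizer(inference_text, return_tensors="pt").input_ids
    outputs = model(pixel_values=pixel_values, labels=labels)  # cross-entropy loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# After fine-tuning, inferences can be generated end-to-end from an image:
#   generated = model.generate(pixel_values, max_length=32)
#   print(tokenizer.decode(generated[0], skip_special_tokens=True))
```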