9:20 AM - 9:40 AM
[3E1-GS-2-02] A study of Visual Abductive Reasoning
Keywords: Vision & Language, Visual abductive reasoning, image captioning
Humans can reason abductively and hypothetically, inferring nontrivial situations in an image from only a specific part of it, based on experience and knowledge. When we see a person holding a plate full of food, for example, we can assume that the person is likely hungry, even though we do not know the person. Can computers perform this kind of visual reasoning?
This study uses the Sherlock dataset, which provides two captions per instance: (i) specific cues about regions of interest, such as objects and actions in the image, and (ii) information that can be inferred from these cues. We analyze whether nontrivial hypothetical inferences can be generated end-to-end from images using a state-of-the-art image encoder and language model.
We report that a pre-trained vision-language model can generate some hypothetical visual inferences when it is fine-tuned for abductive understanding.
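As a rough illustration of the setup described above, the sketch below fine-tunes a generic pre-trained vision-language model to map an image to an inferential caption. The specific encoder/decoder checkpoints, the training loop, and the variable names are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch (assumed setup, not the paper's exact method): fine-tune a
# pre-trained image encoder + language decoder so that an image is mapped to
# a hypothetical inference caption, as in the Sherlock clue/inference pairs.
import torch
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Illustrative choice of encoder/decoder; the paper's actual models may differ.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def training_step(image, inference_text):
    """One fine-tuning step: image -> abductive inference caption."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    labels = tokenizer(inference_text, return_tensors="pt").input_ids
    outputs = model(pixel_values=pixel_values, labels=labels)  # cross-entropy loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# After fine-tuning, inferences can be generated end-to-end from an image:
#   generated = model.generate(pixel_values, max_length=32)
#   print(tokenizer.decode(generated[0], skip_special_tokens=True))
```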