3:00 PM - 3:20 PM
[4C3-GS-11-04] Multilingual Visual-Textual Entailment Benchmark with Diverse Linguistic Phenomena
Keywords: Multilingual Multimodal Inference, Natural Language Inference, Multimodal Model
Multimodal models, which integrate and reason over information from multiple modalities such as images and text, have been proposed recently and achieve high performance on a variety of multimodal reasoning tasks. In this study, we focus on one such task, Visual-Textual Entailment (VTE), which predicts the entailment relationship between an image and a sentence. VTE is well suited to measuring a model's multimodal reasoning ability because solving it requires understanding the information in the image and the meaning of the sentence, then combining the two for inference. However, the extent to which multimodal models capture linguistic phenomena such as quantity and negation, and their reasoning ability in languages other than English, have yet to be thoroughly evaluated. We therefore propose two multilingual VTE benchmarks focusing on linguistic phenomena and evaluate two multimodal models on them. The results showed that the models struggle with reasoning in Japanese compared to other languages and that there is room for improvement in their understanding of quantity and negation.
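To illustrate the task format, the following is a minimal Python sketch of scoring image-hypothesis pairs that probe quantity and negation. It uses the publicly available CLIP model openai/clip-vit-base-patch32 from the transformers library, which is an assumption for illustration, not one of the models evaluated in the paper; the image path and hypothesis sentences are likewise hypothetical. Note that CLIP yields image-text similarity rather than a true three-way entailment decision, so this only approximates the VTE setup described above.

    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    # Publicly available vision-language model (illustrative; not a model from the paper).
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Hypothetical premise image and hypothesis sentences probing quantity and negation.
    image = Image.open("premise.jpg")
    hypotheses = [
        "There are two dogs in the picture.",  # quantity
        "There are no dogs in the picture.",   # negation
    ]

    inputs = processor(text=hypotheses, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        # Similarity of the image to each hypothesis sentence.
        logits = model(**inputs).logits_per_image
    probs = logits.softmax(dim=-1)

    for sentence, p in zip(hypotheses, probs[0].tolist()):
        print(f"{p:.3f}  {sentence}")

A benchmark such as the one proposed here would instead compare the model's predicted relation (entailment, neutral, or contradiction) against gold labels, per language and per linguistic phenomenon.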