3:40 PM - 4:00 PM
[3Q5-GS-9-01] Semantic Consistency Assessment of Visual and Text Content using Multimodal Deep Neural Networks
Keywords:Multimodal, Deep Learning, Natural Language Processing, Image Recognition, Cross Attention
Semantic consistency assessment of an image and text inside a document is important task because readers refer the image to deepen understanding of text content. In this study, we develop a multimodal deep neural networks for the semantic consistency assessment of the image and the text. We propose a novel approach combines binary classification and angular margin loss to acquire discriminative features. We also clarify contradictions between the image and the text by visualizing cross-attention among objects inside the image and words in text. To show the effectiveness of our approach, we evaluate the accuracy of several models using flickr30k dataset which contains images and their captions. The results show that our proposed model outperforms the existing joint embedding model with 0.9 improvements in F-measure.
Authentication for paper PDF access
A password is required to view paper PDFs. If you are a registered participant, please log on the site from Participant Log In.
You could view the PDF with entering the PDF viewing password bellow.