2:20 PM - 2:40 PM
[1Q3-GS-11-04] Cross-modal BERT: Acquisition of Multimodal Representation and Cross-modal Prediction based on Self-Attention
Keywords: Multimodal Information Processing, Self-Attention, Communication, Symbol Emergence in Robotics, Natural Language Processing
Humans can abstract rich representations from multi-modal information and use them in daily tasks. For instance, object concepts are represented by a combination of vision, sound, touch, language, and so on. In communication between humans, a speaker expresses information observed through their own sensory organs as language, while the listener infers the speaker's sensations from that language using their own knowledge. Communication agents therefore have to acquire knowledge that is bidirectionally predictable from multi-modal information. We propose a bidirectionally predictive model between images and language based on BERT, which employs a hierarchical self-attention structure. The proposed cross-modal BERT was evaluated on a cross-modal prediction task and a multi-modal categorization task. Experimental results showed that the cross-modal BERT acquired rich multi-modal representations and performed cross-modal prediction in both directions. In the category estimation task, the proposed model also achieved higher performance using multi-modal information than using a single modality.
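
As a rough sketch of this kind of architecture (not the authors' implementation: their model is hierarchical, whereas the single flat transformer encoder below, and all class names and dimensions, are illustrative assumptions), the following PyTorch snippet embeds image-region features and word tokens into one shared sequence, runs joint self-attention over it, and attaches a prediction head per modality so masked elements of one modality can be inferred from the other:

    import torch
    import torch.nn as nn

    class CrossModalBERTSketch(nn.Module):
        """Hypothetical BERT-style joint encoder over image regions and words."""
        def __init__(self, vocab_size=1000, img_feat_dim=512, d_model=256,
                     n_heads=4, n_layers=4, max_len=64):
            super().__init__()
            self.tok_embed = nn.Embedding(vocab_size, d_model)   # word tokens
            self.img_proj = nn.Linear(img_feat_dim, d_model)     # image-region features
            self.pos_embed = nn.Embedding(max_len, d_model)      # positions in joint sequence
            self.seg_embed = nn.Embedding(2, d_model)            # segment: 0 = image, 1 = text
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)  # joint self-attention
            self.lm_head = nn.Linear(d_model, vocab_size)        # predict masked words
            self.img_head = nn.Linear(d_model, img_feat_dim)     # reconstruct masked regions

        def forward(self, img_feats, token_ids):
            # img_feats: (B, N_img, img_feat_dim); token_ids: (B, N_txt)
            img = self.img_proj(img_feats)
            txt = self.tok_embed(token_ids)
            x = torch.cat([img, txt], dim=1)                     # one cross-modal sequence
            pos = torch.arange(x.size(1), device=x.device)
            seg = torch.cat([
                torch.zeros(img.size(1), dtype=torch.long, device=x.device),
                torch.ones(txt.size(1), dtype=torch.long, device=x.device)])
            x = x + self.pos_embed(pos) + self.seg_embed(seg)
            h = self.encoder(x)                                  # both modalities attend to each other
            n_img = img.size(1)
            return self.img_head(h[:, :n_img]), self.lm_head(h[:, n_img:])

    model = CrossModalBERTSketch()
    img_feats = torch.randn(2, 10, 512)           # 2 samples, 10 image regions
    token_ids = torch.randint(0, 1000, (2, 12))   # 2 samples, 12 word tokens
    img_pred, word_logits = model(img_feats, token_ids)
    # img_pred: (2, 10, 512); word_logits: (2, 12, 1000)

Training such a model would mask parts of the input (e.g., zero out some regions or replace some tokens with a mask id) and penalize the corresponding head's predictions; at inference, masking the whole of one modality yields cross-modal prediction in either direction, as the abstract describes.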