3:00 PM - 3:20 PM
[1H4-OS-17a-03] Multimodal Identification of Cartoons with Vision Transformer and BERT
Keywords: Vision Transformer, Bidirectional Encoder Representations from Transformers, Understanding of Creation, Comic Computing, Multimodal
Against the background of advances in deep learning, the understanding and generation of creative works by computers has been actively studied. However, understanding and generating creative works are intellectual tasks that remain difficult for computers. In this study, we focus on comics among creative works. Comics are a typical multimodal creative medium and have recently attracted attention as a source of multimodal data. Because comics consist of pictures and text, comic computing involves both image processing and natural language processing. Many studies in this field apply image processing or natural language models separately, but few use images and natural language together in a multimodal way. In this study, we address the task of work identification using distributed representations of both images and text. We use Manga109 as the comic dataset, Vision Transformer (ViT) for the distributed representation of images, and BERT (Bidirectional Encoder Representations from Transformers) for the distributed representation of natural language.
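A minimal sketch of the multimodal identification idea described above: fuse an image embedding and a text embedding by concatenation, then identify the work by nearest centroid in the fused space. The abstract does not specify the fusion or classification method, so both are assumptions here; the random vectors below are hypothetical stand-ins for embeddings that, in the paper's setting, would come from ViT (images) and BERT (text).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding sizes; real ViT/BERT embeddings are much larger
# (e.g. 768-dimensional), but the fusion logic is the same.
DIM_IMG, DIM_TXT = 8, 8

def fuse(img_vec, txt_vec):
    """Late fusion: concatenate image and text distributed representations."""
    return np.concatenate([img_vec, txt_vec])

# Simulated works: each work gets a characteristic fused centroid.
# In practice these would be averaged embeddings of panels from Manga109.
works = ["WorkA", "WorkB"]
centroids = {
    w: fuse(rng.normal(i, 0.1, DIM_IMG), rng.normal(i, 0.1, DIM_TXT))
    for i, w in enumerate(works)
}

def identify(img_vec, txt_vec):
    """Identify the work whose centroid is nearest in the fused space."""
    q = fuse(img_vec, txt_vec)
    return min(works, key=lambda w: np.linalg.norm(q - centroids[w]))

# A query panel drawn near WorkB's distribution is identified as WorkB.
query = identify(rng.normal(1, 0.1, DIM_IMG), rng.normal(1, 0.1, DIM_TXT))
print(query)
```

Concatenation is only one fusion strategy; weighted sums or cross-modal attention are common alternatives, and a learned classifier would typically replace the nearest-centroid rule.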