Multimodal Identification of Cartoons with Vision Transformer and BERT

Naoto Aoki

3:00 PM - 3:20 PM

[1H4-OS-17a-03] Multimodal Identification of Cartoons with Vision Transformer and BERT

〇Naoto Aoki¹, Naoki Mori¹, Makoto Okada¹ (1. Osaka Prefecture University)

Keywords: Vision Transformer, Bidirectional Encoder Representations from Transformers, Understanding of Creation, Comic Computing, Multimodal

With the development of deep learning, research on the understanding and generation of creative works by computers has been actively pursued. However, understanding and generating creative works are highly intellectual tasks and remain difficult for computers. In this study, we focus on comics among creative works. Comics are typical multimodal creations and have recently attracted attention as multimodal data. Since comics are composed of pictures and text, comic engineering involves both image processing and natural language processing. Many studies in this field use either image processing or natural language models, but few use images and natural language together in a multimodal way. In this study, we address the problem of work identification, that is, determining which work a given piece of comic data comes from, using distributed representations of both images and text. We used Manga109 as the comic dataset, Vision Transformer (ViT) for the distributed representation of images, and BERT (Bidirectional Encoder Representations from Transformers) for the distributed representation of natural language.
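
As a rough illustration of how image and text representations might be combined for work identification, the following Python sketch fuses one image vector and one text vector per sample and assigns the sample to the nearest work. The abstract does not state the fusion method or the classifier, so concatenation fusion and a nearest-centroid classifier are assumptions chosen for simplicity. The short toy vectors stand in for real ViT and BERT outputs, and the names fuse, fit_centroids, and identify are hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length so neither modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(image_vec, text_vec):
    """Assumed fusion: concatenate the normalized image and text vectors."""
    return l2_normalize(image_vec) + l2_normalize(text_vec)


def fit_centroids(samples):
    """Average the fused vectors of each work.

    samples: list of (image_vec, text_vec, work_title) tuples.
    Returns a dict mapping work_title to its centroid vector.
    """
    sums = {}
    counts = defaultdict(int)
    for image_vec, text_vec, title in samples:
        fused = fuse(image_vec, text_vec)
        if title not in sums:
            sums[title] = [0.0] * len(fused)
        sums[title] = [s + f for s, f in zip(sums[title], fused)]
        counts[title] += 1
    return {t: [s / counts[t] for s in v] for t, v in sums.items()}


def identify(centroids, image_vec, text_vec):
    """Return the work whose centroid is closest to the fused query vector."""
    fused = fuse(image_vec, text_vec)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(fused, centroid))

    return min(centroids, key=lambda title: squared_distance(centroids[title]))


# Toy data: 3-dimensional "image" vectors and 2-dimensional "text" vectors.
# Real ViT and BERT embeddings are much higher-dimensional.
train = [
    ([1.0, 0.0, 0.1], [0.9, 0.1], "work_A"),
    ([0.9, 0.1, 0.0], [1.0, 0.0], "work_A"),
    ([0.0, 1.0, 0.1], [0.1, 0.9], "work_B"),
    ([0.1, 0.9, 0.0], [0.0, 1.0], "work_B"),
]
centroids = fit_centroids(train)

# The image vector is equally close to both works; the text decides.
print(identify(centroids, [0.5, 0.5, 0.0], [0.9, 0.1]))
```

In the last call the image vector alone cannot separate the two works, so the text vector settles the decision. This is the kind of case where using both modalities could help over using either one alone.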

Presentation information

[1H4-OS-17a] The Future of Creative Work Made by Creators and Artificial Intelligence (1/2)
