JSAI2022

Presentation information

Organized Session

[1H4-OS-17a] The Future of Creative Work Made by Creators and Artificial Intelligence (1/2)

Tue. Jun 14, 2022 2:20 PM - 4:00 PM Room H

Organizers: Miki Ueno (Osaka Institute of Technology), Naoki Mori (Osaka Prefecture University) [on-site], Taichi Hatanaka (Creators in Pack)

3:00 PM - 3:20 PM

[1H4-OS-17a-03] Multimodal Identification of Cartoons with Vision Transformer and BERT

〇Naoto Aoki, Naoki Mori, Makoto Okada (Osaka Prefecture University)

Keywords: Vision Transformer, Bidirectional Encoder Representations from Transformers, Understanding of Creation, Comic Computing, Multimodal

With the development of deep learning, the understanding and generation of creative works by computers has become an active research topic. However, understanding and generating creative works are intellectually demanding tasks that remain difficult for computers. In this study, we focus on comics. Comics are typical multimodal creative works and have recently attracted attention as multimodal data. Because comics are composed of pictures and text, comic engineering involves both image processing and natural language processing. Many studies in this field apply image processing or natural language models, but few use images and natural language together in a multimodal way. In this study, we address the problem of work identification using distributed representations of both images and text. We used Manga109 as the comic dataset, Vision Transformer (ViT) for the distributed representation of images, and BERT (Bidirectional Encoder Representations from Transformers) for the distributed representation of natural language.
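
The abstract does not say how the image and text representations are combined or which classifier is used, so the following is only a minimal sketch under stated assumptions. It assumes that work identification means predicting which title a sample comes from, that the two vectors are fused by normalizing and concatenating them, and that a nearest-centroid rule makes the prediction. The small hand-written vectors stand in for ViT and BERT embeddings, and the function names and the labels `work_A` and `work_B` are hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(image_vec, text_vec):
    """Concatenate the normalized image and text vectors (assumed fusion rule)."""
    return l2_normalize(image_vec) + l2_normalize(text_vec)


def fit_centroids(samples):
    """Map each title to the mean fused vector of its (image, text, title) samples."""
    grouped = defaultdict(list)
    for image_vec, text_vec, title in samples:
        grouped[title].append(fuse(image_vec, text_vec))
    return {
        title: [sum(col) / len(vecs) for col in zip(*vecs)]
        for title, vecs in grouped.items()
    }


def identify(centroids, image_vec, text_vec):
    """Return the title whose centroid is most cosine-similar to the query."""
    query = l2_normalize(fuse(image_vec, text_vec))

    def score(title):
        centroid = l2_normalize(centroids[title])
        return sum(q * c for q, c in zip(query, centroid))

    return max(centroids, key=score)


# Toy stand-ins for image (ViT) and text (BERT) embeddings; real ones are far larger.
train = [
    ([1.0, 0.1], [0.9, 0.0, 0.1], "work_A"),
    ([0.9, 0.2], [1.0, 0.1, 0.0], "work_A"),
    ([0.1, 1.0], [0.0, 0.2, 0.9], "work_B"),
    ([0.2, 0.8], [0.1, 0.0, 1.0], "work_B"),
]
centroids = fit_centroids(train)
prediction = identify(centroids, [0.95, 0.15], [0.8, 0.1, 0.1])
```

Normalizing each modality before concatenation keeps either one from dominating the similarity score only because its vectors happen to have larger magnitudes. In practice the toy vectors would be replaced by embeddings extracted from the actual models.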
