JSAI2023

Presentation information

Organized Session

[2G5-OS-21e] World Models and Intelligence

Wed. Jun 7, 2023 3:30 PM - 5:10 PM Room G (A4)

Organizers: 鈴木 雅大, 岩澤 有祐, 河野 慎, 熊谷 亘, 松嶋 達也, 森 友亮, 松尾 豊

4:10 PM - 4:30 PM

[2G5-OS-21e-03] Generating Captions with Multi-level Multimodal Encoder on Image Captioning with Reading Comprehension Tasks

〇Wei Yang1, Arisa Ueda1, Komei Sugiura1 (1. Keio University)

Keywords: Multimodal image captioning, Text-based image captioning, Multimodal attention

Visual scene understanding, of which image captioning is one example, is an essential topic in artificial intelligence (AI). Image captioning with reading comprehension extends traditional image captioning and is more challenging: the generated caption must relate to the text that appears in the image, so the model has to read and comprehend that text in the context of the image. In this work, we propose multiple image-related attention blocks that use multimodal Optical Character Recognition (OCR) information to model the relationships among the global image, the multi-level recognized text, and the objects detected in the image. Our model is validated on the standard TextCaps dataset, and the results show that it outperforms the baseline methods on all evaluation metrics.
