JSAI2023

Presentation information

General Session

[1O5-GS-7] Vision, speech media processing

Tue. Jun 6, 2023 5:00 PM - 7:00 PM Room O (E1+E2)

Chair: Shigeru Maya (真矢 滋, Toshiba) [On-site]

5:00 PM - 5:20 PM

[1O5-GS-7-01] Searching optimal caption in learning Text-to-Image model

〇Jumpei Nakao1, Masaru Isonuma1,2, Junichiro Mori1,3, Ichiro Sakata1 (1. The University of Tokyo, 2. The University of Edinburgh, 3. RIKEN AIP)

Keywords: Deep learning, Multimodal, Bilevel optimization

Text-to-Image models require training datasets consisting of a huge number of image-caption pairs.
Because the captions in such datasets are manually annotated, they are not necessarily optimal for training Text-to-Image models.
In this study, we propose a learning framework that trains a Text-to-Image model while optimizing the captions used for training.
Specifically, we introduce a captioning model that outputs pseudo captions from images, and we alternately update the parameters of this captioning model and of the Text-to-Image model through bilevel optimization.
As a preliminary experiment, we evaluate the effectiveness of bilevel optimization for training Text-to-Image models.
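
The alternating update described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the outer objective or how the gradient with respect to the captioning model is computed. The sketch below assumes scalar "models" (a hypothetical captioner c = phi * x and a hypothetical generator x_hat = theta * c), an outer loss measured on held-out pairs with reference captions c = 2 * x, and a one-step unrolled hypergradient. All names and values are illustrative.

```python
def mean_sq(xs):
    """Mean of squared values; appears in both closed-form losses."""
    return sum(x * x for x in xs) / len(xs)


def run_bilevel(steps=2000, inner_lr=0.1, outer_lr=0.05):
    # Toy data (assumed): "images" are scalars.
    x_train = [0.5, 1.0, 1.5]   # images that receive pseudo captions
    x_val = [0.8, 1.2]          # held-out images with reference captions 2 * x
    m_train = mean_sq(x_train)
    m_val = mean_sq(x_val)

    theta = 1.0  # generator parameter:  x_hat = theta * caption
    phi = 1.0    # captioner parameter:  caption = phi * x

    def inner_grad(theta, phi):
        # d/dtheta of L_train = mean((theta * phi * x - x)^2)
        return 2.0 * m_train * phi * (theta * phi - 1.0)

    for _ in range(steps):
        # Inner step: update the generator on the current pseudo captions.
        theta = theta - inner_lr * inner_grad(theta, phi)

        # Outer step: one-step lookahead of theta as a function of phi.
        theta_look = theta - inner_lr * inner_grad(theta, phi)
        # d(theta_look)/d(phi), differentiating the lookahead step by hand.
        dlook_dphi = -inner_lr * 2.0 * m_train * (2.0 * theta * phi - 1.0)
        # d/dtheta of L_val = mean((theta * 2x - x)^2), evaluated at theta_look.
        dval_dtheta = 4.0 * m_val * (2.0 * theta_look - 1.0)
        # Chain rule gives the approximate hypergradient for the captioner.
        phi = phi - outer_lr * dval_dtheta * dlook_dphi

    return theta, phi


if __name__ == "__main__":
    theta, phi = run_bilevel()
    print(f"theta={theta:.3f}, phi={phi:.3f}")
```

In this toy setting the outer loss is minimized when the generator inverts the reference captions (theta = 0.5), which the inner problem can only reach if the captioner learns phi = 2, so the alternating updates should drive both parameters toward those values. A real Text-to-Image setup would replace the hand-derived gradients with automatic differentiation.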
