5:00 PM - 5:20 PM
[1O5-GS-7-01] Searching optimal caption in learning Text-to-Image model
Keywords: Deep learning, Multimodal, Bilevel optimization
Text-to-Image models require datasets consisting of a huge number of image-caption pairs for training.
Because the captions in such datasets are manually annotated, they are not necessarily optimal for training Text-to-Image models.
In this study, we propose a learning framework that trains Text-to-Image models while optimizing the captions used for training.
Specifically, we introduce a model that outputs pseudo captions from images, and we alternately update the parameters of this captioning model and of the Text-to-Image model through bilevel optimization.
As a preliminary experiment, we evaluate the effectiveness of bilevel optimization for training Text-to-Image models.
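
The alternating update described above can be illustrated with a toy sketch. The abstract does not state the objectives, the architectures, or how the outer gradient is computed, so everything below is an assumption made for illustration: the Text-to-Image model is reduced to a single scalar w, the caption model to a single scalar c (real captions are discrete text), the inner loss is (w - c)^2, the outer loss is (w - target)^2, and the outer gradient uses a one-step lookahead approximation. The function name alternating_bilevel and all of its parameters are hypothetical.

```python
def alternating_bilevel(target, w=0.0, c=0.0, lr_w=0.1, lr_c=0.5, steps=200):
    """Toy alternating bilevel optimization on two scalars.

    Assumptions (not taken from the abstract):
      inner loss  L_in(w, c) = (w - c) ** 2       # w fits the pseudo caption c
      outer loss  L_out(w)   = (w - target) ** 2  # quality of the trained w
    """
    for _ in range(steps):
        # Inner step: one gradient step on w for the current pseudo caption c.
        w = w - lr_w * 2.0 * (w - c)

        # Outer step: look one inner step ahead, then differentiate the
        # outer loss at that lookahead point with respect to c.
        w_look = w - lr_w * 2.0 * (w - c)
        dout_dw = 2.0 * (w_look - target)  # d L_out / d w_look
        dw_dc = 2.0 * lr_w                 # d w_look / d c
        c = c - lr_c * dout_dw * dw_dc

    return w, c


if __name__ == "__main__":
    w, c = alternating_bilevel(target=3.0)
    print(w, c)
```

In this toy setting both scalars converge to the target, because the inner loss pulls w toward c while the outer loss moves c so that the resulting w lands on the target. In a realistic setting the two scalars would be replaced by network parameters and the hand-written derivatives by automatic differentiation.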