JSAI2024

Presentation information

General Session

General Session » GS-7 Vision, speech media processing

[2C1-GS-7] Language media processing:

Wed. May 29, 2024 9:00 AM - 10:40 AM Room C (Temporary room 1)

Chair: Naoki Nishizawa (Toshiba Corporation)

10:00 AM - 10:20 AM

[2C1-GS-7-04] Diffusion Model Based on Text and Predicate Logic

〇Kota Sueyoshi1, Takashi Matsubara1 (1. Graduate School of Engineering Science, Osaka University)

Keywords: Diffusion Model, image generation

Diffusion models have achieved remarkable results in generating high-quality, diverse, and creative images. However, in text-based image generation they often fail to capture the intended meaning of the text. For instance, a desired object may not be generated, an extraneous object may appear, or an adjective may alter objects it was not intended to modify. Moreover, we observed that these models frequently miss relationships indicating possession between objects. In this paper, we introduce Predicated Diffusion, a unified framework for capturing users' intentions more accurately. The proposed method does not rely solely on the text encoder; instead, it represents the intended meaning of the text as propositions in predicate logic and treats the pixels in the attention maps as fuzzy predicates. This yields a differentiable loss function whose minimization guides image generation to align with the propositions in the text. Predicated Diffusion has demonstrated the ability to generate images that are more faithful to various text prompts, as verified by human evaluators and pretrained image-text models.
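
To make the idea of "attention maps as fuzzy predicates" concrete, the sketch below shows how propositional losses of this kind could be written in PyTorch. It is a minimal illustration only: the product-based fuzzy semantics, the Reichenbach implication, and all function and variable names are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch: fuzzy-logic losses over cross-attention maps,
# in the spirit of proposition-based guidance described in the abstract.
import torch

def exists_loss(attn_a: torch.Tensor) -> torch.Tensor:
    """Loss for the proposition 'an object A exists in the image'.

    attn_a: attention map for the token of object A, values in [0, 1],
            shape (H, W). Treating each pixel as the fuzzy truth value of
            'this pixel belongs to A', the proposition is a fuzzy OR
            (probabilistic sum) over pixels; we minimize its negative log.
    """
    none_true = torch.prod(1.0 - attn_a)        # fuzzy 'no pixel is A'
    return -torch.log(1.0 - none_true + 1e-8)   # encourage 'some pixel is A'

def implication_loss(attn_a: torch.Tensor, attn_b: torch.Tensor) -> torch.Tensor:
    """Loss for the proposition 'every pixel of A is also B' (e.g. 'A is red').

    Uses the Reichenbach fuzzy implication a -> b = 1 - a + a*b per pixel,
    aggregated with a product fuzzy AND, i.e. a sum of negative logs.
    """
    implied = 1.0 - attn_a + attn_a * attn_b
    return -torch.log(implied + 1e-8).sum()
```

In such a scheme, losses like these would be evaluated at each denoising step on the cross-attention maps between the image latent and the text tokens, and their gradients used to nudge the intermediate latent so that the generated image better satisfies the stated propositions.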
