JSAI2024

Presentation information

General Session


[4N3-GS-6] Language media processing

Fri. May 31, 2024 2:00 PM - 3:40 PM Room N (Room 54)

Chair: Ryota Tanaka (NTT Human Informatics Laboratories)

3:20 PM - 3:40 PM

[4N3-GS-6-05] Caption Generation Reflecting User Intent Through a Diffusion Model

〇Satoko Hirano1, Ichiro Kobayashi1 (1. Ochanomizu University)

Keywords: Diffusion Process, Caption Generation

In recent years, generative models based on a diffusion process have achieved state-of-the-art performance on continuous data and have also been actively studied for discrete data generation. This work applies a diffusion language model to image caption generation, treated as a controllable natural language processing task. The aim is to develop an image captioning method that reflects not only the information obtained from the image but also the user's intention, which is estimated from the trajectory the user traces over the image. The user's level of interest in each object is determined from the time spent tracing it, and the objects in the image are described in the order of each user's trace, realizing interactive caption generation. Experiments show that the proposed method can express the user's intention, as estimated from the trace, in a sentence generated non-autoregressively through a diffusion process.
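
To make the trace-based signals in the abstract concrete, the following minimal Python sketch derives a per-object dwell time and a first-visit order from a timestamped trace. It is an illustration only, not the authors' implementation: the names `TracePoint`, `Box`, and `summarize_trace`, the use of axis-aligned bounding boxes, and the rule of attributing each time interval to the box containing its starting point are all assumptions. The diffusion language model that would consume these signals is not shown.

```python
from dataclasses import dataclass


@dataclass
class TracePoint:
    """One sample of the user's trace (hypothetical format)."""
    x: float
    y: float
    t: float  # timestamp in seconds


@dataclass
class Box:
    """Axis-aligned bounding box of an object (hypothetical format)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, p: TracePoint) -> bool:
        return self.x0 <= p.x <= self.x1 and self.y0 <= p.y <= self.y1


def summarize_trace(trace, boxes):
    """Return (label, dwell_seconds) pairs in order of first visit."""
    dwell = {}  # dicts keep insertion order, so keys follow first-visit order
    for prev, cur in zip(trace, trace[1:]):
        for box in boxes:
            if box.contains(prev):
                # Assumption: the interval [prev.t, cur.t] counts toward
                # the first box that contains the interval's starting point.
                dwell[box.label] = dwell.get(box.label, 0.0) + (cur.t - prev.t)
                break
    return list(dwell.items())


# Toy data: the trace lingers briefly on "dog", then longer on "ball".
trace = [
    TracePoint(10, 10, 0.0),
    TracePoint(12, 11, 0.5),
    TracePoint(60, 40, 1.0),
    TracePoint(62, 42, 3.0),
    TracePoint(90, 90, 3.5),
]
boxes = [Box("dog", 0, 0, 30, 30), Box("ball", 50, 30, 80, 60)]
summary = summarize_trace(trace, boxes)
print(summary)
```

On this toy input the summary is `[('dog', 1.0), ('ball', 2.5)]`: the list order gives the order in which objects would be described, and the dwell times serve as a rough proxy for the user's level of interest in each object.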
