JSAI2022

Presentation information

General Session

[2O1-GS-7] Vision, speech media processing: generation

Wed. Jun 15, 2022 9:00 AM - 10:40 AM Room O (Room 510)

Chair: Shuhei Kurita (栗田 修平, RIKEN) [on-site]

9:20 AM - 9:40 AM

[2O1-GS-7-02] Cross-modal Description Generation for Future Events in Daily Tasks

〇Motonari Kambara1, Komei Sugiura1 (1. Keio University)

Keywords: Video captioning, Future captioning, Cross-modal, Relational Self-Attention

In this paper, we aim to generate captions for future events. We propose the Relational Future Captioning Model (RFCM), a cross-modal language generation model for the future captioning task. The RFCM uses a Relational Self-Attention Encoder to extract the relationships between events more effectively than the conventional self-attention in transformers. In comparison experiments, the RFCM outperformed a baseline method on two datasets.
