JSAI2024

Presentation information


General Session » GS-2 Machine learning

[3D1-GS-2] Machine learning: Image recognition

Thu. May 30, 2024 9:00 AM - 10:40 AM Room D (Temporary room 2)

Chair: Sekitoshi Kanai (Nippon Telegraph and Telephone Corporation)

10:20 AM - 10:40 AM

[3D1-GS-2-05] SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces

〇Yuta Oshima1, Shohei Taniguchi1, Masahiro Suzuki1, Yutaka Matsuo1 (1. Graduate School, The University of Tokyo)

Keywords:Video Generation, Diffusion Models, State Space Models

Recent video diffusion models have utilized attention layers to extract temporal features. However, attention layers are limited by their memory consumption, which increases quadratically with sequence length. This limitation presents challenges when attempting to generate longer video sequences. To overcome this challenge, we propose leveraging state-space models (SSMs). SSMs have recently gained attention as viable alternatives due to their linear memory consumption relative to sequence length. In the experiments, we first evaluate our SSM-based model on the UCF101 dataset. In this scenario, our approach outperforms attention-based models in terms of Fréchet Video Distance (FVD). In addition, to investigate the potential of SSMs for longer video generation, we perform an experiment using the MineRL Navigate dataset. In this setting, our SSM-based model reduces memory consumption for longer sequences while maintaining competitive FVD scores.
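To illustrate the core idea described in the abstract, the following is a minimal sketch (not the authors' code) of a diagonal state-space layer scanned along the frame axis of video latents, in place of temporal self-attention. All class, parameter, and variable names here are illustrative assumptions; the actual architecture in the paper may differ. The point of the sketch is the memory argument: the recurrent scan keeps only a fixed-size state per step, so memory grows linearly with the number of frames, whereas temporal attention materializes a T x T matrix.

```python
# Hypothetical sketch of an SSM temporal layer (assumed names, not the paper's code).
import torch
import torch.nn as nn


class TemporalSSM(nn.Module):
    """Per-channel diagonal SSM scanned over the frame dimension.

    Memory grows linearly with the number of frames T, unlike temporal
    self-attention, whose T x T attention matrix grows quadratically.
    """

    def __init__(self, channels: int, state_dim: int = 16):
        super().__init__()
        # Diagonal state transition, kept stable via a negative softplus.
        self.log_a = nn.Parameter(torch.randn(channels, state_dim))
        self.b = nn.Parameter(torch.randn(channels, state_dim) * 0.1)
        self.c = nn.Parameter(torch.randn(channels, state_dim) * 0.1)
        self.d = nn.Parameter(torch.ones(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels) -- video latents flattened over space.
        B, T, C = x.shape
        a = -nn.functional.softplus(self.log_a)        # (C, N), negative values
        a_bar = torch.exp(a)                           # discretized decay in (0, 1)
        h = x.new_zeros(B, C, self.b.shape[1])         # recurrent state: (B, C, N)
        ys = []
        for t in range(T):                             # linear-time scan over frames
            u = x[:, t, :]                             # (B, C)
            h = a_bar * h + self.b * u.unsqueeze(-1)   # state update
            y = (h * self.c).sum(-1) + self.d * u      # readout plus skip connection
            ys.append(y)
        return torch.stack(ys, dim=1)                  # (B, T, C)


if __name__ == "__main__":
    layer = TemporalSSM(channels=64)
    latents = torch.randn(2, 32, 64)   # 2 videos, 32 frames, 64 latent channels
    out = layer(latents)
    print(out.shape)                   # torch.Size([2, 32, 64])
```

In practice such a layer would sit where a temporal attention block sits in a video diffusion U-Net, processing each spatial location's sequence of frame features; efficient SSM implementations replace the Python loop with a parallel scan or convolutional kernel, but the per-step state size, and hence the linear memory scaling, is the same.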
