10:20 AM - 10:40 AM
[3D1-GS-2-05] SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces
Keywords: Video Generation, Diffusion Models, State Space Models
Recent video diffusion models have used attention layers to extract temporal features. However, attention layers are limited by their memory consumption, which grows quadratically with sequence length, making it difficult to generate longer video sequences. To overcome this challenge, we propose leveraging state-space models (SSMs), which have recently gained attention as viable alternatives because their memory consumption grows only linearly with sequence length. In our experiments, we first evaluate our SSM-based model on UCF101, where it outperforms attention-based models in terms of Fréchet Video Distance (FVD). In addition, to investigate the potential of SSMs for longer video generation, we run an experiment on the MineRL Navigate dataset. In this setting, our SSM-based model reduces memory consumption for longer sequences while maintaining competitive FVD scores.
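To illustrate why SSM memory scales linearly, here is a minimal sketch of a linear state-space recurrence processed as a sequential scan. This is an illustrative toy, not the authors' model: the matrices `A`, `B`, `C`, the dimensions, and the function `ssm_scan` are all hypothetical, chosen only to show that the recurrent state is a fixed-size vector independent of sequence length, in contrast to attention's L×L score matrix.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Sequential scan of a linear state-space model (illustrative sketch).

    h_t = A @ h_{t-1} + B @ x_t
    y_t = C @ h_t

    Only the fixed-size state h is carried across time steps, so the
    recurrent memory is O(d_state) regardless of sequence length L,
    unlike attention's O(L^2) score matrix.
    """
    L, _ = x.shape
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = np.empty((L, C.shape[0]))
    for t in range(L):
        h = A @ h + B @ x[t]   # update the hidden state
        ys[t] = C @ h          # read out the output at step t
    return ys

# Toy dimensions for demonstration only.
rng = np.random.default_rng(0)
L, d_in, d_state, d_out = 16, 4, 8, 4
A = 0.9 * np.eye(d_state)                       # stable state transition
B = rng.standard_normal((d_state, d_in)) * 0.1  # input projection
C = rng.standard_normal((d_out, d_state)) * 0.1 # output projection
x = rng.standard_normal((L, d_in))
y = ssm_scan(x, A, B, C)
print(y.shape)  # (16, 4)
```

Because the model is linear in its input, the scan also satisfies homogeneity (scaling the input scales the output), which is one way to sanity-check such an implementation.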