3:20 PM - 3:40 PM
[1G4-OS-21a-02] Construction and Validation of Action-Conditioned VideoGPT
Keywords: World Models, Conditioned Video Prediction
World models learn the structure of the external world from observations and can predict how its future states change with the agent's actions. Recent advances in generative and language models have driven multi-modal world models, which are expected to find applications in domains such as autonomous driving and robotics. Video prediction has made notable progress in fidelity and long-term prediction, and world models are a promising means of acquiring temporal representations. One architecture that has performed well combines an encoder-decoder latent-variable model for image reconstruction with an autoregressive model that predicts the latent sequence. In this work, we extend the video prediction model VideoGPT, which uses a VQ-VAE and Image-GPT, by introducing action conditioning. Validation on CARLA and RoboNet showed improved performance compared to the model without conditioning.
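The conditioning idea above can be illustrated with a minimal sketch: an autoregressive model over VQ-VAE latent token indices whose per-token states are shifted by an action embedding, so different actions yield different next-token distributions. All sizes, names, and the mean-pooled "context" below are illustrative assumptions, not the paper's actual architecture (the abstract does not specify how the action enters the transformer).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not from the paper).
VOCAB = 16       # VQ-VAE codebook size
D = 8            # embedding dimension
N_ACTIONS = 4    # discrete action space

# Embedding tables for latent tokens and actions, plus an output projection.
tok_emb = rng.normal(size=(VOCAB, D))
act_emb = rng.normal(size=(N_ACTIONS, D))
W_out = rng.normal(size=(D, VOCAB))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def next_token_logits(tokens, action):
    """Logits for the next VQ latent token, conditioned on the agent's
    action by adding its embedding to every token embedding (one simple
    conditioning scheme; cross-attention would be another option)."""
    h = tok_emb[tokens] + act_emb[action]   # (T, D) action-conditioned states
    ctx = h.mean(axis=0)                    # toy stand-in for attention context
    return ctx @ W_out                      # (VOCAB,) logits over the codebook


tokens = [3, 7, 1]                          # previous VQ-VAE latent indices
dist_a0 = softmax(next_token_logits(tokens, action=0))
dist_a1 = softmax(next_token_logits(tokens, action=1))
# Different actions produce different distributions over the next latent token.
print(dist_a0.argmax(), dist_a1.argmax())
```

In a real model the mean-pooled context would be a causal transformer (Image-GPT style) over the latent tokens, and the decoded token sequence would be mapped back to frames by the VQ-VAE decoder.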