JSAI2025

Presentation information

General Session

[3N4-GS-7] Vision, speech media processing

Thu. May 29, 2025 1:40 PM - 3:20 PM Room N (Room 1009)

Chair: 砂川 英一 (Toshiba)

2:40 PM - 3:00 PM

[3N4-GS-7-04] Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search

〇Yuta Oshima, Hiroki Furuta, Masahiro Suzuki, Yutaka Matsuo (The University of Tokyo)

Keywords: Video Generation, Diffusion Models, Alignment

Recent advances in video-generation diffusion models have enabled the production of high-resolution videos. Nevertheless, the generated videos frequently exhibit unnatural motion, deformations, reversed playback, or entirely static scenes. Considerable room remains for improving temporal perceptual quality, and the questions of which metrics to optimize, and how to optimize them, remain open. In this paper, we first point out that improving perceptual video quality while preserving alignment with prompts requires calibrating the rewards derived from existing metrics. When vision-language models are used as proxies for human evaluators, existing metrics for quantifying video naturalness do not always correlate well with those evaluations and, moreover, depend on the degree of dynamics described in the evaluation prompts. To address this, we propose a method that uses a linear combination of multiple metrics, thereby aligning reward values more closely with perceptual quality. Building on these calibrated rewards, we introduce Diffusion Latent Beam Search, which improves perceptual quality without modifying model parameters and outperforms greedy search and best-of-N sampling under the same computational budget. In addition, we provide practical guidelines on how to allocate computation at each diffusion step and how to tune the breadth and depth of the search.
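
The two ideas in the abstract, a reward formed as a linear combination of metrics and a beam search over diffusion latents, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the metrics, the weights, or the search details, so `calibrated_reward`, `latent_beam_search`, `toy_denoise_step`, the two toy metrics, and all parameter names below are hypothetical, and a short list of floats stands in for a video latent.

```python
import random
from typing import Callable, List, Sequence

Latent = List[float]
Metric = Callable[[Latent], float]
StepFn = Callable[[Latent, int, random.Random], Latent]


def calibrated_reward(metrics: Sequence[Metric],
                      weights: Sequence[float],
                      latent: Latent) -> float:
    """Weighted sum of metric scores (weights are assumed, not from the paper)."""
    if len(metrics) != len(weights):
        raise ValueError("metrics and weights must have the same length")
    return sum(w * m(latent) for w, m in zip(weights, metrics))


def latent_beam_search(init_latents: Sequence[Latent],
                       step_fn: StepFn,
                       reward_fn: Callable[[Latent], float],
                       num_steps: int,
                       beam_width: int,
                       num_children: int,
                       rng: random.Random) -> Latent:
    """Generic beam search over a stochastic denoising chain.

    At every step each surviving latent is expanded into `num_children`
    candidates, and only the `beam_width` highest-reward candidates are kept.
    Model parameters are never touched; only sampling is steered.
    """
    beams = [list(z) for z in init_latents]
    for t in range(num_steps, 0, -1):
        candidates = [step_fn(z, t, rng)
                      for z in beams
                      for _ in range(num_children)]
        candidates.sort(key=reward_fn, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_denoise_step(latent: Latent, t: int, rng: random.Random) -> Latent:
    """Hypothetical stand-in for one reverse-diffusion step."""
    return [0.9 * v + rng.gauss(0.0, 0.1 * t) for v in latent]


def smoothness(latent: Latent) -> float:
    """Toy metric: penalize large jumps between neighboring entries."""
    return -sum((a - b) ** 2 for a, b in zip(latent, latent[1:]))


def dynamics(latent: Latent) -> float:
    """Toy metric: reward overall variation, so static outputs score low."""
    return max(latent) - min(latent)


if __name__ == "__main__":
    rng = random.Random(0)
    start = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2)]

    def reward(z: Latent) -> float:
        return calibrated_reward([smoothness, dynamics], [0.7, 0.3], z)

    best = latent_beam_search(start, toy_denoise_step, reward,
                              num_steps=10, beam_width=2,
                              num_children=4, rng=rng)
    print("reward of selected latent:", reward(best))
```

In this sketch, `beam_width` and `num_children` play the role of the search breadth, and `num_steps` the depth; setting `beam_width=1` gives a greedy search, while expanding candidates only at the first step and keeping every chain to the end would resemble best-of-N sampling.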
