JSAI2025

General Session

[3N4-GS-7] Vision, speech media processing:

Thu. May 29, 2025 1:40 PM - 3:20 PM Room N (Room 1009)

Chair: Eiichi Sunakawa (Toshiba)

3:00 PM - 3:20 PM

[3N4-GS-7-05] Binaural Audio Generation Using Diffusion Model Conditioned on Visual and 3D Positional Information of the Sound Sources

〇Haruka Okano1, Ryohei Orihara1, Yasuyuki Tahara1, Akihiko Ohsuga1, Yuichi Sei1 (1. The University of Electro-Communications)

Keywords:Audio Generation, Multimodal Processing, Image Processing, Diffusion Model

Binaural audio generation offers a way to obtain spatial audio without relying on costly and cumbersome recording equipment: monaural audio is transformed into binaural audio that simulates the auditory experience at the human ears, using videos and images as contextual information. In previous studies, regression models were used to predict masks for the complex spectrogram, from which both the amplitude and phase were obtained simultaneously. However, these approaches suffered from a lack of spatial perception. To address this, our research adopts a diffusion model to generate the amplitude spectrogram and a vocoder, which predicts the phase, to construct the waveform. Additionally, we incorporate features representing the 3D positions of the sound sources into the model, obtained by converting image pixels into real-world coordinates using depth estimation. The proposed model outperformed the baseline in reproducing the amplitude component of binaural audio.
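The conversion of image pixels into real-world coordinates mentioned in the abstract can be sketched as a standard pinhole-camera back-projection, where an estimated depth map supplies the missing distance. This is a minimal illustration, not the authors' implementation: the function name and intrinsic parameters (`fx`, `fy`, `cx`, `cy`) are hypothetical, and the paper does not specify which depth-estimation model or camera calibration it uses.

```python
import numpy as np

def pixel_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with an estimated depth (metres) into
    camera-frame 3D coordinates using the pinhole camera model.

    fx, fy: focal lengths in pixels; cx, cy: principal point (pixels).
    """
    x = (u - cx) * depth / fx  # horizontal offset scales with depth
    y = (v - cy) * depth / fy  # vertical offset scales with depth
    return np.array([x, y, depth])

# Example: a pixel 500 px to the right of the principal point, seen at
# 2 m depth with fx = 500 px, lies 2 m right of the camera axis.
print(pixel_to_3d(820, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

Feeding such per-source 3D coordinates to the generator, rather than raw pixel positions, is what lets the model condition on distance as well as direction.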
