JSAI2025

General Session

[3N4-GS-7] Vision, speech media processing:

Thu. May 29, 2025 1:40 PM - 3:20 PM Room N (Room 1009)

Chair: Eiichi Sunakawa (Toshiba)

3:00 PM - 3:20 PM

[3N4-GS-7-05] Binaural Audio Generation Using Diffusion Model Conditioned on Visual and 3D Positional Information of the Sound Sources

〇Haruka Okano1, Ryohei Orihara1, Yasuyuki Tahara1, Akihiko Ohsuga1, Yuichi Sei1 (1. The University of Electro-Communications)

Keywords:Audio Generation, Multimodal Processing, Image Processing, Diffusion Model

Binaural audio generation offers a way to obtain spatial audio without relying on costly and cumbersome recording equipment: monaural audio is transformed into binaural audio that simulates the auditory experience at the human ears, using videos and images as contextual information. In previous studies, regression models were used to predict masks for the complex spectrogram, from which both the amplitude and phase were obtained simultaneously. However, these approaches suffered from a lack of spatial perception. To address this, our research adopts a diffusion model to generate the amplitude spectrogram and a vocoder, which predicts the phase, to construct the waveform. Additionally, we incorporate features representing the 3D positions of the sound sources into the model, obtained by converting image pixels into real-world coordinates using depth estimation. The proposed model outperformed the baseline in reproducing the amplitude component of binaural audio.
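The conversion of image pixels into real-world coordinates mentioned in the abstract can be sketched as a standard pinhole-camera back-projection, where an estimated depth map supplies the missing distance. This is a minimal illustration, not the authors' implementation: the function name and intrinsic parameters (`fx`, `fy`, `cx`, `cy`) are hypothetical, and the paper does not specify which depth-estimation model or camera calibration it uses.

```python
import numpy as np

def pixel_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with an estimated depth (metres) into
    camera-frame 3D coordinates using the pinhole camera model.

    fx, fy: focal lengths in pixels; cx, cy: principal point (pixels).
    """
    x = (u - cx) * depth / fx  # horizontal offset scales with depth
    y = (v - cy) * depth / fy  # vertical offset scales with depth
    return np.array([x, y, depth])

# Example: a pixel 500 px to the right of the principal point, seen at
# 2 m depth with fx = 500 px, lies 2 m right of the camera axis.
print(pixel_to_3d(820, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

Feeding such per-source 3D coordinates to the generator, rather than raw pixel positions, is what lets the model condition on distance as well as direction.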
