[3Win5-99] Audio Editing Utilizing the Attention Mechanism and Optimization in Latent Space with Diffusion Models
Keywords: Diffusion Models, Audio Editing, Optimization
In recent years, zero-shot editing methods built on pretrained diffusion models have gained traction in computer vision, with growing interest in applying them to audio editing. Existing audio editing approaches leverage cross-attention maps from diffusion models but fail to generalize across diverse prompts, reducing their reliability in practical scenarios. In this study, we propose a novel audio editing framework that operates in the latent space using the attention mechanism of diffusion models. Our method refines edits through cross-attention manipulation while optimizing the similarity between the editing instruction and the intermediate edited audio, ensuring precise alignment. We evaluate our approach against prior methods on multiple audio clips, demonstrating that our framework achieves high editing accuracy while preserving the original structure, outperforming prior methods in maintaining audio consistency.
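The core idea of the optimization step can be illustrated with a minimal sketch. The abstract does not specify the encoders or loss, so the snippet below assumes a stand-in linear audio encoder and a fixed prompt embedding, and simply performs gradient ascent on the cosine similarity between the (hypothetical) embedding of an intermediate edited latent and the embedding of the editing instruction; it is an illustration of the general technique, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch: nudge an intermediate edited latent z so that a
# stand-in audio embedding of z aligns better with the edit-prompt embedding.
# embed_audio, text_emb, and all dimensions are placeholder assumptions.

rng = np.random.default_rng(0)
D, E = 16, 8
W = rng.normal(size=(E, D))      # stand-in (linear) audio encoder

def embed_audio(z):
    return W @ z

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

text_emb = rng.normal(size=E)    # stand-in embedding of the editing instruction
z = rng.normal(size=D)           # intermediate edited latent

lr, eps = 0.05, 1e-5
before = cosine(embed_audio(z), text_emb)
for _ in range(200):
    # numerical gradient of the similarity w.r.t. z (finite differences)
    base = cosine(embed_audio(z), text_emb)
    grad = np.zeros(D)
    for i in range(D):
        zp = z.copy()
        zp[i] += eps
        grad[i] = (cosine(embed_audio(zp), text_emb) - base) / eps
    z += lr * grad               # ascend the similarity

after = cosine(embed_audio(z), text_emb)
print(before, after)
```

After the loop, the similarity between the latent's embedding and the prompt embedding increases; in the actual framework this objective would be optimized jointly with cross-attention manipulation so that the edit stays aligned with the instruction while preserving the original audio's structure.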