Keywords: deep clustering, interpretability, speech separation
Deep clustering (DC) has been shown to perform impressively in various speech separation tasks. The idea is to train a model that produces an embedding for each time-frequency (TF) bin so that the embeddings of TF bins dominated by the same source are forced close to each other. To further enhance DC, it is important to make the embedding process interpretable, so that its limitations are easier to analyze and overcome. Motivated by this, we propose modeling the embedding process in DC using a network architecture that can be interpreted as fitting learnable spectrogram templates with non-negative entries to an input spectrogram. The proposed model enables us to visualize and understand the clues on which the model bases its embeddings when performing separation, while maintaining performance comparable to that of the original DC.
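To make the embedding idea concrete, the following is a minimal NumPy sketch of the affinity-based objective commonly used to train DC, written as the squared Frobenius norm of VV^T - YY^T. It illustrates only the generic DC training criterion, not the template-based architecture proposed here. The function names, array shapes, and random toy data are assumptions chosen for illustration.

```python
import numpy as np

def dominant_source_labels(mags):
    """One-hot labels marking the dominant source in each TF bin.

    mags: array of shape (C, F, T) holding magnitude spectrograms of C sources.
    Returns an array of shape (F*T, C).
    """
    num_sources = mags.shape[0]
    dominant = np.argmax(mags.reshape(num_sources, -1), axis=0)
    return np.eye(num_sources)[dominant]

def deep_clustering_loss(V, Y):
    """Squared Frobenius norm of V V^T - Y Y^T.

    V: (N, D) embeddings, one row per TF bin.
    Y: (N, C) one-hot dominant-source labels.
    Expanded so that the N x N affinity matrices are never formed.
    """
    VtV = V.T @ V
    VtY = V.T @ Y
    YtY = Y.T @ Y
    return float(np.sum(VtV ** 2) - 2.0 * np.sum(VtY ** 2) + np.sum(YtY ** 2))

# Toy data (hypothetical): 2 sources, 4 frequency bins, 5 frames.
rng = np.random.default_rng(0)
mags = rng.random((2, 4, 5))
Y = dominant_source_labels(mags)

# Random unit-norm embeddings of dimension 3, one per TF bin.
V = rng.standard_normal((Y.shape[0], 3))
V /= np.linalg.norm(V, axis=1, keepdims=True)

print("loss with random embeddings:", deep_clustering_loss(V, Y))
print("loss with ideal embeddings: ", deep_clustering_loss(Y, Y))
```

The loss reaches zero when embeddings of bins dominated by the same source coincide and embeddings of bins dominated by different sources are orthogonal, which is the sense in which same-source embeddings are "forced close to each other". At test time, the embeddings are typically clustered (for example with k-means) to obtain one TF mask per source.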