Keywords: automatic speech recognition, low-resource languages, unsupervised learning, zero-resource, graph neural networks
Zero-resource speech technology aims to discover discrete units from a limited amount of unannotated, raw speech data. Previous studies have mainly focused on learning discrete units from acoustic features segmented into fixed, small time frames. While achieving high unit quality, they suffer from a high bitrate due to the per-frame encoding. In this work, to lower the bitrate, we propose a novel approach based on a discrete autoencoder and graph convolutional networks. We exploit speech features discretized by vector-quantization encoding. Since the maximum number of discretized features is predetermined, we consider a directed graph in which each node represents a discretized acoustic feature and each edge a transition from one feature to another. Using graph convolution, we extract and encode the topological features of the graph into each node, and we then symmetrize the graph to apply spectral clustering on the node features. In terms of ABX error rate and estimated bitrate, we demonstrate that our model successfully decreases the bitrate while retaining unit quality.
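The graph construction and clustering steps described above can be sketched in simplified form. The snippet below is a minimal illustration, not the authors' implementation: it omits the discrete autoencoder and the graph-convolution encoding of node features, and instead applies spectral clustering directly to the symmetrized transition graph. The unit sequence, codebook size `V`, and cluster count `k` are hypothetical values chosen for demonstration.

```python
import numpy as np

# Hypothetical toy inputs (not from the paper):
units = np.array([0, 1, 1, 2, 0, 1, 2, 3, 3, 0, 1, 2])  # discretized VQ unit indices
V = 4  # predetermined codebook size = number of graph nodes
k = 2  # number of clusters (coarser units, lowering the bitrate)

# Directed transition graph: A[i, j] counts transitions from unit i to unit j.
A = np.zeros((V, V))
for i, j in zip(units[:-1], units[1:]):
    A[i, j] += 1

# Symmetrize the graph so spectral clustering applies.
W = A + A.T

# Unnormalized graph Laplacian L = D - W.
D = np.diag(W.sum(axis=1))
L = D - W

# Spectral embedding: eigenvectors of the k smallest eigenvalues of L.
vals, vecs = np.linalg.eigh(L)
X = vecs[:, :k]

# Simple k-means on the embedding (a stand-in for a library routine).
rng = np.random.default_rng(0)
centers = X[rng.choice(V, size=k, replace=False)]
for _ in range(20):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    for c in range(k):
        if (labels == c).any():
            centers[c] = X[labels == c].mean(axis=0)

# labels[u] maps each original VQ unit u to a coarser cluster id,
# so re-encoding the sequence with `coarse` needs fewer distinct symbols.
coarse = labels[units]
print(labels, coarse)
```

Mapping each of the `V` original units to one of `k < V` clusters is what reduces the bitrate: the re-encoded sequence draws from a smaller symbol inventory, at the cost of merging units the clustering deems topologically similar.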