Expressive Speech Synthesis through modeling the variety of expressions by Variational Autoencoder

Kei Akuzawa

9:00 AM - 9:20 AM

[2N1-01] Expressive Speech Synthesis through modeling the variety of expressions by Variational Autoencoder

〇Kei Akuzawa¹, Iwasawa Yusuke¹, Matsuo Yutaka¹ (1. University of Tokyo)

Keywords: Expressive Speech Synthesis, Variational Autoencoder, Autoregressive generative models

Recent advances in deep autoregressive generative modeling have improved the performance of speech synthesis (SS). However, how to make deep autoregressive SS systems expressive remains an open issue, because these models lack the ability to capture global characteristics of speech such as speaker individuality or speaking style. In this paper, we propose a model called VAE-Loop, which integrates a variational autoencoder (VAE) with VoiceLoop, one of the autoregressive speech synthesis models. Unlike conventional autoregressive SS, the proposed method explicitly models the global characteristics of speech with a VAE, enabling control over the expressiveness of the synthesized speech. Experiments on VCTK and Blizzard2012 showed that the VAE helps VoiceLoop generate higher-quality speech and control expression by learning these global characteristics.
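
As a rough illustration of the VAE component mentioned in the abstract, the sketch below shows the two generic building blocks of a Gaussian VAE: sampling an utterance-level latent vector with the reparameterization trick, and the KL term that regularizes it toward a standard normal prior. This is not the authors' implementation. The function names, the latent dimension, and the idea of feeding the sampled vector to the synthesizer as a global conditioning input are assumptions made for the example.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
latent_dim = 8  # hypothetical size of the global (utterance-level) latent

# In a real system, mu and log_var would come from an encoder network
# that reads the whole utterance; here they are placeholders.
mu = np.zeros(latent_dim)
log_var = np.zeros(latent_dim)

z = reparameterize(mu, log_var, rng)    # assumed global conditioning vector
kl = kl_to_standard_normal(mu, log_var) # regularization term of the VAE loss
```

In a setup like this, varying `z` at synthesis time is what would allow the global characteristics (for example, speaking style) to be controlled independently of the text.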

Presentation information

[2N1] [General Session] 10. Vision / Speech
