JSAI2018

Presentation information

Oral presentation

[2N1] [General Session] 10. Vision / Speech

Wed. Jun 6, 2018 9:00 AM - 10:20 AM Room N (2F Sakurajima)

Chair: Takenori Tsujikawa (辻川 剛範, NEC)

9:00 AM - 9:20 AM

[2N1-01] Expressive Speech Synthesis through modeling the variety of expressions by Variational Autoencoder

〇Kei Akuzawa1, Yusuke Iwasawa1, Yutaka Matsuo1 (1. University of Tokyo)

Keywords: Expressive Speech Synthesis, Variational Autoencoder, Autoregressive generative models

Recent advances in deep autoregressive generative modeling have improved the performance of speech synthesis (SS). However, how to make deep autoregressive SS systems expressive remains an open issue, because these models lack the ability to capture global characteristics of speech such as speaker individuality or speaking style. In this paper, we propose VAE-Loop, a model that integrates a variational autoencoder (VAE) with VoiceLoop, an autoregressive speech synthesis model. Unlike conventional autoregressive SS, the proposed method explicitly models the global characteristics of speech with a VAE, which enables control over the expressiveness of the synthesized speech. Experiments on VCTK and Blizzard2012 showed that the VAE helps VoiceLoop generate higher-quality speech and control expressions by learning these global characteristics.
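
As a rough illustration of the idea in the abstract, the sketch below shows the standard VAE ingredients that a model of this kind typically relies on: sampling an utterance-level latent with the reparameterization trick, computing the closed-form KL term against a standard normal prior, and feeding the same latent to every step of an autoregressive decoder. The abstract does not say how the latent enters VoiceLoop, so the concatenation used here is an assumption, and all function names are hypothetical. This is not the authors' implementation.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, with eps ~ N(0, I) and sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def condition_on_global_latent(frame_inputs, z):
    """Append the same utterance-level latent z to every per-frame input.

    Assumption: concatenation is only one plausible way to condition an
    autoregressive decoder on a global latent; the abstract does not specify it.
    """
    return [list(x) + list(z) for x in frame_inputs]
```

A possible use: encode a reference utterance to `mu` and `log_var`, draw `z` with `reparameterize(mu, log_var, random.Random(0))`, and pass `condition_on_global_latent(frames, z)` to the decoder. Varying `z` at synthesis time is then what would change the global style of the output, while the KL term keeps the latent close to the prior during training.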