Speech Synthesis Based on Deep Neural Networks with Direct Modeling of Amplitude Spectra
Ranniery Maia, Rui Seara

DOI: 10.14209/sbrt.2018.103
Event: XXXVI Simpósio Brasileiro de Telecomunicações e Processamento de Sinais (SBrT2018)
Keywords: Deep learning, deep neural networks, speech synthesis, text-to-speech (TTS) systems
Abstract
In recent state-of-the-art text-to-speech systems, a sequence of graphemes is usually mapped directly onto the speech waveform using deep neural networks. Despite reaching very high quality, these approaches tend to be computationally costly at synthesis time, and their training is usually not trivial to implement. In this paper, a method which can be interpreted as a simplified version of these systems is proposed. Here, frame-based smoothed log spectra, fundamental frequency, and phase information are modeled at training time, while synthesis runs in a straightforward fashion. Experiments show that the proposed approach outperforms traditional ones based on acoustic modeling of speech features.
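To make the modeling setup in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' code) of a feed-forward DNN that maps per-frame linguistic features to the three streams named above: smoothed log amplitude spectrum, fundamental frequency (F0), and phase information. All dimensions, layer sizes, and names (`LINGUISTIC_DIM`, `SPECTRUM_BINS`, `AcousticDNN`) are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of direct acoustic-stream modeling with a DNN.
# The architecture and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

LINGUISTIC_DIM = 300   # assumed size of the per-frame linguistic feature vector
SPECTRUM_BINS = 257    # e.g., half of a 512-point FFT plus the DC bin
HIDDEN = 512

class AcousticDNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared trunk over frame-level linguistic features.
        self.trunk = nn.Sequential(
            nn.Linear(LINGUISTIC_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        )
        # Separate heads, one per acoustic stream modeled at training time.
        self.log_spectrum = nn.Linear(HIDDEN, SPECTRUM_BINS)  # smoothed log amplitude spectrum
        self.log_f0 = nn.Linear(HIDDEN, 1)                    # log F0 per frame
        self.phase = nn.Linear(HIDDEN, SPECTRUM_BINS)         # phase-related stream

    def forward(self, x):
        h = self.trunk(x)
        return self.log_spectrum(h), self.log_f0(h), self.phase(h)

# Usage example: predict the three streams for a batch of 10 frames.
model = AcousticDNN()
frames = torch.randn(10, LINGUISTIC_DIM)
spec, f0, phase = model(frames)
print(spec.shape, f0.shape, phase.shape)  # (10, 257), (10, 1), (10, 257)
```

In such a setup, synthesis would reduce to a single forward pass per frame followed by waveform reconstruction from the predicted spectra and phase, which is consistent with the abstract's claim of a straightforward synthesis stage.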
