Book chapter · Peer-reviewed

Deep Recurrent Neural Networks in Speech Synthesis Using a Continuous Vocoder

2017; Springer Science+Business Media; Language: English

DOI

10.1007/978-3-319-66429-3_27

ISSN

1611-3349

Authors

Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh

Topic(s)

Speech and Audio Processing

Abstract

In our earlier work on statistical parametric speech synthesis, we proposed a vocoder using continuous F0 in combination with Maximum Voiced Frequency (MVF), which was successfully used with a feed-forward deep neural network (DNN). The advantage of a continuous vocoder in this scenario is that its parameters are simpler to model than those of traditional vocoders with discontinuous F0. However, feed-forward DNNs lack sequence modeling, which may degrade the quality of synthesized speech. To avoid this problem, we propose the use of sequence-to-sequence modeling with recurrent neural networks (RNNs). In this paper, four neural network architectures (long short-term memory (LSTM), bidirectional LSTM (BLSTM), gated recurrent unit (GRU), and standard RNN) are investigated and applied with this continuous vocoder to model F0, MVF, and Mel-Generalized Cepstrum (MGC) for more natural-sounding speech synthesis. Objective and subjective evaluations show that the proposed framework converges faster and achieves state-of-the-art speech synthesis performance while outperforming the conventional feed-forward DNN.
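The abstract contrasts feed-forward DNNs, which map each frame independently, with recurrent architectures such as the GRU, whose hidden state carries information across frames. The sketch below is a minimal NumPy implementation of a single GRU cell unrolled over a toy frame sequence; it is illustrative only and not the chapter's implementation. The input and hidden dimensions, weight initialization, and the idea of treating rows as frame-level linguistic features are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU (gated recurrent unit) cell in NumPy.

    Illustrative sketch only: weights are random, and the
    input/hidden sizes are hypothetical, not from the chapter.
    """
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        n = input_dim + hidden_dim
        # One weight matrix each for the update gate (z),
        # reset gate (r), and candidate state (h_tilde).
        self.Wz = rng.normal(0.0, 0.1, (hidden_dim, n))
        self.Wr = rng.normal(0.0, 0.1, (hidden_dim, n))
        self.Wh = rng.normal(0.0, 0.1, (hidden_dim, n))
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)   # update gate: how much to renew h
        r = sigmoid(self.Wr @ xh)   # reset gate: how much past to use
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        # New state interpolates between old state and candidate.
        return (1.0 - z) * h + z * h_tilde

def run_sequence(cell, frames):
    """Unroll the cell over a (T, input_dim) sequence of frames."""
    h = np.zeros(cell.hidden_dim)
    states = []
    for x in frames:
        h = cell.step(x, h)
        states.append(h)
    return np.stack(states)

# Toy example: 5 frames of 8 (hypothetical) linguistic features.
cell = GRUCell(input_dim=8, hidden_dim=16)
frames = np.random.default_rng(1).normal(size=(5, 8))
states = run_sequence(cell, frames)
print(states.shape)  # (5, 16): one hidden state per frame
```

In a full acoustic model, the per-frame hidden states would feed a linear output layer predicting the vocoder parameters (F0, MVF, and MGC), and the weights would be trained by backpropagation through time rather than drawn at random.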
