WaveNet paper

For the second experiment we looked at TTS. We used the same single-speaker speech databases

from which Google’s North American English and Mandarin Chinese TTS systems are built. The

North American English dataset contains 24.6 hours of speech data, and the Mandarin Chinese

dataset contains 34.8 hours; both were spoken by professional female speakers.

WaveNets for the TTS task were locally conditioned on linguistic features which were derived

from input texts. We also trained WaveNets conditioned on the logarithmic fundamental frequency (log F₀) values in addition to the linguistic features. External models predicting log F₀ values and phone durations from linguistic features were also trained for each language. The receptive field size

of the WaveNets was 240 milliseconds. As example-based and model-based speech synthesis baselines, hidden Markov model (HMM)-driven unit selection concatenative (Gonzalvo et al., 2016) and

long short-term memory recurrent neural network (LSTM-RNN)-based statistical parametric (Zen

et al., 2016) speech synthesizers were built. Since the same datasets and linguistic features were

used to train both the baselines and WaveNets, these speech synthesizers could be fairly compared.
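The 240-millisecond receptive field follows from the dilated convolution stack: with kernel size k and dilations d_i, one output sample covers (k − 1) · Σ d_i + 1 input samples. A minimal sketch of this arithmetic (the kernel size, dilation schedule, and 16 kHz sample rate below are illustrative assumptions, not the exact configuration reported in the paper):

```python
# Receptive field of a stack of dilated causal convolutions:
# receptive_field = (kernel_size - 1) * sum(dilations) + 1 samples.

def receptive_field_samples(kernel_size, dilations):
    """Number of input samples covered by one output sample."""
    return (kernel_size - 1) * sum(dilations) + 1

# Illustrative WaveNet-style schedule: dilations double from 1 to 512,
# and the cycle is repeated four times (assumed values).
dilations = [2 ** i for i in range(10)] * 4

rf = receptive_field_samples(kernel_size=2, dilations=dilations)
sample_rate = 16000  # Hz, assumed
print(rf, "samples =", 1000 * rf / sample_rate, "ms")
```

With these assumed values the stack covers 4093 samples, i.e. roughly 256 ms at 16 kHz, the same order as the 240 ms quoted above.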

To evaluate the performance of WaveNets for the TTS task, subjective paired comparison tests and

mean opinion score (MOS) tests were conducted. In the paired comparison tests, after listening to

each pair of samples, the subjects were asked to choose which they preferred, though they could

choose “neutral” if they did not have any preference. In the MOS tests, after listening to each

stimulus, the subjects were asked to rate the naturalness of the stimulus on a five-point Likert scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent). Please refer to Appendix B for details.
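Both test protocols reduce to simple statistics: a preference percentage per system (with neutral votes tallied separately) for the paired test, and a mean of the 1–5 ratings for the MOS test. A sketch of the scoring, using made-up vote counts and ratings rather than the paper's data:

```python
# Scoring for the two subjective tests (illustrative data only).

# Paired comparison: votes for system A, system B, and "neutral".
votes = {"A": 120, "B": 60, "neutral": 20}
total = sum(votes.values())
preference = {k: 100 * v / total for k, v in votes.items()}  # percentages

# MOS: mean of five-point Likert ratings (1 = Bad ... 5 = Excellent).
ratings = [4, 5, 4, 3, 5, 4]
mos = sum(ratings) / len(ratings)

print(preference)
print(round(mos, 2))
```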

Fig. 5 shows a selection of the subjective paired comparison test results (see Appendix B for the

complete table). The results show that WaveNet outperformed the baseline statistical parametric and concatenative speech synthesizers in both languages. We found that WaveNet

conditioned on linguistic features could synthesize speech samples with natural segmental quality, but sometimes exhibited unnatural prosody, stressing the wrong words in a sentence. This could be due

to the long-term dependency of F₀ contours: the size of the receptive field of the WaveNet, 240 milliseconds, was not long enough to capture such long-term dependencies. WaveNet conditioned on both linguistic features and F₀ values did not have this problem: the external F₀ prediction model runs at a lower frequency (200 Hz), so it can learn the long-range dependencies that exist in F₀ contours.
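Because the F₀ model runs at 200 Hz while the waveform model operates at the audio sample rate, its output must be upsampled before it can drive local conditioning. One simple scheme is to repeat each frame; the 16 kHz rate and the repetition approach below are illustrative assumptions (a learned transposed-convolution upsampler is an alternative):

```python
# Upsample frame-rate conditioning features (e.g. log F0 at 200 Hz)
# to the waveform sample rate by repeating each frame value.

def upsample_repeat(frames, frame_rate, sample_rate):
    """Repeat each frame sample_rate // frame_rate times."""
    factor = sample_rate // frame_rate
    return [f for f in frames for _ in range(factor)]

f0_frames = [120.0, 121.5, 123.0]  # assumed F0-like values at 200 Hz
samples = upsample_repeat(f0_frames, frame_rate=200, sample_rate=16000)
print(len(samples))  # 240: 3 frames * 80 samples per frame
```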



Table 1 shows the MOS test results. WaveNets achieved five-point naturalness MOS values above 4.0, which were significantly better than those of the baseline systems.

These were the highest MOS values ever reported with these training datasets and test sentences.

The gap in MOS between the best synthetic speech and natural speech decreased from 0.69 to 0.34 (a 51% reduction) in US English and from 0.42 to 0.13 (a 69% reduction) in Mandarin Chinese.
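The quoted percentages are relative reductions of the MOS gap to natural speech; the arithmetic can be checked directly:

```python
# Relative reduction of the MOS gap to natural speech (numbers from the text).

def gap_reduction(old_gap, new_gap):
    """Percentage by which the gap shrank."""
    return 100 * (old_gap - new_gap) / old_gap

print(round(gap_reduction(0.69, 0.34)))  # US English: 51
print(round(gap_reduction(0.42, 0.13)))  # Mandarin Chinese: 69
```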
