For the second experiment we looked at TTS. We used the same single-speaker speech databases
from which Google’s North American English and Mandarin Chinese TTS systems are built. The
North American English dataset contains 24.6 hours of speech data, and the Mandarin Chinese
dataset contains 34.8 hours; both were spoken by professional female speakers.
WaveNets for the TTS task were locally conditioned on linguistic features which were derived
from input texts. We also trained WaveNets conditioned on the logarithmic fundamental frequency
(log F0) values in addition to the linguistic features. External models predicting log F0 values and
phone durations from linguistic features were also trained for each language. The receptive field size
of the WaveNets was 240 milliseconds. As example-based and model-based speech synthesis baselines,
hidden Markov model (HMM)-driven unit selection concatenative (Gonzalvo et al., 2016) and
long short-term memory recurrent neural network (LSTM-RNN)-based statistical parametric (Zen
et al., 2016) speech synthesizers were built. Since the same datasets and linguistic features were
used to train both the baselines and WaveNets, these speech synthesizers could be fairly compared.
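The local conditioning described here follows the gated activation of the WaveNet architecture: linguistic-feature frames are upsampled to the audio sample rate and fed into each layer alongside the audio. The sketch below illustrates this mechanism; the layer sizes, feature dimensions, and nearest-neighbour upsampling are illustrative assumptions, and the dilated convolutions are simplified to per-sample linear maps.

```python
import numpy as np

def upsample(features, factor):
    # Repeat each linguistic-feature frame so the conditioning sequence
    # matches the audio sample rate (nearest-neighbour upsampling; a
    # transposed convolution is another common choice).
    return np.repeat(features, factor, axis=0)

def gated_layer(x, h, w_f, w_g, v_f, v_g):
    # Gated activation with local conditioning:
    #   z = tanh(W_f * x + V_f * h) . sigmoid(W_g * x + V_g * h)
    # where x is the audio path and h the upsampled conditioning.
    # The convolutions are simplified here to per-sample linear maps.
    filt = np.tanh(x @ w_f + h @ v_f)
    gate = 1.0 / (1.0 + np.exp(-(x @ w_g + h @ v_g)))
    return filt * gate

# Toy shapes (hypothetical): 4 linguistic-feature frames, 80 samples
# of audio per frame, 6 features, 16 residual channels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 6))           # (frames, feature_dim)
h = upsample(feats, 80)                   # (320, feature_dim)
x = rng.normal(size=(320, 16))            # (samples, channels)
w_f, w_g = rng.normal(size=(2, 16, 16))
v_f, v_g = rng.normal(size=(2, 6, 16))
z = gated_layer(x, h, w_f, w_g, v_f, v_g)
print(z.shape)  # (320, 16): one gated activation per audio sample
```

Because the tanh branch lies in (-1, 1) and the sigmoid gate in (0, 1), every output of the gated layer is bounded in magnitude by 1.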
To evaluate the performance of WaveNets for the TTS task, subjective paired comparison tests and
mean opinion score (MOS) tests were conducted. In the paired comparison tests, after listening to
each pair of samples, subjects were asked to choose which they preferred; they could choose
“neutral” if they had no preference. In the MOS tests, after listening to each stimulus, subjects
were asked to rate its naturalness on a five-point Likert scale (1: Bad, 2: Poor, 3: Fair, 4: Good,
5: Excellent). Please refer to Appendix B for details.
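The MOS statistic used in these tests is simply the mean of the Likert ratings; a confidence interval is usually reported alongside it. The sketch below (with made-up ratings) shows the computation, using a normal-approximation 95% interval as a common convention; the exact interval method used in the paper is not stated here.

```python
import math

def mos(ratings):
    # Mean opinion score over 5-point Likert ratings
    # (1: Bad ... 5: Excellent), with a 95% normal-approximation
    # confidence half-width.
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    half = 1.96 * math.sqrt(var / n)
    return mean, half

# Hypothetical ratings from eight listeners for one system.
ratings = [4, 5, 4, 4, 3, 5, 4, 4]
mean, half = mos(ratings)
print(f"MOS = {mean:.2f} +/- {half:.2f}")
```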
Fig. 5 shows a selection of the subjective paired comparison test results (see Appendix B for the
complete table). The results show that WaveNet outperformed the baseline statistical parametric and
concatenative speech synthesizers in both languages. We found that WaveNet conditioned on linguistic
features alone could synthesize speech samples with natural segmental quality, but it sometimes
produced unnatural prosody, stressing the wrong words in a sentence. This could be due to the
long-term dependencies in F0 contours: the receptive field of the WaveNet, 240 milliseconds, was not
long enough to capture them. WaveNet conditioned on both linguistic features and F0 values did not
have this problem: the external F0 prediction model runs at a lower frequency (200 Hz), so it can
learn the long-range dependencies present in F0 contours.
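The 240-millisecond receptive field is determined by the network's dilation schedule. The calculation below is standard for stacks of dilated causal convolutions; the schedule itself (dilations doubling from 1 to 512, repeated four times) is a hypothetical example, since the exact TTS configuration is not specified here, and it yields roughly 256 ms at 16 kHz, on the same order as the reported 240 ms.

```python
def receptive_field(dilations, kernel_size=2):
    # Each dilated causal convolution of kernel size k and dilation d
    # widens the receptive field by (k - 1) * d samples; the initial
    # sample contributes the leading 1.
    return 1 + (kernel_size - 1) * sum(dilations)

# Hypothetical schedule: dilations 1, 2, 4, ..., 512, repeated 4 times.
dilations = [2 ** i for i in range(10)] * 4
samples = receptive_field(dilations)
sample_rate = 16000  # 16 kHz speech
print(samples, 1000 * samples / sample_rate)  # samples, milliseconds
```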
Table 1 shows the MOS test results. WaveNets achieved five-point-scale MOSs in naturalness above
4.0, which were significantly better than those of the baseline systems, and the highest MOS values
yet reported with these training datasets and test sentences. The gap in MOS between the best
synthetic speech and natural speech decreased from 0.69 to 0.34 (a 51% reduction) in US English and
from 0.42 to 0.13 (a 69% reduction) in Mandarin Chinese.
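The percentage reductions quoted above follow directly from the MOS gaps:

```python
def gap_reduction(old_gap, new_gap):
    # Fraction of the naturalness gap (natural MOS minus synthetic MOS)
    # that is closed when the gap shrinks from old_gap to new_gap.
    return (old_gap - new_gap) / old_gap

print(round(100 * gap_reduction(0.69, 0.34)))  # US English: 51
print(round(100 * gap_reduction(0.42, 0.13)))  # Mandarin Chinese: 69
```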