Deep Voice is a real improvement over other text-to-speech pipelines

Hello! I am back with another interesting blog post at The Information Age. Wednesday calling. Baidu Research recently unveiled a paper from a research effort that caught my attention. The reason is that it appears to be a real improvement in text-to-speech natural language processing (NLP) over other pipelines such as WaveNet.

The link I share today, from the Baidu Research blog, covers precisely this development, and this post highlights some of the important points worth mentioning. (I hope I am not giving the impression that I just copy and paste in these kinds of posts, but bear with me: sometimes most of the content I share is really worth the read. From now on I will try to add more of my own comments where I deem them relevant.) After that, you can also find the link to the arXiv paper with its abstract, along with a few more comments where there are good points worth making.

On the Baidu Research page there are some sound widgets that demonstrate the software framework they developed, called Deep Voice. I encourage everyone to listen to them, but I will skip ahead and not reproduce them here.

Baidu Research presents Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. The biggest obstacle to building such a system thus far has been the speed of audio synthesis – previous approaches have taken minutes or hours to generate only a few seconds of speech. We solve this challenge and show that we can do audio synthesis in real-time, which amounts to an up to 400X speed-up over previous WaveNet inference implementations.
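A quick aside from me on what "real-time" means in this context. As far as I understand it, synthesis counts as real-time when producing a clip takes no longer than playing it back. A minimal sketch of that ratio follows; the function names and all the numbers are my own illustrative assumptions, not measurements from Baidu.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute time divided by audio duration; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_real_time(synthesis_seconds: float, audio_seconds: float) -> bool:
    """True when the clip is generated at least as fast as it plays back."""
    return real_time_factor(synthesis_seconds, audio_seconds) <= 1.0


# Illustrative numbers only (hypothetical, not taken from the paper).
slow = real_time_factor(synthesis_seconds=120.0, audio_seconds=2.0)  # minutes of compute for seconds of speech
fast = real_time_factor(synthesis_seconds=1.5, audio_seconds=2.0)    # finishes before playback would
speedup = slow / fast  # how many times faster the second system is
```

With these made-up figures the slow system needs 60 seconds of compute per second of audio, while the fast one needs only 0.75, so only the second could keep up with a live conversation.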

Synthesizing artificial human speech from text, commonly known as text-to-speech (TTS), is an essential component in many applications such as speech-enabled devices, navigation systems, and accessibility for the visually impaired. Fundamentally, it allows human-technology interaction without requiring visual interfaces.


Deep Voice is inspired by traditional text-to-speech pipelines and adopts the same structure, while replacing all components with neural networks and using simpler features. This makes our system more readily applicable to new datasets, voices, and domains without any manual data annotation or additional feature engineering.

The entire framework isn’t yet a full end-to-end pipeline, and I can only imagine how the API teams at Baidu are scratching their heads and putting hard thought and work into the task. But the models and components already built may be a step forward in state-of-the-art NLP systems:

Deep Voice lays the groundwork for truly end-to-end speech synthesis without a complex processing pipeline and without relying on hand-engineered features for inputs or pre-training.

Our current pipeline is not yet end-to-end, and consists of a phoneme model and an audio synthesis component.

We are now at the widgets section of the blog article mentioned above. The difference between a natural voice and a more robotic one is still perceptible:

The robotic nature of the voice comes from the pipeline structure and the phoneme model; the audio synthesis component alone generates much more natural clips. The following are clips using the audio synthesis module, but using features from the ground truth audio instead of the phoneme model.

These samples sound very close to the original audio, showing that our audio synthesis component can reproduce human voices very effectively. The following are the ground truth for the utterances above.


Deep learning has revolutionized many fields such as computer vision and speech recognition, and we believe that text-to-speech is now at a similar tipping point. We’re excited to see what the deep learning community can come up with and hope to accelerate that by sharing our entire text-to-speech system in reproducible detail.

The Paper

Now I share the paper and its abstract. Below is an image of the head of research at Baidu, who does not feature as an author of the paper. Nevertheless, the list of authors is quite long, and I imagine that Andrew Ng wasn’t upset about not being on it (he is on many other long author lists, so one more or one less doesn’t really matter, I believe). Certainly no one feels alone doing research at Baidu; that much seems guaranteed:

Deep Voice: Real-time Neural Text-to-Speech


We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The system comprises five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For the segmentation model, we propose a novel way of performing phoneme boundary detection with deep neural networks using connectionist temporal classification (CTC) loss. For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original. By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive domain expertise. Finally, we show that inference with our system can be performed faster than real time and describe optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
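To make the abstract's list of building blocks a bit more concrete, here is a rough sketch of how the stages might be chained when synthesizing speech. Everything in it is a toy stand-in of my own: the function names, the sample rate, the fixed durations and the sine-wave "synthesizer" are all assumptions for illustration, not the neural models from the paper. I have also left out the segmentation model, on the assumption that locating phoneme boundaries is needed to label training data rather than at synthesis time.

```python
import math
from typing import List

SAMPLE_RATE = 16000  # assumed sample rate for this toy example


def grapheme_to_phoneme(text: str) -> List[str]:
    # Toy stand-in: treat every letter as one "phoneme".
    return [ch for ch in text.lower() if ch.isalpha()]


def predict_durations(phonemes: List[str]) -> List[float]:
    # Toy stand-in: every phoneme lasts 0.1 seconds.
    return [0.1 for _ in phonemes]


def predict_f0(phonemes: List[str]) -> List[float]:
    # Toy stand-in: vowels get a higher fundamental frequency (in Hz).
    return [220.0 if p in "aeiou" else 110.0 for p in phonemes]


def synthesize_audio(phonemes: List[str], durations: List[float], f0s: List[float]) -> List[float]:
    # Toy stand-in for the audio synthesis model: one sine tone per phoneme.
    # The phoneme identity is ignored here; a real model would condition on it.
    samples: List[float] = []
    for _phoneme, duration, f0 in zip(phonemes, durations, f0s):
        n = int(round(duration * SAMPLE_RATE))
        samples.extend(math.sin(2 * math.pi * f0 * i / SAMPLE_RATE) for i in range(n))
    return samples


def text_to_speech(text: str) -> List[float]:
    # Chain the stages: text -> phonemes -> durations and F0 -> waveform.
    phonemes = grapheme_to_phoneme(text)
    durations = predict_durations(phonemes)
    f0s = predict_f0(phonemes)
    return synthesize_audio(phonemes, durations, f0s)


audio = text_to_speech("Deep Voice")
```

The point of the sketch is only the shape of the pipeline: each stage is a separate function with a narrow input and output, which is what lets every one of them be swapped for a neural network.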


Two final text-to-speech points. First, this is a production-quality system, meaning that commercialization is likely sooner rather than later. Second, I raise an eyebrow at the expression "faster than real time" inference performance: I admit to not having read the full details of the paper, but this expression left me scratching my head… See you again soon enough.

Body text image of Andrew Ng: NVIDIA GTC: The Race To Perfect Voice Recognition Using GPUs

featured image: Baidu Deep Voice explained: Part 1 — the Inference Pipeline

