Deep learning speech synthesis

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or from a spectrum (vocoder). Deep neural networks are trained on large amounts of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

Given an input text or some sequence of linguistic units $Y$, the target speech $X$ can be derived by

$X=\arg \max P(X|Y,\theta )$

where $\theta$ is the set of model parameters.
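As a hedged illustration of how this objective relates to common regression losses, suppose (purely as an assumption for this sketch) that $X$ is a vector of acoustic features and that the model predicts it as $f_\theta(Y)$, where $f_\theta$ is a hypothetical symbol for the network output. If the conditional distribution is taken to be Gaussian with fixed variance $\sigma^2$, then

$-\log P(X|Y,\theta )={\frac {1}{2\sigma ^{2}}}\|X-f_{\theta }(Y)\|_{2}^{2}+{\text{const}}$

so fitting $\theta$ by maximum likelihood amounts to minimizing a squared error. If the distribution is instead taken to be Laplacian with fixed scale $b$, then

$-\log P(X|Y,\theta )={\frac {1}{b}}\|X-f_{\theta }(Y)\|_{1}+{\text{const}}$

and maximum likelihood amounts to minimizing an absolute error.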

Typically, the input text is first passed to an acoustic feature generator, and the resulting acoustic features are then passed to a neural vocoder. For the acoustic feature generator, the loss function is typically L1 loss (mean absolute error, MAE) or L2 loss (mean squared error, MSE). These loss functions implicitly assume that the output acoustic features follow a Laplacian (L1) or Gaussian (L2) distribution. In practice, since the human voice band ranges from approximately 300 to 4000 Hz, the loss function is often designed to penalize errors in this range more heavily:

${\text{loss}}=\alpha \,{\text{loss}}_{\text{human}}+(1-\alpha )\,{\text{loss}}_{\text{other}}$

where ${\text{loss}}_{\text{human}}$ is the loss over the human voice band, ${\text{loss}}_{\text{other}}$ is the loss over the remaining frequencies, and $\alpha$ is a scalar weight, typically around 0.5. The acoustic feature is typically a spectrogram or a mel-scale spectrogram. These features capture the time-frequency structure of the speech signal and are thus sufficient to generate intelligible outputs. The Mel-frequency cepstrum feature used in speech recognition is not suitable for speech synthesis, as it discards too much information.
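The band-weighted loss can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `band_weighted_l1`, the use of a linear-frequency magnitude spectrogram, the choice of L1 error, and the averaging of each term over its own bins are all assumptions made for the example.

```python
import numpy as np


def band_weighted_l1(pred, target, freqs_hz, alpha=0.5, band=(300.0, 4000.0)):
    """Band-weighted mean absolute error between two spectrograms.

    pred, target: arrays of shape (frames, bins).
    freqs_hz: array of shape (bins,) giving each bin's center frequency.
    alpha: weight on the voice-band term (hypothetical default of 0.5).
    band: (low, high) edges of the voice band in Hz.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)

    # Boolean mask selecting the bins that fall inside the voice band.
    in_band = (freqs_hz >= band[0]) & (freqs_hz <= band[1])
    if not in_band.any() or in_band.all():
        raise ValueError("band must select some, but not all, frequency bins")

    abs_err = np.abs(pred - target)
    loss_human = abs_err[:, in_band].mean()   # error inside the voice band
    loss_other = abs_err[:, ~in_band].mean()  # error in all remaining bins
    return alpha * loss_human + (1.0 - alpha) * loss_other


if __name__ == "__main__":
    freqs = np.linspace(0.0, 8000.0, 9)   # bins at 0, 1000, ..., 8000 Hz
    target = np.zeros((2, 9))
    pred = np.zeros((2, 9))
    pred[:, 1:5] = 1.0                    # unit error only at 1000-4000 Hz
    print(band_weighted_l1(pred, target, freqs))
```

Under these assumptions, each term is a mean over its own bins, so even with $\alpha = 0.5$ an individual in-band bin carries more weight whenever the voice band covers fewer bins than the rest of the spectrum.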

In September 2016, DeepMind released WaveNet, which demonstrated that deep learning-based models are capable of modeling raw waveforms and generating speech from acoustic features such as spectrograms or mel-spectrograms. Although WaveNet was initially considered too computationally expensive and slow for use in consumer products, a year after its release DeepMind unveiled a modified version known as "Parallel WaveNet," a production model 1,000 times faster than the original.
