Multi-speaker Text To Speech
Dmitry Obukhov, ML Researcher
Speech synthesis (text-to-speech, TTS) is the generation of a speech signal from written text. In a way, it is the opposite of speech recognition. Speech synthesis is used in medicine, dialogue systems, voice assistants and many other business tasks. As long as there is only one speaker, the task looks quite clear at first glance. With several speakers, the situation becomes more complicated and related tasks appear, such as voice cloning and voice conversion, which are discussed later in the text. This article aims to distinguish, in simple terms, between these concepts, which may seem similar at first glance. I think this write-up can serve as a bridge between experts in the field and the people who interact with them.
What you came here for
In the scheme considered in Figure 1, the speech synthesis model generates a voice recording of one given speaker. This is the speaker on which the model was previously trained.
If you want to diversify the synthesis with different voices, you can train a separate model for each desired voice, but this approach does not scale well, because each model needs tens of hours of clean speech for training. Therefore, it will not be considered here.
Suppose instead that we want to train a single model to speak in the voices of multiple speakers. In that case, it is important to take into account the characteristics of each speaker in order to replicate their voice through speech synthesis. To do this, we need to extract voice features in a representation that is convenient to store and process on a computer - we call this the speaker embedding.
It is important to note that the characteristics of the speaker do not depend on time, in contrast to the text or the audio signal. The speaker embedding is therefore usually a fixed-size vector.
In a perfect world, embeddings of a single speaker obtained from different audio samples would be identical, while embeddings of different speakers would differ. Since such perfection is extremely difficult to achieve in an imperfect world, the speaker embedding in speech synthesis is, as a rule, a trainable embedding that depends only on the speaker identifier.
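To make the idea of an embedding that depends only on the speaker identifier concrete, here is a minimal sketch in Python with NumPy. The names `speaker_table` and `speaker_embedding` and the sizes are my own assumptions for illustration; in a real model the table values would be learned together with the rest of the network, not drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_SPEAKERS = 4    # hypothetical number of speakers in the training data
EMBEDDING_DIM = 8   # hypothetical size of the speaker vector

# One row per speaker. Random values stand in for trained parameters.
speaker_table = rng.normal(size=(NUM_SPEAKERS, EMBEDDING_DIM))


def speaker_embedding(speaker_id: int) -> np.ndarray:
    """Return the fixed-size vector stored for a known speaker identifier."""
    if not 0 <= speaker_id < NUM_SPEAKERS:
        raise KeyError(f"unknown speaker id: {speaker_id}")
    return speaker_table[speaker_id]
```

Because the vector is looked up by identifier, every utterance of the same speaker gets exactly the same embedding, and an identifier outside the table has no embedding at all.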
To understand how and where exactly the information about the speaker is taken into account, you need to look a little into the black box of the speech synthesis model.
The best modern synthesis systems use neural network architectures in which two main components can be distinguished: an encoder and a decoder. In a highly simplified scheme, the encoder builds a latent representation from linguistic features, and the decoder translates this latent representation into acoustic features.
So, in such approaches, the speaker embedding is usually added to the latent representation, and thus the decoder generates features based on the text and the speaker.
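The encoder, decoder and speaker conditioning described above can be sketched roughly as follows. This is a toy under stated assumptions, not a real TTS architecture: a lookup table stands in for the encoder, a single matrix for the decoder, the weights are random instead of trained, and one latent step maps to one acoustic frame. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
VOCAB_SIZE, LATENT_DIM, N_FEATURES, NUM_SPEAKERS = 30, 16, 80, 3

token_table = rng.normal(size=(VOCAB_SIZE, LATENT_DIM))      # stand-in encoder
speaker_table = rng.normal(size=(NUM_SPEAKERS, LATENT_DIM))  # one vector per speaker
decoder_weights = rng.normal(size=(LATENT_DIM, N_FEATURES))  # stand-in decoder


def encode(token_ids):
    """Linguistic tokens -> latent sequence of shape (T, LATENT_DIM)."""
    return token_table[np.asarray(token_ids)]


def decode(latent):
    """Latent sequence -> acoustic features of shape (T, N_FEATURES)."""
    return np.tanh(latent) @ decoder_weights


def synthesize(token_ids, speaker_id):
    latent = encode(token_ids)
    # The speaker vector has no time axis, so broadcasting adds the
    # same vector to every time step of the latent sequence.
    conditioned = latent + speaker_table[speaker_id]
    return decode(conditioned)


features_a = synthesize([3, 7, 7, 12, 1], speaker_id=0)
features_b = synthesize([3, 7, 7, 12, 1], speaker_id=1)
```

The same token sequence yields different acoustic features for different speaker identifiers, which is the whole point of the conditioning.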
Now that the information about the speaker has been taken into account, you can build a multi-speaker TTS model that will speak in several desired voices at once.
Okay, our TTS speaks with different voices. Cool, but that's not all!
Speech synthesis in voices that are absent from the training data is impossible with this approach. In other words, the model can only speak in voices that it has heard (been trained on).
Generally speaking, having a model speak in an arbitrary voice, given only a sample of that voice, is a more difficult and less mature task at the moment, and is called Voice Cloning. In some scenarios, voice cloning implies that only a small amount of the target voice is available - a few minutes or even just several recorded samples (few-shot voice cloning) - to train or fine-tune a model so that it synthesizes a voice indistinguishable from these recordings. In other scenarios, it is assumed that the model sees only one recording of the target voice (one-shot voice cloning). Here, speaker adaptation is a straightforward solution: an average voice model, usually built on speech from multiple speakers, is adapted to the target voice using samples of the target speaker.
If the input is given not as written text but as audio, then we only need to change the voice in the original audio signal while preserving its linguistic content. This task is called Voice Conversion.
One approach to voice conversion is to recognize the linguistic content of the original audio recording and reproduce it in a given voice. For this reason, such approaches are called ASR + TTS. However, the text is usually not fully recognized; instead, synthesis takes some intermediate representation as input. Generative adversarial networks (GAN) and variational autoencoders (VAE) have also found wide application in the Voice Conversion task.
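One simple variant of speaker adaptation, assumed here purely for illustration, keeps the multi-speaker model frozen and fits only a new speaker vector to the target recordings, starting from the average of the known speakers. The sketch below does this for a toy linear decoder with simulated data; the names, sizes, learning rate and step count are all hypothetical, and real systems often fine-tune more than the speaker vector.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, N_FEATURES, NUM_FRAMES = 8, 20, 50  # hypothetical sizes

# Frozen stand-in for a decoder trained on many speakers.
decoder_weights = rng.normal(size=(LATENT_DIM, N_FEATURES))
known_speakers = rng.normal(size=(5, LATENT_DIM))


def decode(latent, speaker_vector):
    """Toy linear decoder conditioned on an additive speaker vector."""
    return (latent + speaker_vector) @ decoder_weights


def adapt_speaker(latent, target_features, start, steps=2000, lr=0.1):
    """Fit a speaker vector by gradient descent on the mean squared error.

    Only the speaker vector is updated; the decoder stays frozen.
    """
    speaker_vector = start.copy()
    for _ in range(steps):
        residual = decode(latent, speaker_vector) - target_features
        # Gradient of mean(residual ** 2) with respect to the speaker vector.
        grad = 2.0 * residual.sum(axis=0) @ decoder_weights.T / residual.size
        speaker_vector -= lr * grad
    return speaker_vector


# Simulated recordings of an unseen speaker with a hidden "true" vector.
true_new_speaker = rng.normal(size=LATENT_DIM)
latent = rng.normal(size=(NUM_FRAMES, LATENT_DIM))
target_features = decode(latent, true_new_speaker)

average_voice = known_speakers.mean(axis=0)  # starting point: the average speaker
adapted = adapt_speaker(latent, target_features, start=average_voice)
```

In this toy setting the adapted vector moves from the average voice towards the hidden target vector. With real audio there is no such clean target, and the amount of data needed depends heavily on the model.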
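The ASR + TTS structure can be sketched as a composition of two functions. Everything here is a hypothetical toy: "audio" is a list of numbers in which each sample encodes a content unit plus a small speaker-dependent offset, so that the recognizer and synthesizer fit in a few lines. Real systems use learned models for both stages.

```python
from typing import Callable, List

Audio = List[float]
Content = List[int]  # intermediate representation: one unit id per frame


def convert_voice(
    source_audio: Audio,
    target_speaker_id: int,
    recognize: Callable[[Audio], Content],
    synthesize: Callable[[Content, int], Audio],
) -> Audio:
    """Keep what is said, replace who says it."""
    content = recognize(source_audio)              # ASR side: drop speaker identity
    return synthesize(content, target_speaker_id)  # TTS side: render in the target voice


# Toy stand-ins: a sample is the unit id plus 0.1 * speaker id (speaker ids 0..4).
def toy_recognize(audio: Audio) -> Content:
    return [round(sample) for sample in audio]


def toy_synthesize(content: Content, speaker_id: int) -> Audio:
    return [unit + 0.1 * speaker_id for unit in content]


source = toy_synthesize([5, 2, 9, 2], speaker_id=1)
converted = convert_voice(source, 3, toy_recognize, toy_synthesize)
```

The converted signal carries the same content units as the source but the offset of the target speaker, which mirrors the goal of voice conversion in miniature.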
Until now, we have talked about only two components of speech audio - text, i.e. linguistic information, and speaker information.
Another important characteristic of a speech signal, from the point of view of human perception, is the style of pronunciation: emotions, intonation, prosody. For a machine, it is not at all trivial to determine what exactly in the audio signal is responsible for style. Nevertheless, the task of reproducing in synthesis the intonation of a given reference example is being studied, and is called Prosody Transfer.
A similar problem arises in Voice Conversion, when what needs to be transferred from a given reference audio is not the speaker's voice but the style. This class of tasks is called Emotional Voice Conversion.
I have described the tasks that I consider to be the main ones in the field of multi-speaker speech synthesis.
Of course, this is far from a complete list of tasks, and each of them is subdivided, more than once, into smaller ones. Such classifications can be based on specific goals, methods of solution, and possible limitations. I deliberately omit such classifications in order to simplify the presentation.