A team of researchers at Microsoft has introduced a new AI system that can imitate a person's voice from a recording just three seconds long. The researchers trained a neural codec language model called VALL-E using discrete codes derived from an off-the-shelf neural audio codec model, and they treat text-to-speech (TTS) as a conditional language modeling task rather than continuous signal regression.
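To make the idea of "discrete codes" more concrete, here is a minimal toy sketch in Python. It is not EnCodec and not VALL-E: the codebook, the frames, and the function names `quantize` and `dequantize` are all hypothetical, chosen only for illustration. It shows the general principle of vector quantization, where each continuous audio frame is replaced by the index of its nearest codebook entry. A language model can then predict such integer tokens instead of regressing a continuous signal.

```python
# Toy illustration only: a hand-made 2-D codebook, not a real neural codec.
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous frame to the index of its nearest codebook vector."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two vectors of equal length.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))
        for frame in frames
    ]


def dequantize(codes: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn discrete codes back into (approximate) continuous frames."""
    return [list(codebook[c]) for c in codes]


# Hypothetical codebook with four entries and four made-up "audio frames".
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
FRAMES = [[0.1, -0.2], [0.9, 0.2], [0.2, 0.8], [1.1, 0.9]]

codes = quantize(FRAMES, CODEBOOK)
reconstructed = dequantize(codes, CODEBOOK)
print(codes)
print(reconstructed)
```

In a real system the codebook is learned by a neural network and there are usually several quantization stages, but the output has the same shape as in this sketch: a sequence of integers that a language model can be trained to predict.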
The new system is built on the EnCodec audio compression technology and was originally aimed at improving the quality of telephone conversations. Further work has shown that the model is capable of much more. VALL-E can not only imitate a voice but also reproduce the speaker's tone and even copy the acoustics of the environment in which the original recording was made. For example, if the original recording came from a telephone call, the synthesized result will also sound like a telephone call.
The VALL-E developers used over 60,000 hours of recordings during the initial training stage, which is hundreds of times more data than was used for other existing systems. VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech from as little as a 3-second audio recording.
In addition to shortening the training time needed to generate a new voice, VALL-E produces much more natural-sounding synthetic speech than other models. According to the experimental results, VALL-E significantly outperforms current TTS systems in terms of speech naturalness and speaker similarity.
See the model demo on the website.
In the samples shown on that page, the “Speaker Prompt” column contains the short voice samples to be imitated. The “Ground Truth” column contains a recording of the same speaker reading the target text, for comparison. The “Baseline” column is an example of conventional text-to-speech synthesis. Finally, the “VALL-E” column shows the output of the new AI model.
For an example of a traditional online text-to-speech converter, try the TTS service provided by QDAT. It is completely free and available on both desktop and mobile devices.
Microsoft has not released the VALL-E source code to the public, noting the potential risks of misuse of the model, such as spoofing voice identification or impersonating a specific speaker. As a result, anyone who wants to test the model will not be able to try it themselves.
See also:
An unofficial PyTorch implementation of VALL-E, based on the EnCodec tokenizer.