Neural codec language model

May 24, 2025


A team of scientists from Microsoft has introduced a new AI system that can imitate a person's voice from a recording only three seconds long. The researchers trained a neural codec language model called VALL-E on discrete codes derived from an off-the-shelf neural audio codec model, and they treat text-to-speech (TTS) as a conditional language modeling task rather than as continuous signal regression.
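To make the idea of discrete codes concrete, here is a minimal sketch of how continuous audio frames could be turned into integer tokens by nearest-neighbor lookup in a codebook. It is a toy illustration only: the two-dimensional frames, the four-entry `TOY_CODEBOOK`, and the function names are assumptions made for this example and are not taken from VALL-E or from any real codec.

```python
def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame`
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))


def quantize(frames, codebook):
    """Map each continuous frame to one discrete code (an integer index)."""
    return [nearest_code(frame, codebook) for frame in frames]


# Hypothetical toy codebook: a real codec learns its codebooks from data.
TOY_CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical continuous "audio frames" (2-dimensional for readability).
toy_frames = [(0.1, -0.2), (0.9, 0.1), (0.2, 0.8), (1.1, 0.9)]

codes = quantize(toy_frames, TOY_CODEBOOK)
print(codes)  # [0, 1, 2, 3]
```

Once audio is represented as a sequence of integers like this, predicting speech becomes a matter of predicting the next token, much as a text language model does, instead of regressing a continuous waveform or spectrogram.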

The new system builds on the EnCodec audio compression technology, which was originally aimed at improving the quality of phone calls. Further work has shown that the model is capable of much more. VALL-E can not only imitate a voice but also reproduce its tone and even copy the acoustics of the environment in which the original recording was made. For example, if the original recording came from a telephone conversation, the result will also sound like a telephone conversation.

The VALL-E developers used over 60,000 hours of recordings during the initial training stage, which is hundreds of times more data than other existing systems use. VALL-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech from as little as a 3-second audio recording.
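The prompting idea can be pictured with a toy sketch: the text to be spoken and the codes of a short recording are given as context, and new codes are sampled one at a time so that they continue the recording. Everything here is hypothetical, including `toy_next_code_weights`, which merely stands in for a trained model, the phoneme-like `text_tokens`, and the example code values. It shows the shape of the loop, not the actual VALL-E architecture.

```python
import random


def toy_next_code_weights(text_tokens, history, vocab_size):
    """Hypothetical stand-in for a trained model's next-code distribution.

    Each code gets weight 1 plus the number of times it already occurs in
    `history`, so codes present in the prompt are more likely to be sampled.
    `text_tokens` only marks where text conditioning would enter; this toy
    ignores it.
    """
    weights = [1.0] * vocab_size
    for code in history:
        weights[code] += 1.0
    return weights


def continue_from_prompt(text_tokens, prompt_codes, n_new, vocab_size, seed=0):
    """Sample `n_new` codes autoregressively, conditioned on the prompt."""
    rng = random.Random(seed)
    history = list(prompt_codes)
    for _ in range(n_new):
        weights = toy_next_code_weights(text_tokens, history, vocab_size)
        next_code = rng.choices(range(vocab_size), weights=weights, k=1)[0]
        history.append(next_code)
    # Return only the newly generated codes, not the prompt itself.
    return history[len(prompt_codes):]


# Hypothetical inputs: phoneme-like text tokens and codes of a short recording.
text_tokens = ["HH", "AH", "L", "OW"]
prompt_codes = [3, 3, 5, 3, 7]

new_codes = continue_from_prompt(text_tokens, prompt_codes, n_new=8, vocab_size=10)
print(new_codes)
```

In a real system the generated codes would be passed to the codec's decoder to produce a waveform. No per-speaker retraining appears in the loop, because the speaker's recording enters only as context.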

In addition to cutting the time needed to produce a new voice, VALL-E generates much more natural-sounding synthetic speech than other models. According to the experimental results, VALL-E significantly outperforms current TTS systems in terms of speech naturalness and speaker similarity.

See the model demo on the website.

In the samples shown on the demo page, the “Speaker Prompt” column contains the speech sample to be imitated. The “Ground Truth” column contains a recording of the same speaker reading the required text. The “Baseline” column is an example of conventional text-to-speech synthesis. Finally, the “VALL-E” column shows the output of the new AI model.

For an example of a traditional online text-to-speech converter, try the TTS service provided by QDAT. It is free and available on both desktop and mobile devices.

Microsoft has not released the VALL-E source code publicly, noting the potential risks of misuse of the model, such as spoofing voice identification or impersonating a specific speaker. As a result, those who want to test the model themselves currently cannot.
