Our pioneering speech generation technologies help people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
Speech is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology built for generating natural, dynamic voices continues to improve, we're unlocking richer, more engaging digital experiences.
Over the past few years, we've been pushing the frontiers of audio generation, developing models that can create high-quality, natural speech from a range of inputs, such as text, pace controls and particular voices. This technology powers single-speaker audio in many Google products and experiments, including Gemini Live, Project Astra, Journey Voices and YouTube auto-dubbing, and is helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
Working in collaboration with partners across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue for making complex content more accessible:
- NotebookLM Audio Overviews turns uploaded documents into an engaging, lively dialogue. With one click, two AI hosts summarize the user's material, make connections between topics and banter back and forth.
- Illuminate creates formal, AI-generated discussions about research papers to help make knowledge more accessible and digestible.
Here, we provide an overview of our latest speech generation research underpinning all of these products and experimental tools.
Pioneering techniques for audio generation
For years, we've been investing in audio generation research and exploring new ways to generate more natural dialogue in our products and experimental tools. In our prior research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.
SoundStorm extended our earlier work, SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.
SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input, without compromising its quality. As part of its training process, SoundStream learns how to map audio to a sequence of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
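To make that round trip concrete, here's a minimal, illustrative sketch in Python. It is not SoundStream itself, which learns its codebooks and uses a neural encoder and decoder; this toy version just quantizes fixed-size audio frames against a single random codebook to show the audio-to-tokens-to-audio flow.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME = 320          # samples per frame, e.g. 20 ms at 16 kHz (illustrative)
CODEBOOK_SIZE = 256  # number of discrete codes (illustrative)

# A random codebook stands in for the learned one; SoundStream learns its
# representation end to end, which this toy version does not attempt.
codebook = rng.normal(size=(CODEBOOK_SIZE, FRAME))

def encode(waveform: np.ndarray) -> np.ndarray:
    """Map audio to one discrete token per frame (nearest codebook entry)."""
    frames = waveform[: len(waveform) // FRAME * FRAME].reshape(-1, FRAME)
    distances = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return distances.argmin(axis=1)              # shape: [num_frames]

def decode(tokens: np.ndarray) -> np.ndarray:
    """Reconstruct a waveform by looking up the codeword for each token."""
    return codebook[tokens].reshape(-1)

audio = rng.normal(size=16_000)                  # one second of placeholder audio
tokens = encode(audio)                           # discrete acoustic tokens
reconstruction = decode(tokens)                  # an audio-length signal again
print(tokens.shape, reconstruction.shape)        # (50,) (16000,)
```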
AudioLM treats audio generation as a language modeling task, producing the acoustic tokens of codecs such as SoundStream. As a result, the AudioLM framework makes no assumptions about the type or makeup of the audio being generated, and can flexibly handle a variety of sounds without needing architectural adjustments, which makes it a good candidate for modeling multi-speaker dialogues.
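Viewed this way, generating audio looks just like generating text: sample the next acoustic token, append it, repeat. The sketch below illustrates that loop with a deliberately tiny stand-in model; TinyTokenLM, the vocabulary size and the random prompt are placeholders for illustration, not anything from AudioLM itself.

```python
import torch
import torch.nn as nn

VOCAB = 1024  # size of the acoustic-token vocabulary (illustrative)

class TinyTokenLM(nn.Module):
    """Stand-in autoregressive model over acoustic tokens (not AudioLM)."""
    def __init__(self, vocab: int = VOCAB, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                # next-token logits per position

lm = TinyTokenLM()
tokens = torch.randint(0, VOCAB, (1, 10))       # a pretend "encoded audio" prompt

with torch.no_grad():
    for _ in range(50):                         # sample one acoustic token at a time
        logits = lm(tokens)[:, -1]              # distribution over the next token
        next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, next_token], dim=1)

print(tokens.shape)  # torch.Size([1, 60]); a codec would decode these back to audio
```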
An example of a multi-speaker dialogue generated by NotebookLM Audio Overviews, based on a few potato-related documents.
Building on this research, our latest speech generation technology can produce 2 minutes of dialogue, with improved naturalness, speaker consistency and acoustic quality, when given a script of dialogue and speaker turn markers. The model also performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means it generates audio more than 40 times faster than real time.
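The speed figure follows directly from those two numbers; a quick back-of-the-envelope check:

```python
audio_seconds = 2 * 60       # two minutes of generated dialogue
generation_seconds = 3       # reported upper bound on a single TPU v5e chip
print(audio_seconds / generation_seconds)   # 40.0, i.e. over 40x real time
```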
Scaling our audio generation models
Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we created an even more efficient speech codec that compresses audio into a sequence of tokens at as low as 600 bits per second, without compromising the quality of its output.
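For a sense of scale, here's what that bitrate implies for a two-minute dialogue, alongside a comparison against uncompressed 16-bit PCM at a 16 kHz sample rate; the PCM reference point is an assumption for illustration, not a figure from this work.

```python
bits_per_second = 600                 # codec bitrate quoted above
dialogue_seconds = 2 * 60             # the two-minute dialogues described above

total_bits = bits_per_second * dialogue_seconds
print(total_bits)                     # 72000 bits
print(total_bits / 8 / 1000)          # 9.0 -> roughly 9 kB of tokens in total

# Uncompressed 16-bit PCM at 16 kHz (an assumed reference, not from this post)
# runs at 256,000 bits per second:
print(16_000 * 16 / bits_per_second)  # ~427x more compact at 600 bps
```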
The tokens produced by our codec have a hierarchical structure and are grouped by time frames. The first tokens within a group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
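The grouping can be pictured as a small grid of tokens, one row per time frame, ordered coarse to fine within each row. The sizes below are made up for illustration; only the shape of the idea carries over.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_FRAMES = 8   # time frames (illustrative)
DEPTH = 4        # tokens per frame, ordered coarse -> fine (illustrative)

# One row per time frame; earlier columns carry phonetic and prosodic
# information, later columns carry fine acoustic detail.
tokens = rng.integers(0, 1024, size=(NUM_FRAMES, DEPTH))

coarse = tokens[:, 0]      # roughly: what is being said, and how
fine = tokens[:, -1]       # low-level detail needed for full fidelity

# Flattened frame by frame, this is the sequence a language model consumes:
flat = tokens.reshape(-1)  # [frame0 coarse ... frame0 fine, frame1 coarse, ...]
print(flat.shape)          # (32,)
```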
Even with our new speech codec, producing a 2-minute dialogue requires generating over 5,000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
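This post doesn't spell out that architecture, but one common way to exploit this kind of token hierarchy is to pair a larger model that attends across time frames with a smaller one that handles the few tokens inside each frame. The skeleton below is only a rough sketch of that general pattern; every size, and the wiring itself, is an assumption rather than a description of our model.

```python
import torch
import torch.nn as nn

VOCAB, DEPTH, DIM = 1024, 4, 64   # illustrative sizes, not the real model's

class TwoLevelTokenModel(nn.Module):
    """Rough sketch: a temporal Transformer over frames plus a small
    Transformer over the tokens inside each frame."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        temporal_layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        depth_layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(temporal_layer, num_layers=2)
        self.depth = nn.TransformerEncoder(depth_layer, num_layers=1)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, f, d = tokens.shape                    # [batch, frames, DEPTH]
        x = self.embed(tokens)                    # [b, f, d, DIM]
        frame_summary = x.sum(dim=2)              # one vector per time frame
        causal = nn.Transformer.generate_square_subsequent_mask(f)
        context = self.temporal(frame_summary, mask=causal)   # [b, f, DIM]
        # The small model refines each frame's tokens, conditioned on the
        # temporal context for that frame. (A full autoregressive model would
        # also mask within the frame and shift targets; omitted in this sketch.)
        y = self.depth(x.reshape(b * f, d, DIM) + context.reshape(b * f, 1, DIM))
        return self.head(y).reshape(b, f, d, VOCAB)

model = TwoLevelTokenModel()
dummy_tokens = torch.randint(0, VOCAB, (1, 8, DEPTH))
print(model(dummy_tokens).shape)                  # torch.Size([1, 8, 4, 1024])
```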
With this technique, we can efficiently generate the acoustic tokens that correspond to the dialogue within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.
Animation showing how our speech generation model autoregressively produces a stream of audio tokens, which are decoded back into a waveform of two-speaker dialogue.
To teach our model how to generate realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. Then we finetuned it on a much smaller dataset of dialogue with high acoustic quality and precise speaker annotations, consisting of unscripted conversations from a number of voice actors, along with realistic disfluencies, the "umm"s and "aah"s of real conversation. This step taught the model how to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.
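As a very rough sketch of that two-stage recipe, the toy loop below first trains on plentiful generic token data and then on a small amount of "dialogue" data. The random tensors, tiny model and step counts are all placeholders; only the pretrain-then-finetune structure mirrors what's described above.

```python
import torch
import torch.nn as nn

VOCAB, CONTEXT = 1024, 16
model = nn.Sequential(
    nn.Embedding(VOCAB, 32), nn.Flatten(), nn.Linear(32 * CONTEXT, VOCAB)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def run_stage(num_batches: int, lr: float) -> None:
    """Train for a while at the given learning rate on (fake) token data."""
    for group in optimizer.param_groups:
        group["lr"] = lr                  # finetuning typically uses a lower rate
    for _ in range(num_batches):
        batch = torch.randint(0, VOCAB, (8, CONTEXT + 1))    # placeholder tokens
        inputs, targets = batch[:, :CONTEXT], batch[:, CONTEXT]
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()           # next-token loss
        optimizer.step()

run_stage(num_batches=100, lr=1e-3)   # stage 1: plentiful, generic speech tokens
run_stage(num_batches=10, lr=1e-4)    # stage 2: scarce, high-quality dialogue tokens
```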
In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we're embedding our SynthID technology to watermark non-transient AI-generated audio from these models, helping safeguard against the potential misuse of this technology.
New speech experiences ahead
We're now focused on improving our model's fluency and acoustic quality, and on adding more fine-grained controls for features such as prosody, while exploring how best to combine these advances with other modalities, such as video.
The potential applications for advanced speech generation are vast, especially when combined with our Gemini family of models. From enhancing learning experiences to making content more universally accessible, we're excited to keep pushing the boundaries of what's possible with voice-based technologies.
Acknowledgements
Authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alexe Khaitonov, Alex Tudor, Victor Ungureanu, Karolis Misiunas, Sertan Girgin, Jake Waller and Marco Tagliasacchi.
Thanks to Leland Rechis, Ralph Leith, Paul Middleton, Poly Pat, Minh Trong and RJ Skerry-Ryan for their critical efforts on dialogue data.
We're very grateful to our colleagues across Labs, Illuminate, Cloud, Speech and YouTube for their exceptional work bringing these models into products.
We also thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Tokumine and James Zhao for their guidance on the project.