Our pioneering speech technologies help people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
Speech is crucial to interpersonal connections. It helps people around the world exchange information and ideas, express emotions and build mutual understanding. As we improve our natural, dynamic voice generation technology, we deliver richer and more immersive digital experiences.
Over the past few years, we have been pushing the boundaries of audio generation, developing models that can create natural, high-quality speech from a range of inputs, such as text, pace controls and particular voices. This technology powers single-speaker audio across many Google products and experiments, including Gemini Live, Project Astra, Journey Voices and YouTube auto dubbing, and helps people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
Working with partners across Google, we recently helped develop two new features that can generate long-form dialogue with multiple speakers, making complex content more accessible:
- NotebookLM Audio Overviews turns uploaded documents into engaging and lively dialogue. With one click, two AI hosts summarize user material, make connections between topics and banter back and forth.
- Illuminate creates formal AI-generated discussions about research papers to help make knowledge more accessible and digestible.
Here we provide an overview of our latest speech generation research on which all of these experimental products and tools are based.
Pioneering speech generation techniques
We have been investing in audio generation research for years and exploring new ways to generate more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.
This extended our earlier work on SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.
SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input without compromising its quality. As part of the training process, SoundStream learns how to map audio to a range of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
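To make the tokenization idea concrete, here is a minimal sketch of residual vector quantization, the general style of discretization a neural codec like SoundStream relies on. The codebook sizes, frame dimension and function names below are illustrative assumptions, not SoundStream's actual architecture or API.

```python
# Illustrative sketch of residual vector quantization (RVQ), the style of
# tokenization used by neural codecs such as SoundStream. All sizes and
# names here are toy assumptions, not the real model.
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, CODEBOOK_SIZE, NUM_QUANTIZERS = 8, 16, 3

# Toy codebooks; in a real codec these are learned during training.
codebooks = rng.normal(size=(NUM_QUANTIZERS, CODEBOOK_SIZE, FRAME_DIM))

def encode(frame: np.ndarray) -> list[int]:
    """Map one embedded audio frame to a stack of discrete token ids.
    Each quantizer encodes the residual left by the previous one."""
    tokens, residual = [], frame
    for q in range(NUM_QUANTIZERS):
        distances = np.linalg.norm(codebooks[q] - residual, axis=1)
        idx = int(np.argmin(distances))  # nearest codebook entry
        tokens.append(idx)
        residual = residual - codebooks[q][idx]
    return tokens

def decode(tokens: list[int]) -> np.ndarray:
    """Reconstruct the frame embedding by summing codebook entries."""
    return sum(codebooks[q][idx] for q, idx in enumerate(tokens))

frame = rng.normal(size=FRAME_DIM)  # stand-in for an encoder output
tokens = encode(frame)
print(tokens, np.linalg.norm(frame - decode(tokens)))
```

Each added quantizer refines the reconstruction, which is why stacking a few small codebooks can capture fine detail like prosody and timbre at a low bitrate.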
AudioLM treats audio generation as a language modeling task, producing acoustic tokens for codecs such as SoundStream. As a result, the AudioLM framework makes no assumptions about the type or composition of the audio being generated, and can flexibly handle a variety of sounds without architectural adjustments, making it a good candidate for modeling multi-speaker dialogues.
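Here is a minimal sketch of that language-modeling view: acoustic tokens are sampled autoregressively, one step at a time, and the finished sequence would then be passed to a codec decoder to produce a waveform. The model below is a stand-in that samples from random logits; the vocabulary size and function names are assumptions for illustration.

```python
# Illustrative sketch of autoregressive generation over acoustic tokens,
# the language-modeling framing AudioLM uses. The "model" here is a dummy;
# a real system would run a trained Transformer over the token context.
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 1024  # assumed size of the acoustic token vocabulary

def next_token_logits(context: list[int]) -> np.ndarray:
    """Hypothetical model call; returns a score per vocabulary entry."""
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt: list[int], steps: int, temperature: float = 1.0) -> list[int]:
    """Sample tokens one at a time, conditioning on everything so far."""
    tokens = list(prompt)
    for _ in range(steps):
        logits = next_token_logits(tokens) / temperature
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens

# The generated token sequence would then go through the codec's decoder
# (e.g., a SoundStream-style decoder) to yield an audio waveform.
acoustic_tokens = generate(prompt=[1, 5, 42], steps=10)
print(acoustic_tokens)
```

Because the model only ever sees sequences of tokens, nothing in this loop is specific to one speaker or one kind of sound, which is what makes the framing flexible enough for multi-speaker dialogue.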