Google’s Advances in Natural-Sounding Speech Generation

Google has made significant strides in speech generation technology, enabling more natural and conversational digital assistants and AI tools. The company’s pioneering techniques for audio generation have led to improved models that can produce high-quality, dynamic voices.

Recent features built on this work include NotebookLM Audio Overviews and Illuminate, both of which generate long-form spoken dialogue to make complex content more accessible. They are powered by the same speech generation technology that Google uses across its products and experimental tools.

Google’s research has focused on developing more efficient audio compression methods and neural networks that can handle multi-speaker dialogues. Given a script with speaker turn markers, the company’s latest model can produce 2 minutes of dialogue, with improved naturalness, speaker consistency, and acoustic quality.

To improve performance, Google created a specialized Transformer architecture to efficiently generate acoustic tokens, which are then decoded into an audio waveform using its speech codec. The model was trained on hundreds of thousands of hours of speech data and fine-tuned on high-quality dialogue datasets.
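To give a rough sense of why efficient token generation matters for long dialogue, the sketch below estimates how many acoustic tokens a codec would emit for a clip of a given length. It is a back-of-the-envelope illustration, not Google’s implementation: the function name `token_budget`, the 600 bits-per-second bitrate, and the 10 bits per token are all assumptions chosen for the example and are not stated in the text above.

```python
import math


def token_budget(duration_s: float, bitrate_bps: float, bits_per_token: int) -> int:
    """Estimate how many discrete codec tokens cover `duration_s` seconds of audio.

    Hypothetical helper for illustration only: it assumes a constant-bitrate
    codec in which every token carries the same number of bits.
    """
    if duration_s < 0 or bitrate_bps <= 0 or bits_per_token <= 0:
        raise ValueError("duration must be >= 0; bitrate and bits_per_token must be > 0")
    total_bits = duration_s * bitrate_bps
    # Round up: a partially filled token still has to be generated.
    return math.ceil(total_bits / bits_per_token)


# Illustrative values only (assumptions, not figures from the text above).
DIALOGUE_SECONDS = 120        # the 2 minutes of dialogue mentioned earlier
ASSUMED_BITRATE_BPS = 600     # assumed low-bitrate speech codec
ASSUMED_BITS_PER_TOKEN = 10   # assumed codebook of 2**10 = 1024 entries

tokens = token_budget(DIALOGUE_SECONDS, ASSUMED_BITRATE_BPS, ASSUMED_BITS_PER_TOKEN)
print(f"{tokens} tokens for {DIALOGUE_SECONDS} s of audio")
```

Under these assumed values the model would need to generate several thousand tokens for a single 2-minute clip, which is the kind of sequence length that motivates a specialized architecture. Changing the bitrate or bits per token scales the estimate proportionally.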

As part of its commitment to responsible AI development, Google is incorporating SynthID technology to watermark non-transient AI-generated audio content. The watermark is intended to help safeguard against potential misuse of the technology.

Google’s future plans focus on improving the model’s fluency and acoustic quality and on adding fine-grained controls for features such as prosody. The company also envisions combining these advances with other modalities, such as video, to create new speech experiences.

Source: https://deepmind.google/discover/blog/pushing-the-frontiers-of-audio-generation