Gemini 2.5 marks a significant step forward in AI-powered audio dialog and generation, enabling effective real-time communication and rich, nuanced conversations. The multimodal model can reason and generate speech natively in audio, with support for natural conversation, style control, tool integration, and more.
With Gemini 2.5 Flash Preview, users can hold natural, high-quality voice conversations in which the model adopts specific accents, tones, and expressions while conversing fluidly. The system is also trained to discern and disregard background speech and to respond only when appropriate.
Thinking capabilities carry over to dialog, making interactions more coherent and intelligent, particularly on tasks that require complex reasoning. Controllable text-to-speech (TTS) gives developers fine-grained control over generated audio: they can produce short snippets or long-form narratives while specifying style, tone, and emotional expression.
Gemini 2.5 Pro Preview provides state-of-the-art quality on complex prompts, while Gemini 2.5 Flash Preview is suited to everyday applications. The model’s support for multiple languages, audio-video understanding, and affective dialog also enables seamless conversations across languages and contexts.
Developers can now build richer, more interactive applications using the Gemini API in Google AI Studio or Vertex AI. Native audio output is available in two forms: native audio dialog with Gemini 2.5 Flash Preview, and controllable speech generation (TTS) for announcements, stories, podcasts, video games, and more.
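To make the idea of controllable speech generation concrete, here is a minimal sketch of how a developer might assemble a TTS request in which style is steered by a natural-language direction. Everything specific in it is an assumption rather than something stated in the source: the model identifier, the voice name "Kore", the JSON field names, and the helpers `build_style_prompt` and `build_tts_request` are illustrative only and should be checked against the current Gemini API reference. The sketch only builds the request body; it does not send anything over the network.

```python
import json

# Assumed model identifier (not given in the source); verify before use.
ASSUMED_TTS_MODEL = "gemini-2.5-flash-preview-tts"


def build_style_prompt(text: str, style: str) -> str:
    """Prefix the text to be spoken with a natural-language style direction."""
    return f"Say {style}: {text}"


def build_tts_request(text: str, style: str, voice_name: str = "Kore") -> dict:
    """Assemble a JSON-serializable request body that asks for audio output.

    The field names below are assumptions about the request shape and are
    not confirmed by the source. The voice name is likewise hypothetical.
    """
    return {
        "contents": [{"parts": [{"text": build_style_prompt(text, style)}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }


if __name__ == "__main__":
    body = build_tts_request(
        "The train to platform 4 is delayed.",
        "in a calm, apologetic tone",
    )
    # Print the payload that would accompany a request to ASSUMED_TTS_MODEL.
    print(ASSUMED_TTS_MODEL)
    print(json.dumps(body, indent=2))
```

Keeping the style direction in the prompt text, separate from the voice selection, reflects the description above: tone and emotional expression are specified in plain language, while the voice is a configuration choice.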
Source: https://blog.google/technology/google-deepmind/gemini-2-5-native-audio