OpenAI Unveils Advanced Voice Models for Seamless Transcription and Speech

OpenAI has announced three new proprietary voice models: gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts. These models are designed to handle transcription and speech generation in noisy environments, with diverse accents and varying speech speeds, across more than 100 languages.

The gpt-4o-transcribe model boasts a 2.46% word error rate in English, significantly lower than that of OpenAI's two-year-old Whisper open-source speech-to-text model. The new models include noise cancellation and a semantic voice activity detector to improve transcription accuracy.
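For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustrative implementation using classic dynamic programming; the example sentences are invented, not from OpenAI's benchmarks.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a four-word reference -> 25% WER.
print(word_error_rate("the cat sat down", "the cat sat down"))  # 0.0
print(word_error_rate("the cat sat down", "the cat sat low"))   # 0.25
```

A 2.46% WER thus means roughly one word-level error for every 40 words of reference transcript.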

OpenAI is making these models available through its API so that third-party developers can build their own apps. Individual users can also try the models on the demo site, OpenAI.fm, for limited experimentation.

Pricing for the new models starts at $6.00 per 1M audio input tokens for gpt-4o-transcribe, with lower-cost options such as gpt-4o-mini-transcribe for transcription-only use.
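Token-based pricing can be turned into a per-request estimate with simple arithmetic. The helper below is a minimal sketch using only the $6.00-per-1M-audio-input-token figure stated above; the 250,000-token sample figure is an invented example, and real bills would also include text output tokens, which this ignores.

```python
# USD per 1M audio input tokens for gpt-4o-transcribe (figure from the article).
PRICE_PER_M_AUDIO_INPUT_TOKENS = 6.00

def audio_input_cost(tokens: int) -> float:
    """Estimated USD cost for a given number of audio input tokens."""
    return tokens / 1_000_000 * PRICE_PER_M_AUDIO_INPUT_TOKENS

# Hypothetical job of 250,000 audio input tokens -> $1.50.
print(f"${audio_input_cost(250_000):.2f}")  # $1.50
```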

Industry adoption of these advanced voice models is already underway, with companies like EliseAI and Decagon reporting significant improvements in voice AI performance. However, some critics have expressed concern that OpenAI may be shifting away from its previous emphasis on real-time, low-latency conversational AI.

Despite the competition in the AI transcription and speech space, OpenAI remains committed to refining its audio models and exploring custom voice capabilities while ensuring safety and responsible AI use. The company is also investing in multimodal AI, including video, to enable more dynamic and interactive agent-based experiences.

Source: https://venturebeat.com/ai/openais-new-voice-ai-models-gpt-4o-transcribe-let-you-add-speech-to-your-existing-text-apps-in-seconds