Israeli AI startup aiOla has announced the launch of Whisper-Medusa, an open-source speech recognition model that is 50% faster than OpenAI’s famous Whisper. The new model uses a novel “multi-head attention” architecture that predicts far more tokens at a time than the OpenAI offering.
Whisper-Medusa builds on Whisper but adds a multi-head attention mechanism that allows it to jointly attend to information from different representation subspaces at different positions. This enables the model to predict ten tokens at each pass instead of one, making speech prediction and generation 50% faster while maintaining the same level of accuracy as Whisper.
The company has released the code and weights on Hugging Face under an MIT license that allows for research and commercial usage. By releasing its solution as open source, aiOla encourages further innovation and collaboration within the community, which could lead to even greater speed improvements and refinements as developers and researchers contribute to and build upon its work.
Faster speech recognition has the potential to pave the way for compound AI systems that could understand and answer whatever users ask in almost real time. Speech recognition is not only driving key functions across sectors like healthcare and fintech but also powering very capable multimodal AI systems.
To develop Whisper-Medusa, aiOla modified Whisper’s architecture to add a multi-head attention mechanism, which enabled the model to predict ten tokens at each pass rather than the standard one token at a time. The company has started with a 10-head model and plans to expand to a larger 20-head version capable of predicting 20 tokens at a time, leading to faster recognition and transcription without any loss of accuracy.
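To make the tokens-per-pass arithmetic concrete, here is a minimal sketch that counts the best-case number of decoder passes needed to emit a transcript. The function name `decoding_passes` is hypothetical, and the sketch assumes every token predicted in a pass is kept, which is an idealization rather than a description of aiOla's implementation.

```python
import math


def decoding_passes(num_tokens: int, tokens_per_pass: int) -> int:
    """Best-case number of decoder passes needed to emit num_tokens.

    Assumption (not from the article): every token predicted in a pass
    is accepted, so each pass contributes tokens_per_pass tokens.
    """
    if num_tokens < 0 or tokens_per_pass < 1:
        raise ValueError("num_tokens must be >= 0 and tokens_per_pass >= 1")
    return math.ceil(num_tokens / tokens_per_pass)


if __name__ == "__main__":
    transcript_length = 100  # illustrative transcript length, in tokens
    for heads in (1, 10, 20):
        passes = decoding_passes(transcript_length, heads)
        print(f"{heads:>2} token(s) per pass: {passes} passes")
```

This pass count is only an upper bound on the benefit. The reported end-to-end gain is 50%, so the real-world speedup is smaller than the reduction in passes alone would suggest.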
The speech recognition model was trained using a machine-learning approach called weak supervision, where audio transcriptions generated by the model were used as labels to train additional token prediction modules. The company has tested the novel model on real enterprise data use cases to ensure it performs accurately in real-world scenarios.
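The weak-supervision idea can be sketched as turning a model-generated transcription into training targets for the extra prediction modules. The article does not describe the exact target layout, so the scheme below, in which each additional head learns to predict one of the next few tokens after a given prefix, is an assumption, and `build_head_targets` is a hypothetical helper rather than part of aiOla's code.

```python
def build_head_targets(pseudo_label: list[int], num_heads: int) -> list[dict]:
    """Build (prefix, targets) training pairs from a pseudo-label.

    pseudo_label: token IDs of a transcription produced by the base model
                  itself (no human annotation), used as a weak label.
    num_heads:    number of extra prediction heads (hypothetical layout:
                  head k is trained on the k-th token after the prefix).
    """
    if num_heads < 1:
        raise ValueError("num_heads must be >= 1")
    examples = []
    for t in range(len(pseudo_label)):
        # Tokens that follow position t; may be shorter than num_heads
        # near the end of the sequence.
        targets = pseudo_label[t + 1 : t + 1 + num_heads]
        if not targets:
            break
        examples.append({"prefix": pseudo_label[: t + 1], "targets": targets})
    return examples


if __name__ == "__main__":
    # Illustrative token IDs standing in for a model-generated transcript.
    for example in build_head_targets([5, 6, 7, 8], num_heads=2):
        print(example)
```

Because the labels come from the model's own output, no manually transcribed audio is needed for this step, which is the appeal of the approach described above.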
Source: https://venturebeat.com/ai/aiola-drops-ultra-fast-multi-head-speech-recognition-model-beats-openai-whisper/