OpenAI launched the Whisper API today, a hosted version of the open source Whisper speech-to-text model that the company released in September, alongside the rollout of the ChatGPT API.
Whisper, an automatic speech recognition system, is priced at $0.006 per minute. OpenAI says it enables accurate transcription in multiple languages, as well as translation from those languages into English, and accepts files in a range of formats, including M4A, MP3, MP4, MPEG, MPGA, WAV and WEBM.
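To put that price in perspective, here is a minimal sketch of what transcription would cost at $0.006 per minute. The helper function is hypothetical (it is not part of any OpenAI SDK), and it assumes simple linear pricing with no rounding to billing increments:

```python
# Hypothetical helper (not part of any OpenAI SDK): estimate the
# Whisper API cost for an audio file, using the $0.006/minute price
# quoted above and assuming straight linear pricing.

WHISPER_PRICE_PER_MINUTE = 0.006  # USD, per the announcement


def estimate_whisper_cost(duration_seconds: float) -> float:
    """Return the estimated transcription cost in US dollars."""
    minutes = duration_seconds / 60
    return round(minutes * WHISPER_PRICE_PER_MINUTE, 4)


# A one-hour podcast episode:
print(estimate_whisper_cost(3600))  # → 0.36
```

At these rates, an hour of audio comes out to roughly 36 cents, which illustrates why the API is pitched at high-volume transcription workloads.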
Plenty of organizations have developed highly capable speech recognition systems, which sit at the core of software and services from tech giants such as Google, Amazon and Meta. What sets Whisper apart, according to Greg Brockman, OpenAI's president and chairman, is its training on 680,000 hours of multilingual and "multitask" data collected from the web, which leads to improved recognition of unique accents, background noise and technical jargon.
Speaking to TechCrunch in a video call yesterday, Brockman said that releasing the open source model on its own wasn't enough to spur an entire developer ecosystem around it. The Whisper API is based on that same large open source model, he added, but optimized for faster and easier use.
In Brockman's view, plenty of obstacles stand in the way of businesses adopting voice transcription technology. A 2020 Statista survey found the three most prominent barriers to adoption to be spotty accuracy, recognition problems tied to accents or dialects, and cost.
OpenAI has acknowledged that Whisper has limitations, particularly around next-word prediction, because it was trained on large, potentially noisy data sets. The system may also include words in its transcriptions that were never actually spoken, presumably because it is trying both to predict the next word in the audio and to transcribe the recording itself. Whisper also doesn't perform equally well across languages: its error rate rises for languages underrepresented in its training data.
Bias in speech recognition technology is, unfortunately, nothing new. A 2020 Stanford study found that speech recognition systems from big tech companies including Amazon, Apple, Google, IBM and Microsoft were far less reliable for Black speakers than for white speakers, with an average word error rate of 35% versus 19%.
Limitations aside, OpenAI sees Whisper's transcription capabilities being used to improve existing apps, products and tools. Already, Speak, an AI-powered language learning app, is using the Whisper API to power a virtual conversational companion within the app.
If OpenAI can break into the speech-to-text market, it could prove quite profitable for the Microsoft-backed company. According to one estimate, the sector is projected to be worth $5.4 billion by 2026, up from $2.2 billion in 2021.
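As a back-of-the-envelope check, those two figures imply a compound annual growth rate of roughly 20% per year over the five-year span. A quick sketch of the calculation:

```python
# Implied compound annual growth rate (CAGR) of the speech-to-text
# market, from $2.2B (2021) to $5.4B (2026), per the estimate above.

def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.20 == 20%/year)."""
    return (end_value / start_value) ** (1 / years) - 1


growth = cagr(2.2, 5.4, 5)
print(f"{growth:.1%}")  # roughly 19.7% per year
```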
Brockman said OpenAI aspires to become a kind of universal intelligence, able to take in whatever information someone has, in whatever form, and act as a resource that multiplies their abilities.