Let’s talk about Automatic Speech Recognition (ASR) technology: the state of the art, user and customer expectations, how ASR output stacks up against those expectations, and whether there is one ASR engine that meets all requirements. Spoiler alert: the answer to that last question is no. There’s no single ASR engine that can satisfy all industry needs. Why not? We’ll dive into that answer in a bit.
Here’s another question to ask about ASR technology: why should I, the consumer, look beyond the big three (Google, Apple, Microsoft... make that four, with IBM) to meet all my ASR requirements? Obviously they have the biggest R&D budgets and attract the best talent, so their technology should be the best, right?
The answer is, it depends.
ASR Technology Hits and Misses
For example, you want Google to turn on the lights: “Google! Turn on the driveway lights.” Or, “Siri! Play my ‘I’m really depressed’ mix.” Or, “Alexa, I need a vegan pizza, light pepperoni and cheese.” All of these technologies that use ASR to pick up on voice commands work pretty well.
However, there are a number of cases where these ASR technologies struggle. One simple example is the speech-to-text feature on a phone. Between autocorrect and misrecognized words, it’s definitely not perfect. What’s most frustrating is that it doesn’t learn: I have to correct my daughters’ names, as well as my engineering VP’s, EVERY TIME. This is a slightly different use case than query/response, but it’s similar: typically short sentences transcribed in real time, where errors creep in because the context is free-form, so comprehensive training isn’t possible.
How TranscribeMe Uses ASR
The TranscribeMe use case for ASR is neither of these. For example: “OK Google! Listen to this one-hour audio file and transcribe it with timestamps for every speaker change.” As they say colloquially, “that dog don’t hunt.” Why not? Because that’s not the use case Google built for.
Simplistically, the ASR industry breaks down into two use cases: query/response and audio-to-text transcription. TranscribeMe continually tests vendors’ speech engines, and the big three or four are never at the top of the list in terms of word error rate for our use case. That makes sense: long-form audio-to-text transcription, as opposed to short spoken commands, is not their design target.
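To make “word error rate” concrete, here’s a minimal Python sketch of the standard WER calculation (word-level edit distance divided by reference length). The sample strings are invented for illustration, not output from any engine we test:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance at the word level, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one substitution plus one deletion over six reference words.
print(wer("turn on the driveway lights please",
          "turn off the driveway lights"))  # ~0.33
```

The same arithmetic that makes a 0.33 look terrible for a voice command can look quite different over an hour of noisy multi-speaker audio, which is why per-use-case benchmarking matters.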
An example of a hypothetical TranscribeMe request might be: “Transcribe this six-hour legal deposition with five speakers using the State of Iowa output format, and include speaker IDs and speaker-change timestamps.” Truth time: no ASR engine is going to get that completely right. But some will be better than others.
So that’s where ASR analysis becomes more sophisticated. We’re not simply looking at word error rate but at other factors: Which engine punctuates or capitalizes best? Which works best with crosstalk? Which is stellar with a single speaker or with multichannel audio, versus which can handle multiple speakers on a single channel?
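In practice, that multi-factor comparison looks something like a scorecard. Here’s a hedged sketch of how one might rank engines per use case; the engine names, metric names, weights, and numbers below are all illustrative, not our actual benchmark data:

```python
# Illustrative scorecard: rank hypothetical engines by weighted metrics.
scores = {
    "engine_a": {"wer": 0.12, "punct_f1": 0.90, "diarization_err": 0.25},
    "engine_b": {"wer": 0.15, "punct_f1": 0.80, "diarization_err": 0.10},
}

# Weights differ by use case: a five-speaker deposition cares about diarization,
# so diarization error is penalized most heavily here.
weights = {"wer": -1.0, "punct_f1": 0.5, "diarization_err": -1.5}

def score(metrics: dict) -> float:
    # Higher is better; error-type metrics carry negative weights.
    return sum(weights[name] * value for name, value in metrics.items())

best = max(scores, key=lambda engine: score(scores[engine]))
print(best)  # engine_b wins here despite a worse WER, because diarization dominates
```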
Why do these qualifications matter? Because the speech engine is not going to produce final output that’s acceptable to the customer. Maybe it will produce output that’s 90% correct. That sounds pretty good, but what if your car worked 90% of the time? Pretty good, or totally unacceptable?
No ASR Engine’s Output is Perfect
With a few caveats, no ASR engine can produce output that a customer will accept as a finished product. The ASR engine produces output that then requires human review and correction for completion. And that human in the loop dictates which engine we use for various customers and use cases, the distinctions I mentioned above: dial up the ASR that excels at clear, single-speaker audio; or the one that accurately timestamps speaker changes; or the engine that doesn’t insert gibberish when it doesn’t understand the audio.
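To illustrate the “dial up the right engine” idea, here’s a minimal routing sketch. The engine names and audio-profile fields are hypothetical; a real selection system would be driven by ongoing benchmark results rather than hard-coded rules:

```python
from dataclasses import dataclass

@dataclass
class AudioProfile:
    speakers: int
    channels: int
    clean_audio: bool
    needs_speaker_timestamps: bool

def pick_engine(profile: AudioProfile) -> str:
    # Hypothetical routing rules, evaluated in priority order.
    if profile.needs_speaker_timestamps and profile.speakers > 1:
        return "engine_best_at_diarization"
    if profile.speakers == 1 and profile.clean_audio:
        return "engine_best_at_clean_single_speaker"
    if profile.channels > 1:
        return "engine_best_at_multichannel"
    return "engine_least_likely_to_emit_gibberish"

# A five-speaker, single-channel deposition needing speaker-change timestamps:
print(pick_engine(AudioProfile(speakers=5, channels=1, clean_audio=False,
                               needs_speaker_timestamps=True)))
```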
In summary, the TranscribeMe use case requires different engines for different types and qualities of audio and for specific use cases. Since we don’t build our own ASR, we can shop around and use any vendor that fits our needs and provides the best output for human review and correction.
I mentioned a caveat: there are cases where one-pass ASR output can satisfy customer requirements. In our case, we have a customer who runs further analytics on the raw ASR output. That analysis may be keyword spotting, sentiment analysis, or something else entirely.
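As a toy illustration of that kind of downstream analysis, here’s a minimal keyword-spotting pass over a transcript. The transcript text and keyword list are invented, and real analytics would be far richer (stemming, phrase matching, sentiment models, and so on):

```python
import re
from collections import Counter

def spot_keywords(transcript: str, keywords: set[str]) -> Counter:
    # Count case-insensitive whole-word keyword hits in the transcript.
    words = re.findall(r"[a-z']+", transcript.lower())
    return Counter(w for w in words if w in keywords)

# Invented example transcript and keyword list.
hits = spot_keywords(
    "The witness stated the contract was signed before the contract dispute.",
    {"contract", "witness", "dispute"},
)
print(hits)  # Counter({'contract': 2, 'witness': 1, 'dispute': 1})
```

Note that this kind of analysis tolerates a fair amount of word error, which is exactly why raw one-pass ASR output can be good enough for it.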
As an aside, be wary of any company using its own home-grown ASR to process your files. One size does not fit all, and companies that build their own ASR engines continually narrow the niches where they play.
Do you have examples of projects where you’ve found ASR technology challenging? We’d love to know. Are you looking for a company like TranscribeMe to help you with your transcription, AI dataset, or machine learning needs?