I've been trying out several speech-to-text providers and have much to report.
Let's skip the preamble. Here is my ranking:
1. Deepgram (good)
2. AssemblyAI (good)
3. AWS Transcribe (bad)
NOTE: OpenAI provides a Whisper API, but it has no diarization (see below), so I did not consider it. Likewise, I can run Whisper myself for free with great transcription speed, but again with no diarization.
# Diarization (Speaker Segmentation)
For whatever reason, "diarization" is the term of art used most often in the world of speech-to-text / ASR. Yes, there are lots of terms in this field. It just means automatically detecting when the speaker changes in a recording.
For my purposes it means I can display a text transcript that is split up by speaker. This is very useful for anyone reading the transcript.
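To make "split up by speaker" concrete, here is a minimal sketch of turning word-level speaker labels into a readable transcript. The input shape (a list of dicts with `word` and `speaker` keys) is my own simplification for illustration, not any particular provider's response format, and `group_by_speaker` is a hypothetical helper name.

```python
from itertools import groupby


def group_by_speaker(words):
    """Collapse consecutive words from the same speaker into one turn.

    `words` is assumed to be a list of dicts like
    {"word": "Hello", "speaker": 0}; real provider output will differ.
    """
    turns = []
    # groupby only merges *adjacent* items, which is what we want:
    # a speaker who talks twice gets two separate turns.
    for speaker, run in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["word"] for w in run)))
    return turns


# Made-up sample data, purely for illustration.
words = [
    {"word": "Hello", "speaker": 0},
    {"word": "everyone.", "speaker": 0},
    {"word": "Hi", "speaker": 1},
    {"word": "there.", "speaker": 1},
    {"word": "Let's", "speaker": 0},
    {"word": "start.", "speaker": 0},
]

for speaker, text in group_by_speaker(words):
    print(f"Speaker {speaker}: {text}")
```

Each provider returns speaker labels in its own structure, so in practice you would map that structure into something like the `words` list first.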
# Deepgram & AssemblyAI
These are both very good. I started out with AssemblyAI since they are pretty cheap, especially with their `nano` model. However, the `nano` model really isn't great for my use case, so I had to go exploring elsewhere.
AssemblyAI's `best` model is quite good, and seemed to handle spoken surnames and city names quite well. I was very pleased with the output. The only reason I think it loses out to Deepgram is that in a few cases the speaker diarization completely failed: all the text was smashed together under a single speaker. The same files, uploaded to Deepgram, resulted in separate speakers. The accuracy was far from 100% even with Deepgram, but at least there was some segmentation.
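The "everything under a single speaker" failure is easy to flag automatically when you know a recording has more than one person in it. Below is a rough sketch, assuming the transcript has already been reduced to a list of `(speaker, text)` tuples; that shape and the name `diarization_collapsed` are my own, not part of either provider's API.

```python
def diarization_collapsed(turns, min_speakers=2):
    """Return True if fewer than `min_speakers` distinct speakers appear.

    `turns` is assumed to be a list of (speaker, text) tuples.
    An empty transcript also counts as collapsed.
    """
    speakers = {speaker for speaker, _ in turns}
    return len(speakers) < min_speakers


# Made-up examples: one collapsed result, one segmented result.
collapsed = [(0, "Hello everyone. Hi there. Let's start.")]
segmented = [(0, "Hello everyone."), (1, "Hi there."), (0, "Let's start.")]

print(diarization_collapsed(collapsed))
print(diarization_collapsed(segmented))
```

This only catches the total-failure case. It says nothing about whether the speaker boundaries that do exist are in the right places.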
# AWS Transcribe
Horrendously bad. Unexpectedly bad. I was expecting industrial-strength transcription, especially for meeting recordings (my use case), but no such luck. The transcription was ugly, lacked capitalization, and the speaker segmentation was wildly inaccurate.
Maybe it's just that this product existed before Whisper, and Whisper is better. Who knows. However, I would not pursue this avenue unless you have lots of AWS credits and nothing else to do with them...
# Future Work
- Azure
- GCP?
- LemonFox (https://www.lemonfox.ai/apis/speech-to-text)
  - I read somewhere that the diarization isn't great, but I can't find the link now.