Voice2JSON, which I've been exploring the past few days & which underlies Rhapsode's newly-added user input, offers a choice of four lower-level speech-to-text backends. Today I want to briefly describe how these differ, then blog about it over the weekend.
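To make that concrete: whichever backend a profile bundles, you drive Voice2JSON the same way. A minimal sketch, assuming the voice2json CLI is installed (the profile name is one example from their docs, & shelling out like this is my illustration, not necessarily how Rhapsode does it):

```python
import json
import subprocess

def transcribe(wav_path, profile="en-us_pocketsphinx-cmu"):
    """Transcribe a WAV file with whichever backend the profile bundles.

    Swapping in, say, a Kaldi profile needs no code changes; only the
    profile name differs. (This name is just an example; see
    voice2json's docs for the available profiles.)
    """
    with open(wav_path, "rb") as wav:
        result = subprocess.run(
            ["voice2json", "--profile", profile, "transcribe-wav"],
            stdin=wav, capture_output=True, check=True,
        )
    # transcribe-wav emits one JSON object per input WAV
    return json.loads(result.stdout)["text"]

print(transcribe("turn-on-the-light.wav"))
```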
CMU PocketSphinx is probably the fastest & arguably the simplest. I don't understand all the details, but it measures various acoustic aspects of the input audio & matches them against a "language model", which can be one of a handful of types, including wakeword keyphrases.
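For instance, here's a minimal wakeword-spotting sketch using the pocketsphinx Python bindings' LiveSpeech helper (API per the pocketsphinx-python README; the keyphrase & threshold are placeholders to tune):

```python
from pocketsphinx import LiveSpeech

# Keyword-spotting "language model": listen on the default microphone
# & fire whenever the keyphrase scores above the detection threshold.
speech = LiveSpeech(
    lm=False,                  # disable the default n-gram model
    keyphrase="hey rhapsode",  # hypothetical wakeword
    kws_threshold=1e-20,       # lower = more sensitive, more false alarms
)
for phrase in speech:
    print("Wakeword heard:", phrase.segments(detailed=True))
```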
Kaldi (named for the Ethiopian goatherd who discovered the coffee plant) appears at a glance to work similarly to PocketSphinx, except it supports many, many more types of language models, including several types of neural nets (a custom implementation). Maybe even combining them?
I have heard that Kaldi's output is better, which makes sense. Interestingly, both Kaldi & PocketSphinx provide GStreamer plugins, so how about exposing this for automated subtitling, @elementary? A sketch of what that could look like follows.
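Here's live captioning via the PocketSphinx GStreamer element through PyGObject. The element name is documented, but the message-field names are from my reading of the CMUSphinx tutorial, so treat them as assumptions:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# Microphone -> convert/resample to what the recogniser expects ->
# pocketsphinx element, which posts hypotheses as element messages.
pipeline = Gst.parse_launch(
    "autoaudiosrc ! audioconvert ! audioresample ! pocketsphinx ! fakesink"
)

def on_message(bus, msg):
    s = msg.get_structure()
    # Field names ("final", "hypothesis") per the CMUSphinx docs, from memory.
    if s and s.get_name() == "pocketsphinx" and s.get_value("final"):
        print("Caption:", s.get_value("hypothesis"))

bus = pipeline.get_bus()
bus.add_signal_watch()
bus.connect("message::element", on_message)

pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()
```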
Finally, Julius seems more opinionated about the types of language models it uses, which should help make it fast & reliable. Specifically it uses "word N-gram and context-dependent HMM" models via a "2-pass tree-trellis search".
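Julius's actual decoder is far more sophisticated, but to illustrate what the "word N-gram" half contributes, here's a toy bigram scorer (the probabilities are invented for illustration): the acoustic HMM judges how the audio sounds, while scores like these judge which word sequences are plausible, & the decoder searches for the best combination of both.

```python
import math

# Toy bigram "language model": P(word | previous word).
# Real models are trained on large corpora; these numbers are invented.
bigram = {
    ("<s>", "turn"): 0.4, ("turn", "on"): 0.5,
    ("on", "the"): 0.6, ("the", "light"): 0.2,
    ("the", "lights"): 0.1,
}

def sentence_logprob(words, floor=1e-6):
    """Log-probability of a word sequence under the toy bigram model."""
    logp = 0.0
    for prev, word in zip(["<s>"] + words, words):
        logp += math.log(bigram.get((prev, word), floor))
    return logp

print(sentence_logprob(["turn", "on", "the", "light"]))   # more plausible
print(sentence_logprob(["turn", "on", "the", "lights"]))  # slightly less
```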
4/4 Fin, for now. Next: Back to GCC's optimization passes, specifically computing valid integer ranges.
And this weekend I'll summarise all these threads into a blogpost for Rhapsode, amongst other site updates.