Voice2JSON, which I've been exploring the past few days & which underlies Rhapsode's newly-added user input, offers a choice of four lower-level speech-to-text backends. Today I want to briefly describe how these differ, then blog over the weekend.


CMU PocketSphinx is probably the fastest & arguably the simplest. I don't understand all the details, but it measures various aspects of the input audio & matches them against a "language model", which can take one of a handful of forms, including wakewords.
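To give a flavour of wakeword detection (this is a toy sketch of the idea, not PocketSphinx's actual algorithm): slide a stored feature template over incoming audio frames & accept whenever a window matches closely enough.

```python
# Toy wakeword spotting: match incoming feature frames against a
# stored template, firing when the average distance clears a threshold.
# The frames & template here are made-up 2-dimensional feature vectors.

def frame_distance(a, b):
    """Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def detect_wakeword(frames, template, threshold):
    """Slide the template over the input; return the offset of the
    first window whose average frame distance beats the threshold."""
    n = len(template)
    for start in range(len(frames) - n + 1):
        window = frames[start:start + n]
        avg = sum(frame_distance(f, t) for f, t in zip(window, template)) / n
        if avg < threshold:
            return start  # wakeword heard at this frame offset
    return None

template = [[1.0, 0.0], [0.0, 1.0]]           # "wakeword" feature template
audio = [[0.2, 0.2], [0.9, 0.1], [0.1, 0.9]]  # incoming feature frames
print(detect_wakeword(audio, template, threshold=0.5))  # → 1
```

The real engine works on acoustic likelihoods rather than raw distances, but the shape of the problem (continuously scanning audio for one known pattern) is the same.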


Kaldi (named for the Ethiopian goatherd who, legend has it, discovered the coffee plant) appears at a glance to work similarly to PocketSphinx, except it supports many, many more types of language models, including several types of neural nets (via its own custom implementation). Maybe even combining them?
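To illustrate what one of these statistical language models does (a hugely simplified toy, nothing like Kaldi's internals): a bigram model learns how likely each word is to follow the last, then uses that to rank candidate transcriptions.

```python
# Toy bigram language model: train word-pair probabilities from example
# sentences, then score candidate transcriptions with them.
from collections import defaultdict

def train_bigrams(sentences):
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # <s> marks sentence start
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    # Normalise counts into conditional probabilities P(cur | prev).
    return {prev: {cur: n / sum(nexts.values()) for cur, n in nexts.items()}
            for prev, nexts in counts.items()}

def score(model, sentence):
    """Probability the model assigns to a candidate transcription."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= model.get(prev, {}).get(cur, 0.0)
    return p

model = train_bigrams(["turn on the light", "turn off the light"])
print(score(model, "turn on the light"))  # → 0.5 ("on" is 1 of 2 words after "turn")
print(score(model, "light the on turn"))  # → 0.0, never seen in training
```

A recognizer combines scores like these with its acoustic measurements to prefer transcriptions that actually look like sentences.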

I have heard that Kaldi's output is better, which makes sense. Interestingly both Kaldi & PocketSphinx provide GStreamer plugins, so how about exposing these for automated subtitling, @elementary ?


Mozilla DeepSpeech appears to be little more than specially-trained TensorFlow neural nets ("deep learning", since apparently neural nets need a new name once they get big enough). These tend to require much more training, but once trained they can give excellent results.

One drawback of neural nets is that it's harder for Voice2JSON to guide them to follow a more constrained grammar. So it converts that grammar into a "scorer", which biases the decoder toward the sentences the grammar allows.
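Conceptually (a toy sketch, not DeepSpeech's real scorer, which is a full language-model package consulted during beam search): the net proposes candidate transcriptions with acoustic scores, & the scorer boosts any candidate the grammar permits.

```python
# Toy rescoring: candidates come from the neural net as
# (text, acoustic_log_score) pairs; an external scorer adds a bonus
# for candidates the constrained grammar allows, then we keep the best.
# The grammar, candidates & bonus value here are all invented.

GRAMMAR = {"turn on the light", "turn off the light"}

def rescore(candidates, bonus=5.0):
    """Pick the candidate with the best combined acoustic + grammar score."""
    return max(candidates,
               key=lambda c: c[1] + (bonus if c[0] in GRAMMAR else 0.0))

candidates = [
    ("turn on the lied", -1.0),   # acoustically best, but ungrammatical
    ("turn on the light", -2.0),  # slightly worse acoustically
]
print(rescore(candidates)[0])  # → "turn on the light"
```

The net itself is untouched; only the choice among its outputs is steered, which is exactly why this is a workaround rather than real grammar support.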



Finally, Julius seems to be more opinionated about the type of language model it uses, which should help make it fast & reliable. Specifically it uses "word N-gram and context-dependent HMM" via a "2-pass tree-trellis search".
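The dynamic-programming trellis at the heart of HMM decoding can be sketched with the classic Viterbi algorithm (Julius's real search is a far more elaborate two-pass variant over word N-grams; the states & probabilities below are invented for illustration):

```python
# Toy Viterbi search: fill a trellis of (best probability, best path)
# per hidden state per observation, then read off the best final path.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    trellis = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        prev = trellis[-1]
        col = {}
        for s in states:
            # Best way to reach state s: extend whichever previous state
            # maximises (previous prob * transition prob * emission prob).
            p, path = max(
                (prev[r][0] * trans_p[r][s] * emit_p[s][o], prev[r][1] + [s])
                for r in states)
            col[s] = (p, path)
        trellis.append(col)
    return max(trellis[-1].values())[1]

states = ["silence", "speech"]
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {"silence": {"silence": 0.7, "speech": 0.3},
           "speech": {"silence": 0.2, "speech": 0.8}}
emit_p = {"silence": {"quiet": 0.9, "loud": 0.1},
          "speech": {"quiet": 0.3, "loud": 0.7}}
print(viterbi(["quiet", "loud", "loud"], states, start_p, trans_p, emit_p))
# → ['silence', 'speech', 'speech']
```

Julius's "tree-trellis" refinement organises the vocabulary as a prefix tree on the first pass & rescores the surviving hypotheses on the second, but the trellis recurrence is the same idea.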

4/4 Fin, for now. Next: Back to GCC's optimization passes, specifically computing valid integer ranges.

And this weekend I'll summarise all these threads into a blogpost for Rhapsode, amongst other site updates.
