I like targeting atypical I/O media to get my creative juices flowing, improve accessibility, & build something more distinctive whose benefits are clear. For this I assemble my own platforms, so today my thanks go out to the projects I assembled for voice I/O!

eSpeak NG does sound a bit robotic, but I appreciate the hard work behind its extensive internationalization, its clarity at fast speeds, & all the knobs that let me take full advantage of the auditory medium!
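
For a taste of those knobs, an illustrative invocation (the values here are just ones I'd reach for, not recommendations):

```sh
# -v voice/language, -s speed in words per minute, -p pitch (0-99),
# -a amplitude (0-200), -g extra gap between words.
espeak-ng -v en-us -s 260 -p 45 -a 110 -g 2 "Fast but still clear speech."
# And the internationalization is one flag away:
espeak-ng -v de "Hallo Welt."
```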

1/2

On the input side my gratitude goes out to CMU Sphinx, Kaldi, Mozilla DeepSpeech, & Julius! As well as to Voice2JSON/Rhasspy for providing a high-level wrapper around your choice of these!

For privacy's sake I believe it is vital to run these voice-transcription AIs (which the named projects implement in a myriad of ways) locally. That isn't the typical approach, but it is how Voice2JSON works.
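
As a hedged sketch of what that wrapper looks like in practice (the profile name & output fields are from the Voice2JSON docs as I recall them; verify before relying on this):

```sh
# Compile the intent grammar from your sentences.ini templates:
voice2json --profile en-us_pocketsphinx-cmu train-profile
# Transcribe a recording, then parse the transcription into an intent:
voice2json --profile en-us_pocketsphinx-cmu transcribe-wav < light-on.wav \
  | voice2json --profile en-us_pocketsphinx-cmu recognize-intent
# => {"text": "turn on the light", "intent": {"name": "LightOn"}, ...}
```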

Please package Voice2JSON; it could be a great component for a free desktop!
2/2

P.S. Speech-to-text engines aren't nearly as well internationalized as eSpeak NG's text-to-speech, so that's one of many reasons why I need to allow for keyboard input. Internationalizing speech-to-text is a huge challenge since, put simply, it requires retraining the whole system!

P.P.S. Feel free to research how to generate more natural sounding voices without sacrificing eSpeak's benefits. I'd love to see what you come up with!

3/2

@alcinnz, cc: @devinprater, for an #eSpeakNG alternative, check out CMU's #Festvox. They've got several OSS projects, all with better-sounding voices:

* Flite (festival-lite), a small, fast, portable speech-synthesis system.
* Bard Storyteller: an ebook reader that uses Flite.

Both are based on the CMU Wilderness Multilingual Speech Dataset, a dataset of aligned sentences and audio for ~700 different languages, though it's unlicensed and served from a non-HTTPS site: festvox.org/

1/2

@alcinnz @devinprater, the #CMU #Flite speech synthesis system is also available as an #Android app in #FDroid.

It provides a demo page for the speech-synthesis system and a #Talkback plugin. The latest build has 13 English voices, plus 4 Indic ones: Hindi, Tamil, Telugu and Marathi.

This list could be extended to cover the full dataset, but it needs some love; currently there are no students on the project.

Available on a non-HTTPS site: cmuflite.org or github.com/festvox/flite

2/2

@walter @devinprater Does Flite provide the internationalization & tone-of-voice controls I love about eSpeak NG?

@alcinnz @walter Mmm, kinda the voice controls, not really the internationalization.

@alcinnz, I'm not sure how they've "created" the voices, but of the 13 English voices they provide, some are "pure" English voices and others have a German, Canadian or Indian accent; and then there are those Indic voices. So… probably not… maybe?

I don't know how they are created, but for a demo, you can install the FDroid build: f-droid.org/en/packages/edu.cm

cc: @devinprater

@alcinnz @devinprater, BTW, I came across this while trying to see how we could improve the accessibility of CLIs, but at the same time also address the lack of it during startup, in both BIOSes and bootloaders.

And we need a solid foundation, similar to what the framebuffer provides for video, but for accessibility: screen readers, braille, etc. For it, we could build a hardware co-processor that would accompany or replace the GPU; no need to waste CPU cycles on GUIs that are not used.

@walter @devinprater The other consideration is "Does text-to-speech actually need a co-processor? Is it expensive enough for that?"

Certainly freeing up cycles from graphical UIs would more than make up for the cycles used by text-to-speech.

Not that I haven't toyed with the concept of what this might look like...

@alcinnz @devinprater,
Yes, it sounds like a lot, but the hardware is optional, and we could use existing screen readers, which would receive structured data rather than formatted text like now.

I've got a more or less complete picture; I just need to publish some posts and see what's feasible and what's plain stupid. I think it is doable, and it would be a huge boon for accessibility, partly because the security people might also be interested: separation of concerns and a reduced attack surface.

@walter @devinprater Well, I for one am curious to see what you build!

Could be a nice complement to my "Rhapsode" auditory browser...

@alcinnz, I've been researching this for some time now. I even looked at the reasoning and history of Linux code changes for pipes, and why we only have 3 standard file descriptors: stdin, stdout and stderr. The goal is adding extra standard I/O FDs while keeping backward compatibility with POSIX pipes and not affecting performance; I plan to respond to a related security thread about it. I've already got about 2000 words, but I need to trim them down for Mastodon, or maybe… a blog? hint: ELF
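
For flavor: shells can already route extra numbered descriptors today, which is roughly the backward-compatibility story being described ('mytool' here is a made-up program):

```sh
# 'mytool' writes prose to stdout (FD 1) and, hypothetically,
# machine-readable records to FD 3.
mytool 3> records.jsonl   # capture the structured side-channel
mytool 3> /dev/null       # or discard it; FDs 0/1/2 behave as ever
```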

@devinprater

@alcinnz, I think that I even found a decent binary format for an initial test. I now need to find a decent replacement for printf(): a very light templating system that works well with buffered streams and doesn't reduce performance for users who don't benefit much from this separation of concerns. Also, binary encoding and fancy pipes mean that we can ditch libvte and "-czf". Then we can have fun building interactive sci-fi GUIs, without POSIX code pages.

@devinprater

@alcinnz @devinprater, thanks, now I've got more naming ideas for the first blog post, or maybe the series:

"Fancy interactive Sci-Fi GUIs for improved security and accessibility"

My previous working title contained security and accessibility next to something about the economics that led to the staggered keyboard, and having to leave that legacy behind.

@alcinnz @devinprater, and this thread explains one of the reasons I would like to have a hardware accessibility co-processor that replaces the GPU. But read it in full, because the topic kind of changes in the end.

programist.ro/@walter/10861818

@walter @devinprater What I can't stop thinking about while reading this thread is that this is precisely what I like about eSpeak NG! It's less of a chorus & more of a (relatively) talented narrator. And it very much describes what I'm exploring with Rhapsode, though that operates upon HTML markup rather than punctuation.

That's why I'm so reluctant to give up eSpeak's strengths, or, as with Gemini, inline markup!

Still not clear it needs a coprocessor...

@alcinnz Flite with a good model (eh, I got a file called "cmu_us_aew.flitevox", no idea where I got it) does a bunch better.

Noticed RHVoice and decided to try it now. Confusingly, it silently does nothing without a voice: you need a voice and a language package, and to specify the voice. ( wiki.archlinux.org/title/RHVoi ) Seems pretty good?
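
A minimal sketch of that setup, going by the Arch wiki page just cited (the voice name is illustrative):

```sh
# With both a language package and a voice installed,
# name the voice explicitly:
echo "Testing RHVoice." | RHVoice-test -p slt
```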

Tried Vosk via github.com/ideasman42/nerd-dic , also with lots of changes myself. Even with a not-great mic I got good results, though I'm not sure how well it works in practice.
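
For reference, the nerd-dictation flow as I recall it from that project's README (the model path is illustrative):

```sh
# Start listening with a local Vosk model; recognized words get typed out:
nerd-dictation begin --vosk-model-dir=./model &
# ...speak...
nerd-dictation end   # stop and flush the final text
```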

@alcinnz Oh, Flite has `--setf duration_stretch=0.9` too; I thought it sounded a bit better slowed down a bit.
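
Putting that flag together with the voice file mentioned upthread, a hedged example:

```sh
# Stretch factor taken from the post above; -voice also accepts
# a path to a .flitevox file. With no output file, flite plays aloud.
flite -voice ./cmu_us_aew.flitevox \
      --setf duration_stretch=0.9 \
      -t "Testing Flite with a stretched duration."
```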

@jasper Haven't thought about Flite for a while...

I think I bounced off it when I saw it didn't provide all the controls eSpeak does. I am making full use of those, and don't want to give any up!

Thanks for the advice anyways!

@alcinnz I need to find time to play with some of these. I wrote my home automation software with the goal of taking voice input, but never actually found a solution I liked for it. But there are a lot more great options out there now.

@alcinnz The biggest thing I wanted was not coupling speech-to-text with a recognition engine that needs pre-fed structures or tries to do its own intent detection. I really want to just feed a text string of what I said into my badly written software. I remember Rhasspy seeming very promising for that, but I never tried it. Will look at Voice2JSON.

@ocdtrekkie Yeah, maybe Voice2JSON isn't for you then... It is an intent parser upon a choice of backends, yielding JSON to your app representing what it parsed.
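
(That said, if plain transcriptions are all you want, you should be able to skip the intent stage; a sketch, with the output fields from memory:)

```sh
# transcribe-wav alone emits JSON carrying the raw "text" transcription,
# no intent parsing involved:
voice2json --profile en-us_pocketsphinx-cmu transcribe-wav < command.wav
# => {"text": "turn on the kitchen light", ...}
```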

@alcinnz I should also maybe consider abandoning my hope for plain text string outputs, but I found shoehorning the commands I wanted into those platforms was often really difficult.

@nixfreak It has its strengths & drawbacks. Personally I tend to use the Sphinx backend, though I'm sure Kaldi would be better.

@nixfreak It has its (smarthome) uses, but still: It is intended to be run on a server, even if a self-hosted one. So I tend to shy away.

@alcinnz so I'm looking for a TTS, so I can create a POC for automating scripts and other things. I would like to use #nim, so if I have to create a wrapper or create it from scratch, that's fine. I have no idea where to start though.

@nixfreak I'm personally partial to eSpeak NG for its tone-of-voice knobs, which I haven't seen matched by fancier tech, & its extensive internationalization! But if you can do without that & want something more natural-sounding, there are several other options. Flite's been getting recommended to me recently...

@nixfreak A tip with eSpeak NG though: the shell command has a poor scanner in its XML parser. It is a lot less buggy on long text when embedded as a library, as opposed to being piped into that shell command.

@alcinnz Kick ass! Flite is all in ANSI C; it contains no C++ or Scheme, thus requires more care in programming and is harder to customize at run-time.
