Saturday, May 17, 2008


I've had a request from a reader, Eric, which I am honored to fulfill: that I post a general call for assistance in his work to enable speech recognition software for Windows to also operate on Linux. This would be extremely helpful to our disabled community members who are very dependent on speech recognition software for their access to the Internet. Below is his explanation and appeal. PLEASE pass this on to anyone (or any blog) you feel might be interested in it. You can post responses to the comments section here (which will be read by Eric) and/or contact him directly at


Today, the only really good speech recognition environment is NaturallySpeaking. The authoritative version runs on Windows, and there is a port to Mac OS X by a small company in New Hampshire. Both of these systems have the same problems: they only work with a limited number of applications, and only with great difficulty will they work with remote applications (i.e. Linux applications on another machine). On the political front, the primary problem is that neither platform is open, and both are rather monopolistic in their business philosophies. As a result, the Open Source Speech Recognition Initiative was formed. It's a nonprofit organization whose mission is making speech recognition work on Linux.

We recognized early on that doing a port was financially and politically unfeasible. We would always lag behind the current system and would end up having to put together an entire support organization to handle questions from users. Instead, we decided to take a more practical route: using Wine to run the native Windows version. We got lucky. At the same time, a small number of people on the Wine project took an interest in supporting NaturallySpeaking, and today we have a system which mostly works but can only dictate to applications within the Wine context. This is where our next set of volunteers can help.

There are three levels of functionality that would be useful to disabled speech recognition users on Linux. They are listed in order of importance because each one provides a foundation for the next. The first is fundamental in that you can't do anything without it. The second is a significant uptick in terms of ease of use and application interaction. The third is also very useful but it is not as big a jump in usability (I think) as the first two.

The fundamental need is for basic keystroke injection and context recognition (executable name and taskbar). This would give us the equivalent of what we have today (i.e. natural text) for most applications. We should be able to dictate text and correct it. We should be able to use macro programming environments such as Vocola to implement commands for driving remote applications through keystrokes. This step is the most important one because it would enable users to transition from a split Windows/Linux environment to a Wine+Linux environment. This is not to say it shouldn't operate in a Windows/Linux split environment, because some people need that, but it is the first step to ditching Windows.
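To make the keystroke-injection idea concrete, here is a minimal sketch of the kind of command-to-keystroke mapping a Vocola-style macro layer would sit on top of, once a basic injection backend exists. Everything here (the command table, the `expand` function, the token format) is invented for illustration and is not Vocola's actual syntax or API; a real backend would replay these tokens via something like XTest on the Linux side.

```python
# Hypothetical command table: each spoken command maps to a sequence of
# keystroke tokens that an injection backend would replay into the
# focused application.  Names and tokens are illustrative only.
COMMANDS = {
    "save file":   ["ctrl+s"],
    "next window": ["alt+tab"],
    "open line":   ["End", "Return"],
}

def expand(utterance):
    """Translate a recognized utterance into keystroke tokens.

    Unrecognized utterances fall through as literal text, so plain
    dictation still works alongside commands.
    """
    keys = COMMANDS.get(utterance.lower())
    if keys is not None:
        return list(keys)
    # Literal dictation: hand the raw text to the backend to type out.
    return [("literal", utterance)]
```

The point of the sketch is that the hard part is not the table lookup; it is the injection backend and the context recognition (knowing which executable has focus) that the table's output ultimately depends on.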

The second level is a modification of the dictation box concept. In NaturallySpeaking, the dictation box brings up a window and puts into it whatever the application has selected. If there is nothing selected, then the dictation box is empty. In this box, the user can add more text or use Select-and-Say functionality to edit the text. On completion, the user transfers the text back to the application. The main failings of the current dictation box are that it assumes plain text, and that cutting and pasting is handled using Windows APIs or ctrl-c/ctrl-v, which, as we all know from the dog's breakfast that is X11 cut and paste, won't work. Another annoyance is that after transferring, the dictation box goes away.

An ideal dictation box would allow cutting and pasting etc. as needed, but also allow application-specific cut and paste sequences, as well as possibly application-specific plug-ins for presentation and for reformatting text before reinjection. I don't really know if this last item is needed, but based on what I've seen of the existing dictation box, there are strong hints that it might be.
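The fetch/edit/transfer cycle described above, with pluggable per-application transfer handlers, can be sketched in a few lines. This is an illustrative model only; the class and method names are invented for this example, and real handlers would wrap X11 selections or keystroke sequences rather than Python callables.

```python
class DictationBox:
    """Sketch of a dictation box with pluggable transfer handlers."""

    def __init__(self, fetch, inject):
        # fetch(): return the target application's current selection
        # (empty string if nothing is selected, so the box opens empty).
        # inject(text): send edited text back via an app-specific sequence.
        self.fetch = fetch
        self.inject = inject
        self.text = ""

    def open(self):
        # Seed the box with whatever the application has selected.
        self.text = self.fetch()

    def edit(self, new_text):
        # Stand-in for Select-and-Say editing inside the box.
        self.text = new_text

    def transfer(self):
        # Send the text back; unlike the current box, stay open for reuse.
        self.inject(self.text)

# Toy handlers standing in for an application-specific plug-in:
received = []
box = DictationBox(fetch=lambda: "", inject=received.append)
box.open()                  # nothing selected, box starts empty
box.edit("hello world")     # dictate into the box
box.transfer()              # received now holds ["hello world"]
```

The design point is that `fetch` and `inject` are the only places application-specific knowledge lives, which is what would let an Emacs handler use keystrokes while another application's handler uses the clipboard.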

The third level of functionality requires a much deeper set of tendrils into the accessibility interfaces and their applications. It would move the same type of functionality found in the dictation box into the applications themselves. Enabling Select-and-Say in the application would be a major task, and a major benefit from a usability standpoint.

We need help. It's not easy to write code by voice at the best of times, and, quite frankly, some environments are downright hostile to speech-driven programming. We need some basic tools so that we can start bootstrapping ourselves the rest of the way up. We need someone to help fix VR mode for Emacs, because its current method of identifying windows for dictation is flawed. We need help getting characters from NaturallySpeaking in Windows or Wine over to Linux. We need help getting over the initial couple of humps.

Thanks for reading. -- Eric


kat said...

My boyfriend is an electrical engineer and is devoted to open-source software. His take on this problem is that open-source software works best on applications where a lot of medium-skilled people can put in a little bit of time.

Speech recognition software is complicated enough that it really needs a lot of very, very skilled people putting in all their time.

His opinion is that these factors are keeping the software in the hands of the corporations.

A search through "" (an open source forum) gave me links to

(for java--I split the link in two)

which seems to be linux oriented.

I'm sorry I can't help more, and I bet you've seen these....

Anonymous said...

Your boyfriend is very smart. I've often said that just because you have access to the code doesn't mean you can do anything about the problems.

Since we live by the guideline of "functionality trumps politics", we have broken the problem into "let the hard stuff reside with someone we can pay money to" and "here are the softer things we can tackle with medium-skilled people". Yes, the things I mentioned are tasks at the medium-skilled level. I believe they're no more difficult than learning how to write GUI code.

In addition to Sphinx, there are also Julius, HTK, dougout, and a bunch of others. They tackled the easy part of speech recognition (the recognizer and maybe some language modeling). If you assume it takes about $10,000,000 and four or five years to build a full-function speech recognition environment comparable to NaturallySpeaking, you would spend about $8 million to bring any of these toolkits up to that level. Maybe $12 million if the system needs to be completely replaced but people don't discover that for a while.

I do appreciate your comments and suggestions.

---eric (who doesn't need yet another bloody system ID)