The voice-recognition technology in the Phraselator is just a precursor of a much more technically ambitious initiative that the Defense Advanced Research Projects Agency (DARPA) calls the Babylon project.
The goal: true, two-way, multilingual communication. VoxTec and SRI International together make up one of four groups around the country that have received DARPA funding for Babylon project research. Through the Information Awareness Office, home of the controversial Total Information Awareness system, the military is funding research into two-way translation that focuses on "low population, high-terrorist risk languages."
It's not enough to train military and intelligence personnel in Arabic. Not only can't it be done fast enough, but in the new era of the American war on terrorism, there are too many languages, too many different theaters and too many different enemies for anything but a machine to understand and speak all the necessary tongues.
"Prior to the end of the Cold War, we had a monolith theater," says Sarich. "We had Russian and Eastern European linguists. Now, we find ourselves involved in a lot of new things where we don't speak the language, and finding reliable linguists is very difficult."
The Phraselator currently does not accept responses -- for the obvious reason that, say, Iraqi peasants will have no idea what phrases the machine has been programmed to translate. So the first step to real two-way communication is a machine that will accept and translate a very limited number of responses, such as "yes" and "no" from a respondent. Then, soldiers or doctors can be trained to ask only questions that will elicit those types of answers.
The next level of complexity: training the machine to pick out specific, important words like "doctor" or "danger" spoken by a respondent. That way, something sensible might be communicated, even if full translation can't be achieved.
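The two stages described above, a tiny closed set of answers followed by spotting a few important words, can be sketched roughly as follows. This is only an illustration, not the Phraselator's actual logic: it assumes the speech has already been turned into text, it uses English stand-ins for the respondent's language, and the names `YES_NO`, `KEYWORDS` and `interpret_response` are hypothetical.

```python
import re

# Hypothetical vocabularies (English stand-ins). A real device would hold
# these in the respondent's language and work from audio, not text.
YES_NO = {"yes": True, "no": False}
KEYWORDS = {"doctor", "danger"}

def interpret_response(transcript):
    """Classify a recognized response.

    Returns ("answer", bool) for a bare yes/no,
    ("keywords", [...]) when important words are spotted,
    or ("unknown", None) when nothing usable is found.
    """
    words = re.findall(r"[a-z']+", transcript.lower())
    # Stage 1: accept only a single-word yes/no reply.
    if len(words) == 1 and words[0] in YES_NO:
        return ("answer", YES_NO[words[0]])
    # Stage 2: pick out important words, ignoring everything else.
    spotted = [w for w in words if w in KEYWORDS]
    if spotted:
        return ("keywords", spotted)
    return ("unknown", None)
```

Anything outside the two vocabularies comes back as "unknown," which is the point of the design: the machine reports only what it can be reasonably sure of.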
The real goal of the Babylon project, though, is not limited phrase translation but so-called "free input" -- two-way translation of actual conversation. "The problem is that free input is still too ambitious for current technology today," says Franco. "The technology is not powerful enough to allow us to recognize any speech in any domain, because the error would be too high." Instead, researchers are focusing on free input in a very limited context, such as a doctor communicating with a patient.
At Carnegie Mellon University, a group of computer scientists working with three local Pittsburgh companies, Mobile Technologies, Multi Modal Technologies and Cepstral, last year created a prototype of a device called the "speechlator" that translates medical interviews from English to Egyptian Arabic and back.
The speech-recognition component translates the English to "interlingua," a computer-readable intermediate language. "That language is a mathematical language almost like a logic," says Alan Black, a research computer scientist with Carnegie Mellon.
What's the value of introducing a third language to the equation? It means that instead of building a separate translator for every pair of languages (English to Spanish, Spanish to French, French to English and so on), each language needs only to be translated into and out of interlingua. From interlingua, the translation engine generates the Arabic text, then a speech synthesizer vocalizes the Arabic.
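The arithmetic behind that argument is easy to check. The sketch below assumes one translator per direction per language pair in the direct approach, and one analyzer (into interlingua) plus one generator (out of it) per language in the interlingua approach; the function names are made up for illustration.

```python
def direct_translators(n_languages):
    """Translators needed when every ordered pair of languages gets its own."""
    return n_languages * (n_languages - 1)

def interlingua_translators(n_languages):
    """Translators needed when each language only goes into and out of interlingua."""
    return 2 * n_languages

for n in (3, 10, 30):
    print(n, direct_translators(n), interlingua_translators(n))
# 3 6 6
# 10 90 20
# 30 870 60
```

With only three languages the two approaches tie, but the direct count grows with the square of the number of languages while the interlingua count grows linearly, which matters when the list of needed tongues keeps getting longer.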
Even within the relatively constrained realm of a medical interview, there's endless variation. "We have to have linguists write rules to describe the variations that exist in the language," says Black. "'Does your X hurt?' Then, listing all the possible things that X will be."
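A rule of the kind Black describes can be pictured as a sentence template with a slot and a list of allowed fillers. The sketch below is a guess at the general shape, not the Carnegie Mellon system's actual rule format; `TEMPLATE`, `BODY_PARTS`, `expand` and `match` are hypothetical names, and the body-part list is invented for the example.

```python
import re

TEMPLATE = "Does your {X} hurt?"
BODY_PARTS = ["head", "arm", "stomach"]  # hypothetical filler list

def expand(template, fillers):
    """List every sentence the rule covers, one per filler."""
    return [template.format(X=filler) for filler in fillers]

def match(template, fillers, utterance):
    """Return the filler an utterance uses, or None if the rule doesn't apply."""
    alternatives = "(" + "|".join(re.escape(f) for f in fillers) + ")"
    # Escape the template, then swap the (escaped) slot for the alternatives.
    pattern = re.escape(template).replace(re.escape("{X}"), alternatives)
    found = re.fullmatch(pattern, utterance, flags=re.IGNORECASE)
    return found.group(1).lower() if found else None
```

Here `expand(TEMPLATE, BODY_PARTS)` lists the three questions the rule covers, and `match(TEMPLATE, BODY_PARTS, "Does your arm hurt?")` recovers `"arm"`. Every new way of phrasing the question would need a template of its own, which gives a sense of why the variation is a problem even in a narrow domain.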
The prototype hasn't yet been tested under real-world conditions; the researchers are awaiting more funding from DARPA later this year.
Of course, the usual problems of machine translation aren't the only challenges for these devices. As any user of desktop voice-recognition software knows, systems can be trained for the vagaries of a specific user's accent and inflection, improving the more they're used. But battlefield devices need to be largely user-neutral: ready for use by a number of people. The limited memory of a handheld device is also a real barrier when you're dealing with the variables of real conversation.
Although, how much memory does it really take to say "Take me to your weapons of mass destruction"?