Speech recognition technology keeps improving, but it's still hopelessly primitive compared to human capability. And even with ongoing improvements in computers, at Moore's Law rates it will likely be 30 to 50 years before we can build a machine with the computational capacity of a human brain. But there is a brute force approach, based on current computer capabilities, that could substantially improve speech recognition performance in the near future.
To understand what I'm suggesting, we first need to understand the distributed supercomputer that Google has built, and then look at the brute force approach Google is applying in a parallel field: automatic natural language translation, a.k.a. machine translation.
Google has built what is likely the largest distributed computer in the world, based on commodity PCs, commodity disks and commodity Ethernet switches. It runs a fault-tolerant file system, the Google File System (GFS), and it can be partitioned to run different applications simultaneously. Equally important, Google has developed tools that make it easy to write applications that run on this system.
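Those tools likely include the MapReduce programming model Google described publicly in 2004: a developer supplies a map function and a reduce function, and the infrastructure handles distributing the work across the cluster and recovering from failures. Here is a minimal single-machine sketch of that model in Python; the function names and the bigram-counting task are my own illustration, not Google's actual API.

```python
# A minimal, single-process sketch of the map/reduce programming model.
# In Google's infrastructure the same two user-supplied functions would
# run in parallel across thousands of machines.
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit (phrase, 1) pairs; here a "phrase" is just a word bigram.
    words = text.split()
    for i in range(len(words) - 1):
        yield (" ".join(words[i:i + 2]), 1)

def reduce_phase(phrase, counts):
    # Sum the counts emitted for each phrase.
    return phrase, sum(counts)

def run(corpus):
    intermediate = defaultdict(list)
    for doc_id, text in corpus.items():
        for key, value in map_phase(doc_id, text):
            intermediate[key].append(value)
    return dict(reduce_phase(k, v) for k, v in intermediate.items())

if __name__ == "__main__":
    corpus = {1: "the white house confirmed the existence",
              2: "the white house denied the report"}
    print(run(corpus))  # e.g. {'the white': 2, 'white house': 2, ...}
```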
Virtually every Internet user is aware of Google's search service, but not everyone is aware of the numerous other services that already utilize Google's enormous compute and storage capacity. A few weeks ago, Google showed some of the new services in development, including a machine translation (MT) service.
Google has found a Rosetta stone at the United Nations Documentation Centre, where they've apparently obtained over 200 billion words of parallel translation done by humans over the years. This provides the statistical basis for a brute force assault on machine translation. Rather than condensing the data or extracting subsets from it, they keep all of it available and apparently search the entire corpus for the longest matching phrases.
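To make the idea concrete, here is a toy sketch of longest-match phrase lookup: greedily cover the input with the longest phrases found in a stored phrase table. The table entries and the greedy strategy are illustrative assumptions on my part; a real system would extract and score many overlapping candidates drawn from the full parallel corpus.

```python
# Toy sketch of the "longest matching phrase" idea: greedily cover the
# input with the longest phrases present in a parallel-corpus-derived
# phrase table. The tiny table below is hypothetical.
PHRASE_TABLE = {  # source phrase -> target phrase
    "the white house": "la maison blanche",
    "white house": "maison blanche",
    "confirmed": "a confirmé",
    "the": "le",
}

def translate(sentence, table):
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in table:
                out.append(table[phrase])
                i = j
                break
        else:
            out.append(words[i])  # unknown words pass through untranslated
            i += 1
    return " ".join(out)

print(translate("The white house confirmed the report", PHRASE_TABLE))
# -> "la maison blanche a confirmé le report"
```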
Their demonstration compared existing machine translation systems with the new Google system on one specific Arabic sentence. An existing system's translation:
"Alpine white new presence tape registered for coffee confirms Laden."
The Google translation:
"The White House confirmed the existence of a new Bin Laden tape."
Pretty spectacular. Of course, demos are demos and they likely picked a good example for their presentation. We can only speculate when we will have a chance to beta test the Google system and see what it really does. But in parallel, the speech community should be thinking about what a similar approach could do for speech recognition.
Instead of averaging samples from many different speakers to obtain a single combined set of acoustic models, combined word models and a combined language model, could we keep the parametric representation of the entire audio training corpus on-line and search that entire database in real time?
If the answer is yes, think what might result. It's not unreasonable to expect that, given an acoustic sample of unknown origin, the system could come back with, not only the text representation of what was said, but also the fact that it was spoken in English by someone who learned their English in India and subsequently lived in the US for many years.
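Here is a minimal sketch of what such a system could look like, assuming utterances are stored as sequences of MFCC-like feature vectors, compared by dynamic time warping (DTW), and scanned by brute force. The corpus, the speaker metadata, and the feature dimensions below are invented for illustration; they stand in for a store of billions of real utterances.

```python
# Hedged sketch of the brute force idea: keep every training utterance's
# parametric representation on-line, and recognize an unknown sample by
# searching the whole store for the closest match. The nearest neighbor
# carries both its transcript and its speaker metadata.
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two (frames x features) arrays,
    handling utterances of different lengths."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(sample, corpus):
    """Brute-force scan of the entire stored corpus."""
    best = min(corpus, key=lambda e: dtw_distance(sample, e["features"]))
    return best["text"], best["speaker"]

# Hypothetical stored corpus; real entries would be MFCC-like frames.
rng = np.random.default_rng(0)
corpus = [
    {"features": rng.normal(size=(40, 13)), "text": "hello world",
     "speaker": "English, learned in India, long US residence"},
    {"features": rng.normal(size=(55, 13)), "text": "good morning",
     "speaker": "English, native US speaker"},
]
text, speaker = recognize(rng.normal(size=(42, 13)), corpus)
print(text, "|", speaker)
```

A scan like this over a massive utterance store is embarrassingly parallel, which is exactly the kind of workload a cluster like Google's is built for.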
The computational capacity and storage capacity to support this probably exist within Google today and, given Moore's Law, will be available to all of us long before we have computers with the capacity of a human brain. Is anyone aware of a speech researcher taking such a brute force approach?
Note: I haven't addressed the issue of obtaining the large reference data set -- the equivalent of Google's UN documents -- but I have several ideas, which I'll cover in a future post.