Speech recognition technology keeps improving, but it’s still hopelessly primitive compared to the capability of a human. Even with ongoing improvements in computers at Moore’s Law rates, it’s likely to be 30-50 years before we can build a machine with the computational capacity of a human brain. But there is a brute force approach, based on current computer capabilities, which could substantially improve speech recognition performance in the near future.
To understand what I’m suggesting, we first need to understand the distributed super-computer that Google has built, and look at brute force approaches Google is applying in a parallel field, automatic natural language translation, a.k.a. machine translation.
Google has built what is likely the largest distributed computer in the world, based on commodity PCs, commodity disks and commodity Ethernet switches. It runs a fault-tolerant file system, the Google File System (GFS), and it can be partitioned to run different applications simultaneously. Equally important, Google has developed tools that make it easier to develop applications for the system.
Virtually every Internet user is aware of Google’s search service, but not everyone is aware of the numerous other services that already utilize Google’s enormous computer and storage capacity. A few weeks ago, Google showed some of the new services in development, including a machine translation (MT) service.
Google has found a Rosetta Stone at the United Nations Documentation Centre, where they've apparently obtained over 200 billion words of parallel translations done by humans over the years. This provides the statistical basis for a brute force assault on machine translation. They don't condense or extract subsets of this data. Instead they keep all of it available and apparently search all of it for the longest matching phrases.
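To make "search for the longest matching phrases" concrete, here is a minimal sketch in Python. It is not Google's method, which I haven't seen described in detail. The phrase table, its Spanish entries and the function name are all invented for illustration, and a real system would search billions of words and weigh competing matches statistically rather than greedily taking the first one.

```python
# Toy phrase table standing in for a huge human-translated parallel corpus.
# Every entry here is invented for illustration only.
PHRASE_TABLE = {
    ("la", "casa", "blanca"): "the white house",
    ("la", "casa"): "the house",
    ("blanca",): "white",
    ("confirma",): "confirms",
}


def translate_longest_match(words, table):
    """Greedy left-to-right translation that prefers the longest known phrase."""
    max_len = max(len(key) for key in table)
    output = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then progressively shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in table:
                output.append(table[key])
                i += n
                break
        else:
            # No phrase of any length matched: pass the word through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)


result = translate_longest_match("la casa blanca confirma".split(), PHRASE_TABLE)
print(result)
```

Because the three-word phrase is tried before the two-word one, "la casa blanca" comes out as "the white house" rather than "the house white". That is the whole appeal of matching long phrases: the longer the match, the more context comes along for free.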
Their demonstration compared existing machine translation systems with the new Google system on one specific phrase in Arabic. The existing translation:
"Alpine white new presence tape registered for coffee confirms Laden."
The Google translation:
"The White House confirmed the existence of a new Bin Laden tape."
Pretty spectacular. Of course, demos are demos and they likely picked a good example for their presentation. We can only speculate about when we will have a chance to beta test the Google system and see what it really does. But in parallel, the speech community should be thinking about what a similar approach could do for speech recognition.
Instead of averaging samples from many different speakers to obtain a single combined set of acoustic models, combined word models and a combined language model, could we keep the parametric representation of the entire audio training corpus on-line and search that entire database in real time?
If the answer is yes, think what might result. It's not unreasonable to expect that, given an acoustic sample of unknown origin, the system could come back with, not only the text representation of what was said, but also the fact that it was spoken in English by someone who learned their English in India and subsequently lived in the US for many years.
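As a rough illustration of what "search the entire database" could mean, here is a minimal Python sketch of a nearest-neighbor lookup over stored samples. Everything in it is an assumption made for the example: the three-number feature vectors, the corpus entries, the speaker labels and the function name are invented, and a plain Euclidean distance over a short list stands in for whatever parametric representation and search a real system would need.

```python
import math

# Hypothetical stored corpus. Each entry keeps the original sample's
# features, its transcript and what is known about the speaker, rather
# than being averaged into a single combined model.
CORPUS = [
    ((0.9, 0.1, 0.3), "hello", "English, learned in India"),
    ((0.2, 0.8, 0.5), "hello", "English, learned in Scotland"),
    ((0.1, 0.2, 0.9), "goodbye", "English, learned in India"),
]


def nearest_sample(query, corpus):
    """Return the stored entry whose features are closest to the query."""
    return min(corpus, key=lambda entry: math.dist(query, entry[0]))


# An unknown acoustic sample, already converted to the same feature space.
features, text, background = nearest_sample((0.85, 0.15, 0.25), CORPUS)
print(text, "|", background)
```

The point of the sketch is that the match carries the stored speaker information along with the text, which an averaged model would have thrown away.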
The computational capacity and storage capacity to support this probably exist within Google today and, given Moore's Law, will be available to all of us long before we have computers with the capacity of a human brain. Is anyone aware of speech researchers taking such a brute force approach?
Note: I haven't addressed the issue of obtaining the large reference data set -- the equivalent of Google's UN documents -- but I have several ideas which I'll address in a future post.
I think you are hinting at the next level of computer pattern matching, and that is understanding the context or meaning of the whole. Google is a unique company in that they have the largest data sets to pull from and cross-reference. The meaning of an object comes from an understanding of the background, individual and environment in which it is used. The larger the data set to pull from, the more likely the correct pattern is to emerge. A trend of understanding, if you will. Not to say we as humans, based on our background, always derive the correct meaning of something; just ask my wife.
Posted by: Dustin Wish | June 06, 2005 at 02:30 PM
Update:
I finally wrote up my ideas on how we might accumulate some large public speech corpora. They're in blog entries for August 2nd and 3rd.
http://blogs.nmss.com/communications/2005/08/my_presentation.html
http://blogs.nmss.com/communications/2005/08/large_speech_co.html
Posted by: Brough | August 03, 2005 at 05:26 PM
I am very interested in speech recognition as it applies to the transcription of audio stories. I can see a time in the future when we can phone in a story, idea or comment and have both the audio and text available for anyone interested. Is someone out there working in this area? Please contact me.
Posted by: Barry Brilliant | April 21, 2006 at 07:29 AM