As promised, here's what I attempted to convey in my presentation at SpeechTEK yesterday. The PowerPoint should be available soon; however, on Susan Berkley's advice, I didn't actually present from my (mostly text) slides but instead just talked.
The speech industry has had 20-25 years of continuous improvement and, throughout the '80s and '90s, a continuous stream of new start-ups bringing the latest from academia to the "real world." In recent years, that's changed: the dominant trend has been business consolidation. Have we run out of interesting new things (that are also useful)? Maybe in the short term, but, I argue, not in the longer term (say 2-5 years)!
I cited trends in mobile communications, silicon evolution and the Internet that suggest we're poised for a new round of progress. These trends have three main impacts, which I treated in order from moderately significant to very significant.
The first has to do with speech recognition over the telephone, where there are two encouraging trends. First, the advent of Skype is making people aware of wideband audio. As most of my audience was already aware, traditional telephone speech is missing the high frequencies, making it hard for anyone, human or machine, to tell the difference between a spoken "c", "z" or "e". Skype is showing the world that telephony doesn't have to sound bad: with audio over IP, telephony can be HiFi. Of course a telephony revolution may take a decade or more, but awareness is the first step.
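To see why the missing high frequencies matter, here's a tiny sketch (Python, using NumPy and SciPy) of what the traditional phone channel does: band-limit a signal to roughly the 300-3400 Hz passband and watch the high-frequency energy disappear. The 6 kHz tone standing in for fricative energy is my simplification, not real speech:

```python
# Why narrowband telephony blurs "c" vs. "z" vs. "e": the consonant cues
# live largely above the ~3.4 kHz ceiling of the traditional phone channel.
import numpy as np
from scipy.signal import butter, sosfilt

rate = 16000                                  # wideband, as in VoIP audio
t = np.arange(rate) / rate
# A 6 kHz tone standing in for the frication energy in an "s"/"z" sound.
signal = np.sin(2 * np.pi * 6000 * t)

# Traditional telephony passband: roughly 300-3400 Hz.
sos = butter(8, [300, 3400], btype="bandpass", fs=rate, output="sos")
narrowband = sosfilt(sos, signal)

print(f"wideband RMS:   {np.sqrt(np.mean(signal**2)):.3f}")   # ~0.707
print(f"narrowband RMS: {np.sqrt(np.mean(narrowband**2)):.3f}")  # ~0
```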
A trend with the potential for more immediate impact is the advent of mobile handsets that support simultaneous voice and data. Until recently, mobile handsets that supported data forced you to choose either voice or data -- you couldn't use your data connection while you were talking. Now that simultaneous voice and data is becoming possible, we have the opportunity to deploy the Aurora Distributed Speech Recognition (DSR) technology that was standardized by ETSI some years ago. If you are not familiar with DSR, it provides a way to extract the acoustic parameters needed for speech recognition using software on a mobile handset and then send those parameters to a recognition server over a data path independent of the normal voice path. This optimizes for battery life on the handset while avoiding the speech coding degradations imposed by normal mobile phone technology. Up to this point, DSR has not been widely deployed because there was no way to include the DSR data in a normal mobile phone call. With emerging mobile phones (supporting simultaneous voice & data) we have the potential to include DSR (over mobile data) with any normal voice call. This approach is possible today and will be increasingly viable as these new mobile phones are deployed over the next few years.
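If you're curious what "acoustic parameters" means in practice, here's a rough sketch of the kind of front end involved -- log mel-filterbank features computed frame by frame on the handset, so only a few dozen numbers per 10 ms need to cross the data channel. This illustrates the general idea; it is not the actual ETSI Aurora algorithm, and the parameter choices are mine:

```python
# Sketch of a DSR-style front end: turn raw audio into compact acoustic
# features on the device, then ship only the features to the server.
import numpy as np

def mel_filterbank_features(samples, sample_rate=8000, frame_ms=25,
                            step_ms=10, n_filters=23, n_fft=256):
    """Return one small feature vector per 10 ms frame of audio."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    # Triangular filters spaced evenly on the mel (perceptual) scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_points = np.linspace(mel(0), mel(sample_rate / 2), n_filters + 2)
    hz_points = 700.0 * (10 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    features = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        features.append(np.log(fbank @ power + 1e-10))
    return np.array(features)  # ~23 floats per 10 ms vs. 80 PCM samples

# One second of 8 kHz audio -> a 98 x 23 feature matrix to send upstream.
audio = np.random.randn(8000)
print(mel_filterbank_features(audio).shape)
```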
My second major point had to do with algorithms. The potentially disruptive trends here are the emergence of multiple CPU cores per chip and the development of supercomputers built from hundreds or thousands of commodity computers. For 30 years the speech recognition industry has leveraged increasing CPU clock speeds. Yes, we've also gone from 16-bit to 32-bit to 64-bit CPUs, but the dominant trend has been increasing clock speed. However, clock-speed increases will play a smaller role in the future as silicon evolution becomes dominated by multi-core approaches. Over the next 3-5 years, we'll see 2, 4, 8 and even 16 Intel CPU cores per chip, with only modest increases in clock speeds.
Going forward, the speech industry can take advantage of these trends but, at a minimum, we need to rewrite existing software to leverage parallel processing. Ideally, this transition will foster significant new algorithmic approaches whether they are relatively specific changes like those suggested by Shinozaki & Furui or the outcome of major research efforts like those suggested by Jim Baker under the rubric "Extreme Speech Recognition".
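To illustrate the kind of restructuring involved, here's a toy sketch that splits one classic inner loop of a recognizer -- scoring feature frames against Gaussian densities -- across multiple cores. The model sizes are invented and the math is simplified (unnormalized diagonal-Gaussian log-likelihoods); this shows the shape of the change, not a recipe:

```python
# The same acoustic scoring work, split across however many cores the
# chip provides instead of relying on a faster clock.
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def score_chunk(frames, means, inv_vars):
    """Diagonal-Gaussian log-likelihoods (up to a constant) for a chunk."""
    # frames: (n, d); means/inv_vars: (m, d) -> scores of shape (n, m)
    diff = frames[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(diff * diff * inv_vars[None, :, :], axis=2)

def parallel_score(frames, means, inv_vars, workers=4):
    """Score frame chunks on separate cores, then stitch results together."""
    chunks = np.array_split(frames, workers)
    fn = partial(score_chunk, means=means, inv_vars=inv_vars)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return np.vstack(list(pool.map(fn, chunks)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((2000, 39))   # 20 s of 39-dim features
    means = rng.standard_normal((512, 39))     # 512 Gaussian densities
    inv_vars = np.ones((512, 39))
    print(parallel_score(frames, means, inv_vars).shape)  # (2000, 512)
```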
My third and final point emerges from community projects and social software efforts which leverage the extremely low transaction costs that are possible with the Internet. I referred to phenomena like open source software, Wikipedia and the reviews posted on Amazon. In another vein, Google mines information (web links) that hundreds of millions of web sites have posted for their own purposes to compute the value (the "page rank") of pages on the Internet.
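For the curious, the core idea is simple enough to sketch: a power iteration over the link graph, in the spirit of the published PageRank formulation. The toy graph here is mine, not anything like Google's actual implementation:

```python
# PageRank in miniature: each page's value derives from links that other
# sites posted for their own reasons. links[i] lists the pages i links to.
import numpy as np

def pagerank(links, damping=0.85, iters=50):
    n = len(links)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1.0 - damping) / n)
        for i, outs in enumerate(links):
            if outs:                       # share rank across out-links
                for j in outs:
                    new[j] += damping * rank[i] / len(outs)
            else:                          # dangling page: share with all
                new += damping * rank[i] / n
        rank = new
    return rank

# Page 2 is linked to by both other pages, so it ends up ranked highest.
print(pagerank([[1, 2], [2], [0]]))
```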
Aside: In 1937 the economist Ronald Coase (rhymes with hose) explained the emergence of firms -- corporations and the like that aggregate services "in house" in order to reduce the transaction costs of acquiring similar services in the marketplace. In the emerging studies of the Internet's impact on social and business structures, there is a very interesting paper by Yochai Benkler entitled "Coase's Penguin" in the Yale Law Journal. The "Penguin" in this case is the Linux mascot: Benkler examines the nature of cooperative projects that leverage the Internet to drive transaction costs so low that non-monetary issues predominate.
How does this relate to speech recognition? Further progress in speech requires access to large speech corpora. Today, there are private organizations that have audio data they can't share for privacy reasons or won't share for business reasons. Publicly available speech corpora include thousands of hours of speech -- perhaps ten or twenty thousand -- but not the millions or tens of millions of hours, in multiple languages, that will be needed to leverage the massively parallel speech engines of the future. But we could obtain tens of millions of hours of public, annotated speech data if we think about the problem in new ways. I gave one possible example...
Some of you may be familiar with Flickr, a website that hosts photographs for people. While you can make your postings private, more than 80% of the posted photographs are made public. Consider an equivalent web site for camcorder videos... There may be fewer camcorders than cameras, but there are still tens of millions of camcorders in use, and with each video there is a sound track -- typically people speaking. Suppose you provided machine transcriptions for the audio associated with all these videos. That could be a benefit to the amateur who made the video if it improved their ability to search within the video. If you also provided a really simple user interface, you could get users to flag transcription errors and, in many cases, correct them. What if we matched the Flickr growth rate? (The following is from Google Answers):
    A June, 2005 news report citing a “company spokesman” states that Flickr has 775,000 registered users and 19.5 million photos and a 30 percent monthly growth rate.

    The month before Yahoo! acquired Flickr, Stewart Butterfield, Flickr’s CEO, stated in an interview that Flickr had 270,000 users and 4...
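To put that 30 percent monthly figure in perspective, here's the compounding arithmetic, applied (purely hypothetically) to hours of user-corrected speech rather than photos. The 1,000-hour starting point is my assumption, chosen only for illustration:

```python
# What 30% month-over-month growth compounds to over three years.
hours = 1_000.0                      # hypothetical starting corpus
for month in range(1, 37):
    hours *= 1.30                    # 30% monthly growth
    if month % 12 == 0:
        print(f"year {month // 12}: {hours:,.0f} hours")
# -> roughly 23 thousand, 543 thousand, and 12.6 million hours
```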
Even without that kind of growth rate, it's not unreasonable to think of acquiring millions or tens of millions of hours of speech recordings, with user-corrected machine transcriptions, over a period of a few years. What's needed is a new way of thinking about the issue -- one that draws on emerging trends in community projects facilitated by near-zero transaction costs.
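To make the correction loop concrete, here's a minimal sketch of the data model such a site might keep: machine transcript segments that users can flag or fix, yielding verified audio/text pairs for corpus building. All the names and types here are hypothetical:

```python
# A sketch of machine transcripts with a one-click user correction path.
from dataclasses import dataclass, field

@dataclass
class Segment:
    start_s: float
    end_s: float
    machine_text: str
    user_text: str | None = None      # set when a user corrects it
    flagged: bool = False

@dataclass
class VideoTranscript:
    video_id: str
    segments: list[Segment] = field(default_factory=list)

    def correct(self, idx: int, text: str) -> None:
        """User supplies the right words -> a verified training sample."""
        self.segments[idx].user_text = text

    def flag(self, idx: int) -> None:
        """User marks a segment wrong without fixing it."""
        self.segments[idx].flagged = True

    def verified_pairs(self):
        """(audio span, trusted text) pairs usable as corpus data."""
        return [(s.start_s, s.end_s, s.user_text or s.machine_text)
                for s in self.segments if not s.flagged]

vt = VideoTranscript("clip42", [Segment(0.0, 2.5, "hello wold")])
vt.correct(0, "hello world")
print(vt.verified_pairs())   # [(0.0, 2.5, 'hello world')]
```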
Finally, I closed with a variant of my usual upbeat view of the future: the spread of mobile phones and the Internet is having a dramatic positive impact on mankind; speech is the most natural user interface; and growth in the underlying technologies (per Moore's Law) supports continued improvement in speech recognition performance. So speech technology will remain an exciting field to work in -- one that will undoubtedly generate multiple new rounds of excitement (and new companies) in the years ahead.