I didn’t attend SpeechTek this year and I haven’t had much involvement with speech technology in the past 12 months, but this post from Google Research in August 2006 brings me back to the topic I was discussing in August 2005. Google is offering researchers some interesting data:
We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.
Over one trillion words of running text! Of course, this is a text corpus: useful to an automatic speech recognizer as a ranked set of likely word sequences, but only to the extent that people’s speech patterns match our written language…
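To make that point concrete, here is a minimal sketch, in Python, of how raw 5-gram counts could be used to rank candidate transcriptions the way a recognizer’s language model would. The file name is hypothetical and the tab-separated “n-gram, count” layout is my assumption about how the released data would be packaged:

```python
from collections import defaultdict

def load_counts(path):
    """Load 5-gram counts from tab-separated 'w1 w2 w3 w4 w5<TAB>count' lines."""
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").rsplit("\t", 1)
            counts[tuple(ngram.split())] = int(count)
    return counts

def sequence_score(words, counts):
    """Score a word sequence by summing the counts of its 5-grams.
    Crude on purpose: a real recognizer would smooth these counts
    into log-probabilities rather than sum raw frequencies."""
    return sum(counts[tuple(words[i:i + 5])] for i in range(len(words) - 4))

# Rank two candidate transcriptions, preferring the word sequence
# that occurs more often in the written-text corpus.
counts = load_counts("5gram_counts.txt")  # hypothetical file name
candidates = ["recognize speech with a corpus".split(),
              "wreck a nice beach with a corpus".split()]
best = max(candidates, key=lambda c: sequence_score(c, counts))
```

The point is only that frequency in written text stands in for likelihood in speech, which is exactly where the mismatch with spoken language creeps in.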
What’s really needed is a large body of annotated speech recordings, millions of hours of it. In my presentation at SpeechTek last year, I suggested one way such a speech corpus could be accumulated: on a free video storage site, provide machine transcriptions of the audio associated with user-created videos. Offer users the ability to search for specific events in their videos based on words in the soundtrack, and also offer them a simple interface to correct the machine transcription. Useful search is the incentive; a human-verified transcription is the byproduct. I recognize these annotations would not be "professional", but it’s more productive for researchers to focus on automating quality measures for millions of amateur annotations than to produce another 1,000 hours of professional transcription.
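As one sketch of what “automating quality measures” could mean (the function names and the agreement threshold below are my own assumptions, not part of the proposal): treat independent amateur corrections of the same clip as cross-checks, and accept a transcription for the corpus only when two users largely agree.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (wa != wb)))    # substitution
        prev = cur
    return prev[-1]

def agreement(t1, t2):
    """Normalized agreement between two transcriptions of the same clip:
    1.0 means identical word sequences, 0.0 means entirely different."""
    w1, w2 = t1.lower().split(), t2.lower().split()
    if not w1 and not w2:
        return 1.0
    return 1.0 - word_edit_distance(w1, w2) / max(len(w1), len(w2))

def accept(corrections, threshold=0.9):
    """Keep a clip's transcription only if two independent amateur
    corrections agree closely. The 0.9 threshold is an assumption."""
    return (len(corrections) >= 2 and
            agreement(corrections[0], corrections[1]) >= threshold)
```

With millions of clips, even a blunt filter like this discards the sloppy corrections cheaply, which is the economic argument for amateur annotation over another round of professional transcription.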
When I prepared my 2005 presentation, I hadn’t heard of YouTube or Google Video, two sites that appear to match exactly what’s needed. I doubt YouTube has the resources or can spare the attention to tackle this, but Google has both the resources and the researchers; after all, they did attract Kai-Fu Lee last year.