In my presentation on Monday I suggested we could build large (millions of hours) public speech corpora if we thought of the problem from an Internet and community software point of view. I explained this with allusions to Google's use of links to compute PageRank, Amazon's success with volunteer reviewers and Wikipedia's community-created encyclopedia. I went so far as to suggest one specific approach based on videos from camcorder owners. Look back at my earlier blog entry for details.
Last night I had a very enjoyable discussion with Skip Cave, Chief Scientist at Intervoice. Skip had missed my presentation on Monday, so at one point I ran over what I'd said. When I described my camcorder approach, Skip immediately suggested a podcasting approach might be an even faster way to generate a large, accurately transcribed speech corpus.
Skip suggested offering free services for podcasters, including the ability to have a machine transcription generated for each audio file in their podcasting feed. Combine the transcriptions with a wiki-like user interface so the podcaster, or any listener/reader who views the transcription, can easily flag, and optionally correct, any errors in the machine transcription. Having a machine transcription would be a major benefit to podcasters and their audience, as it would make it much quicker to locate specific subjects in an audio file. And given an easy way to correct transcription errors, the podcaster and/or their audience would likely do the necessary editing. If the user interface software noted how much of the file was examined by anyone who was motivated enough to make a correction, it would be possible to flag which machine-transcribed content had been "proofread". As an extra safeguard, one could require that transcriptions be checked by several different users before being judged correct.
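To make the bookkeeping concrete, here is a minimal Python sketch of how such a service might track which parts of a transcription have been "proofread". Everything in it is my own assumption for illustration, not something Skip specified: the `Transcript` class, its method names, the idea of splitting a transcription into segments, and the default of two checkers. It counts a user's examination only if that user has made at least one correction somewhere in the file, and it resets a segment's checks whenever its text changes, since the earlier checks applied to the old wording.

```python
class Transcript:
    """Hypothetical tracker for one machine-transcribed audio file."""

    def __init__(self, machine_segments, required_checks=2):
        self.text = list(machine_segments)              # current wording per segment
        self.checked_by = [set() for _ in self.text]    # users who examined each segment
        self.correctors = set()                         # users who made any correction
        self.required_checks = required_checks          # assumed threshold

    def record_examined(self, user, first, last):
        """Note that `user` examined segments first..last (inclusive)."""
        for i in range(first, last + 1):
            self.checked_by[i].add(user)

    def correct(self, user, index, new_text):
        """Apply a correction; earlier checks of this segment no longer count."""
        self.text[index] = new_text
        self.checked_by[index] = {user}
        self.correctors.add(user)

    def is_proofread(self, index):
        """True once enough distinct correcting users have examined the segment."""
        trusted = self.checked_by[index] & self.correctors
        return len(trusted) >= self.required_checks

    def proofread_fraction(self):
        """Share of segments currently judged correct."""
        if not self.text:
            return 0.0
        done = sum(self.is_proofread(i) for i in range(len(self.text)))
        return done / len(self.text)


# Example: two listeners each read the whole file and fix one error.
t = Transcript(["the cat sat", "on the matt", "and purred"])
t.record_examined("alice", 0, 2)
t.correct("alice", 1, "on the mat")
t.record_examined("bob", 0, 2)
t.correct("bob", 2, "and purred loudly")
```

In this example the first two segments end up checked by both listeners, while the third has only Bob's check because his correction changed its text. A real service would also need to decide how to weight flags versus corrections, which this sketch leaves out.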
While there are fewer podcasters than camcorder users today, the podcasting community is growing far more rapidly. Podcasters and their audiences are also a lot more Internet-savvy, so I think Skip is onto something. If we're seeking millions of hours of correctly transcribed speech, enlisting help from podcasters and their audiences could get us there more rapidly than working with amateur videos taken by camcorder owners.
In any event, what's important is the concept. Think of interesting community projects where a large, evolving speech corpus is a byproduct of something else that participants value. Your suggestions are encouraged. Use the comment form below or take your idea to the venue of your choice.