In my presentation on Monday I suggested we could build large (millions of hours) public speech corpora if we thought of the problem from an Internet and community software point of view. I explained this with allusions to Google's use of links to compute PageRank, Amazon's success with volunteer reviewers and Wikipedia's community-created encyclopedia. And I went so far as to suggest one specific approach based on videos from camcorder owners. Look back at my earlier blog entry for details.
Last night I had a very enjoyable discussion with Skip Cave, Chief Scientist at Intervoice. Skip had missed my presentation on Monday, so at one point I ran over what I'd said. When I described my camcorder approach, Skip immediately suggested a podcasting approach might be an even faster way to generate a large, accurately transcribed speech corpus.
Skip suggested offering free services for podcasters, including the ability to have a machine transcription generated for each audio file in their podcasting feed. Combine the transcriptions with a wiki-like user interface so the podcaster, or any listener/reader who views the transcription, can easily flag, and optionally correct, any errors in the machine transcription. Having a machine transcription would be a major benefit to podcasters and their audience, as it would make it easy to locate specific subjects in an audio file. And given an easy way to correct transcription errors, the podcaster and/or their audience would likely do the necessary editing. If the user interface software noted how much of the file was examined by anyone who was motivated enough to make a correction, it would be possible to flag which machine-transcribed content had been "proofread". As an extra safety, one could require transcriptions be checked by several different users before being judged correct.
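As a rough illustration of the bookkeeping such an interface might do, here is a minimal Python sketch. Everything in it is an assumption made for the example: the class name TranscriptReview, the idea of splitting a transcript into short segments, and the threshold of three reviewers (the idea above only says "several"). It is not a description of any existing service.

```python
from collections import defaultdict

# Assumed threshold; the proposal only says "several different users".
REQUIRED_REVIEWERS = 3


class TranscriptReview:
    """Tracks which segments of a machine transcript users have checked."""

    def __init__(self, segments):
        # Machine-transcribed text, split into short segments (hypothetical layout).
        self.text = list(segments)
        # Segment index -> set of user ids who have accepted the current text.
        self.reviewers = defaultdict(set)

    def mark_viewed(self, user, first, last):
        """Record that `user` examined segments first..last and left them as-is."""
        for index in range(first, last + 1):
            self.reviewers[index].add(user)

    def correct(self, user, index, new_text):
        """Apply a correction; approvals of the old text no longer count."""
        if new_text != self.text[index]:
            self.text[index] = new_text
            self.reviewers[index] = {user}
        else:
            self.reviewers[index].add(user)

    def is_proofread(self, index):
        """A segment counts as proofread once enough distinct users accept it."""
        return len(self.reviewers[index]) >= REQUIRED_REVIEWERS

    def proofread_fraction(self):
        """Fraction of segments that have reached the reviewer threshold."""
        if not self.text:
            return 0.0
        done = sum(1 for i in range(len(self.text)) if self.is_proofread(i))
        return done / len(self.text)
```

Resetting the reviewer set on every correction is one possible design choice: it means a segment only enters the corpus after several people have seen the same final wording.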
While there are fewer podcasters than camcorder users today, the podcasting community is growing far more rapidly. Also podcasters and their audiences are a lot more Internet-savvy, so Skip is onto something. If we're seeking millions of hours of correctly transcribed speech, enlisting help from podcasters and their audiences could get us there more rapidly than working with amateur videos taken by camcorder owners.
In any event, what's important is the concept. Think of interesting community projects where a large, evolving speech corpus is a byproduct of something else that participants value. Your suggestions are encouraged. Use the comment form below or take your idea to the venue of your choice.
Hi Brough,
Excellent insights!
However, one thing to note is that with podcasts, people generally use some sort of lossy compression (e.g. MP3) for their audio files. Current Free or Open Source Speech Recognition Engines ("SRE"s) need acoustic models trained on uncompressed (e.g. WAV) or lossless compressed (e.g. FLAC) audio. If you train an acoustic model with MP3 audio, the SRE will (in theory at least ...) work reasonably well to recognize other MP3 speech audio, but will not work as well with telephony-based speech. You would need to get the podcast audio *before* it is encoded as MP3 to get good speech audio for training Acoustic Models.
In addition, SREs need training speech recorded at the same sample rate and sample resolution (bits per sample) as the speech to be recognized. For example, if you got lots of uncompressed podcast audio recorded at a 32kHz sample rate and at 16 bits per sample, and trained an acoustic model with it, you could not then turn around and try to recognize telephony speech, which uses an 8kHz sampling rate at 8 bits per sample. A way around this is to downsample your speech audio to the rate of the speech to be recognized, and then train your Acoustic Models on this downsampled audio.
This is the approach we are using at VoxForge (www.voxforge.org). Although we are currently asking users to submit 'read speech' from prompts from our site, I agree that uncompressed, or lossless compressed, podcast speech audio might be an excellent source of transcribed speech.
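To make the downsampling step concrete, here is a minimal pure-Python sketch that low-pass filters a signal and keeps every Nth sample, for example going from 32kHz to 8kHz. It is only an illustration and not VoxForge's actual tooling: the function names, the 63-tap Hamming-windowed sinc filter and the cutoff choice are all assumptions, and in practice an established audio tool or DSP library would be the better option.

```python
import math


def lowpass_taps(cutoff, num_taps=63):
    """Hamming-windowed sinc low-pass filter.

    `cutoff` is a fraction of the input sample rate (0 < cutoff < 0.5).
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at 0 Hz


def downsample(samples, src_rate, dst_rate, num_taps=63):
    """Reduce the sample rate by an integer factor (e.g. 32000 -> 8000)."""
    if src_rate % dst_rate != 0:
        raise ValueError("this sketch only handles integer rate ratios")
    factor = src_rate // dst_rate
    # Cut off a little below the new Nyquist frequency to limit aliasing.
    taps = lowpass_taps(0.45 / factor, num_taps)
    half = num_taps // 2
    out = []
    for center in range(0, len(samples), factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = center + k - half
            if 0 <= idx < len(samples):  # treat samples past the edges as silence
                acc += tap * samples[idx]
        out.append(acc)
    return out
```

The filtering matters: simply dropping three samples out of four would fold any content above 4kHz back into the speech band as aliasing.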
all the best,
Ken MacLean
Posted by: Ken MacLean | February 05, 2007 at 09:27 PM
Thanks for the comment, and for the pointer to VoxForge.
I admit I'm not current, but I did a lot of audio compression DSP code in the 80s, including downsampling and resampling, so I guess I was blithely assuming that one would gather the best audio one could get and then resample it and/or band-limit it to match the kind of audio one was trying to recognize. In the specific case of telephony speech (G.711), I'd assume the MP3 in most podcasts is vastly better than G.711 and therefore one could use MP3 podcast audio to train a telephony recognizer. But I've never done this and I have been out of the field for more than a decade.
Do you have (or know anyone who has) experience (pro or con) in resampling MP3 audio to train a telephony recognizer?
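For anyone who wants to try such an experiment, one ingredient would be pushing higher-quality audio through something resembling telephone quantization after it has been resampled to 8kHz. The Python sketch below uses the continuous mu-law companding formula with mu = 255 and 8-bit codes. It is an approximation for illustration only: real G.711 uses a segmented, bit-exact encoding, and the function names here are made up for the example.

```python
import math

MU = 255  # companding parameter for mu-law


def mulaw_encode(x):
    """Map a sample in [-1.0, 1.0] to an 8-bit code (0..255), continuous mu-law."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * 255))


def mulaw_decode(code):
    """Inverse of mulaw_encode: 8-bit code back to a sample in [-1.0, 1.0]."""
    y = code / 255 * 2 - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def simulate_telephone(samples):
    """Round-trip samples through 8-bit mu-law to mimic telephony quantization."""
    return [mulaw_decode(mulaw_encode(s)) for s in samples]
```

Because the companding curve is logarithmic, quiet samples are quantized much more finely than loud ones, which is why 8 bits per sample is workable for speech at all.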
Posted by: brough | February 06, 2007 at 10:13 AM
Sorry, I don't have any personal experience in resampling MP3 audio to train Acoustic Models ('AMs') for a telephony recognizer, nor do I know of anyone who has tried.
It may be that even though MP3's lossy compression is not the best source of audio for the creation of AMs, it is 'good enough' for recognizing low bandwidth audio in telephony environments. You can get pretty high sampling rates and bit resolutions in MP3 files (with proportionally larger file sizes) that may be enough to overcome the lossiness of MP3 audio.
However, speech recognition in low bandwidth environments such as telephony is difficult enough as it is with good quality audio. Even with good quality audio, you still need (at the very minimum) over 100 hours of speech to create a somewhat reasonable AM (commercial AMs use hundreds of hours of speech). And remember, training an Acoustic Model with a large audio set can take many hours or even days!
To experiment with MP3 audio to see if it would be suitable for telephony recognition might be more work than it is worth, given the current state of Free and Open Source Speech Recognition Engines (such as Sphinx, Julius, ISIP and HTK), especially if other good quality audio sources can be found (see this link: http://www.dev.voxforge.org/wiki/AudioSources ).
It may be that developers of Free and Open Source Speech Recognition Engines need to rethink the way they train AMs so that they can train using audio from the web's vast repository of non-optimal audio sources like podcasts. Might be a good PhD research project for someone...
all the best,
Ken
--
http://www.voxforge.org
Posted by: Ken MacLean | February 06, 2007 at 11:53 AM
Further on your question about the use of MP3 audio in the creation of Acoustic Models (AM) for Speech Recognition, David Gelbart (who knows a lot more about speech recognition than I do) has a post on the VoxForge site where he mentions that Udhyakumar Nallasamy has done some experiments on the effect of MP3 coding on speech recognition.
So I think it was premature of me to discount the use of MP3 audio for AM training. At VoxForge, we will likely keep working with uncompressed (or lossless compressed) audio for the foreseeable future, but when that well dries up, we'll likely start looking at other non-traditional audio sources, such as MP3 podcasts.
Keep up the great prognosticating!
Ken
* David Gelbart: http://www.icsi.berkeley.edu/~gelbart
* His post: http://www.voxforge.org/home/forums/message-boards/audio-discussions/comments-on-a-good-acoustic-model-needs-to-be-trained-with-speech-recorded-in-the-environment-it-is-targeted-to-recognize
* Udhyakumar Nallasamy: http://udhyakumar.tripod.com
Posted by: Ken MacLean | February 10, 2007 at 01:14 AM