
August 03, 2005


Ken MacLean

Hi Brough,

Excellent insights!

However, one thing to note is that with podcasts, people generally use some sort of lossy compression (e.g. MP3) for their audio files. Current Free or Open Source Speech Recognition Engines ("SRE"s) need acoustic models trained on uncompressed (e.g. WAV) or losslessly compressed (e.g. FLAC) audio. If you train an acoustic model with MP3 audio, the SRE will (in theory at least ...) work reasonably well for recognizing other MP3 speech audio, but will not work so well with telephony-based speech. You would need to get the podcast audio *before* it is encoded to MP3 to get good speech audio for training Acoustic Models.

In addition, SREs need training speech recorded at the same sample rate and sample resolution (bits per sample) as the speech to be recognized. For example, if you got lots of uncompressed podcast audio recorded at a 32kHz sample rate and 16 bits per sample, and trained an acoustic model with it, you could not then turn around and try to recognize telephony speech, which uses an 8kHz sampling rate at 8 bits per sample. A way around this is to downsample your speech audio to the rate of the speech to be recognized, and then train your Acoustic Models on this downsampled audio.
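The downsampling step can be sketched in a few lines of Python (a toy illustration only; the function names are mine, and a real pipeline would also low-pass filter the signal before decimating to avoid aliasing):

```python
def resample(samples, src_rate, dst_rate):
    """Resample a list of PCM samples from src_rate to dst_rate
    using linear interpolation (no anti-alias filtering)."""
    ratio = src_rate / dst_rate
    n_out = int(len(samples) / ratio)
    out = []
    for i in range(n_out):
        pos = i * ratio              # fractional position in the source
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(round(a + (b - a) * frac)))
    return out

def to_8bit(sample_16bit):
    # Crude 16-bit -> 8-bit reduction by dropping the low byte.
    # (Real telephony codecs use logarithmic mu-law/A-law encoding.)
    return sample_16bit >> 8

# One second of 32 kHz audio becomes 8,000 samples at 8 kHz.
one_second_32k = [0] * 32000
print(len(resample(one_second_32k, 32000, 8000)))  # 8000
```

In practice you would use a dedicated tool for this (and for the band-limiting), but the idea is the same: match the training audio to the rate and resolution of the audio you intend to recognize.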

This is the approach we are using at VoxForge (www.voxforge.org). Although we are currently asking users to submit 'read speech' from prompts from our site, I agree that uncompressed, or lossless compressed, podcast speech audio might be an excellent source of transcribed speech.

all the best,

Ken MacLean


Thanks for the comment, and the pointer to VoxForge!

I admit I'm not current, but I did a lot of audio compression DSP code in the 80s, including downsampling and resampling, so I guess I was blithely assuming that one would gather the best audio one could get, then resample and/or band-limit it to match the kind of audio one was trying to recognize. In the specific case of telephony speech (G.711), I'd assume the MP3 in most podcasts is vastly better than G.711, and therefore one could use MP3 podcast audio to train a telephony recognizer. But I've never done this, and I have been out of the field for more than a decade.

Do you have (or know anyone who has) experience (pro or con) in resampling MP3 audio to train a telephony recognizer?

Ken MacLean

Sorry, I don't have any personal experience in resampling MP3 audio to train Acoustic Models ('AM's) for a telephony recognizer, nor do I know of anyone who has tried.

It may be that even though MP3's lossy compression is not the best source of audio for the creation of AMs, it is 'good enough' for recognizing low-bandwidth audio in telephony environments. You can get pretty high sampling rates and bit resolutions in MP3 files (with proportionally larger file sizes), which may be enough to overcome the lossiness of MP3 audio.

However, speech recognition in low-bandwidth environments such as telephony is difficult enough as it is. Even with good quality audio, you still need (at the very minimum) over 100 hours of speech to create a somewhat reasonable AM (commercial AMs use hundreds of hours of speech). And remember, training an Acoustic Model with a large audio set can take many hours or even days!
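To put the 100-hour figure in perspective, here is a rough back-of-envelope calculation for the raw storage such a corpus needs (the 16kHz, 16-bit mono PCM format is my own assumption of a typical desktop training setup, not a VoxForge figure):

```python
# Storage needed for a 100-hour corpus of uncompressed training audio.
hours = 100
sample_rate = 16000    # Hz (assumed desktop training rate)
bytes_per_sample = 2   # 16-bit linear PCM, mono
size_bytes = hours * 3600 * sample_rate * bytes_per_sample
print(round(size_bytes / 2**30, 1))  # roughly 10.7 GiB
```

By 2005 standards that is a lot of audio to collect, store, and iterate over during training, which is part of why AM training runs can take days.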

To experiment with MP3 audio to see if it would be suitable for telephony recognition might be more work than it is worth, given the current state of Free and Open Source Speech Recognition Engines (such as Sphinx, Julius, ISIP, and HTK), especially if other good quality audio sources can be found (see this link: http://www.dev.voxforge.org/wiki/AudioSources ).

It may be that developers of Free and Open Source Speech Recognition Engines need to rethink their approach to the way they train AMs, so that they can train using audio from the web's vast repository of non-optimal audio sources like podcasts. Might be a good PhD research project for someone...

all the best,



Ken MacLean

Further to your question about the use of MP3 audio in the creation of Acoustic Models (AMs) for Speech Recognition: David Gelbart (who knows a lot more about speech recognition than I do) has a post on the VoxForge site where he mentions that Udhyakumar Nallasamy has done some experiments on the effect of MP3 coding on speech recognition.

So I think it was premature of me to discount the use of MP3 audio for AM training. At VoxForge, we will likely keep working with uncompressed (or losslessly compressed) audio for the foreseeable future, but when that well dries up, we'll likely start looking at other non-traditional audio sources, such as MP3 podcasts, etc...

Keep up the great prognosticating!


* David Gelbart: http://www.icsi.berkeley.edu/~gelbart
* His post: http://www.voxforge.org/home/forums/message-boards/audio-discussions/comments-on-a-good-acoustic-model-needs-to-be-trained-with-speech-recorded-in-the-environment-it-is-targeted-to-recognize
* Udhyakumar Nallasamy: http://udhyakumar.tripod.com
