A study of pitch, formant, and spectral estimation errors introduced by three lossy speech compression algorithms
2005; S. Hirzel Verlag; Volume: 91; Issue: 4; Language: English
ISSN: 1610-1928
Topic(s): Advanced Data Compression Techniques
Abstract: This paper quantifies some of the effects of three popular lossy audio compression algorithms on basic acoustic speech analysis procedures by comparing original audio-CD speech recordings to compressed/decompressed versions of these recordings. The differences found are benchmarked against the effects of a change of microphone. Tested are a Sony Minidisc Walkman recorder and two audio compression codecs: Ogg Vorbis 1.0rc3 and LAME 3.92 (MP3), with three bit rates: 40, 80 (Ogg Vorbis), and 192 kbs (MP3). These are tested against pitch and formant extraction and a global spectral measure, the spectral center of gravity (CoG, i.e., the first spectral moment). Audio compression added only a limited amount of jump errors (≤3%) to vowel pitch and formant measurements. Only small systematic effects on measurements were found that could be attributed to compression. However, rather large systematic effects resulted from a switch of microphone, mostly on the spectral center of gravity. The audio compression algorithms introduced a Root-Mean-Square (RMS) error, after removing jump errors, of less than 1 semitone to vowel mid-point pitch, formant, and CoG measurements. The effect of the microphone change on RMS error was as large (for pitch) or larger (> 1.2 semitones for formants and center of gravity). Comparison of the pitch in sonorants and the spectral center of gravity measurements in continuants showed that here too, the RMS errors introduced by the audio compression were always less than 1 semitone, except for the lowest bit-rate, 40 kbs, where CoG errors exploded in vowel-like consonants and fricatives (>2 semitones). The size of the errors shows an effect of compression factor (bit-rate). The higher bit-rate encodings always had smaller RMS errors, except for pitch measurements, where there was no effect of encoding or bit-rate whatsoever.
When audio compression is applied repeatedly, e.g., during recording, distribution, and archiving, the weakest link determines the total RMS error for pitch and formant measurements. However, the total RMS error of the CoG measurements is the sum of the component errors. It is concluded that Minidisc recordings and compressed speech at bit-rates of 80 kbs and up can be used for acoustic speech analysis if an increased RMS error of up to 1 semitone is acceptable. A low bit-rate encoding of 40 kbs introduces markedly larger errors in formant measurements and must be considered unsuitable for whole-spectrum measurements like the CoG. Repeatedly compressed speech is still useful for pitch and formant measurements, but whole-spectrum (e.g., CoG) measurements should only be used with care.
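The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's analysis procedure: the function names, the power weighting used for the center of gravity, and the 3-semitone cutoff used to flag jump errors are all assumptions made here for illustration, since the abstract does not specify them.

```python
import math


def semitone_difference(f_ref, f_test):
    """Signed difference between two frequencies (Hz), in semitones."""
    return 12.0 * math.log2(f_test / f_ref)


def center_of_gravity(freqs, power):
    """First spectral moment: the power-weighted mean frequency (Hz).

    Weighting by power is an assumption; other weightings are possible.
    """
    total = sum(power)
    return sum(f * p for f, p in zip(freqs, power)) / total


def rms_semitone_error(reference, test, jump_threshold=3.0):
    """RMS error in semitones after discarding jump errors.

    jump_threshold (semitones) is a hypothetical cutoff, not taken from
    the paper. Returns (rms_error, fraction_of_jump_errors).
    """
    diffs = [semitone_difference(r, t) for r, t in zip(reference, test)]
    kept = [d for d in diffs if abs(d) <= jump_threshold]
    jump_fraction = 1.0 - len(kept) / len(diffs)
    if not kept:
        return float("nan"), jump_fraction
    rms = math.sqrt(sum(d * d for d in kept) / len(kept))
    return rms, jump_fraction
```

Given paired measurements from the original and the compressed/decompressed recording, `rms_semitone_error(original_values, compressed_values)` would report both the share of measurements treated as jump errors and the RMS error of the remainder, which are the two figures the abstract reports per codec and bit-rate.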