Non-intrusive speech quality assessment using multi-resolution auditory model features for degraded narrowband speech
2015; Institution of Engineering and Technology; Volume: 9; Issue: 9; Language: English
10.1049/iet-spr.2014.0214
ISSN: 1751-9683
Authors: Rajesh Kumar Dubey, Arun Kumar
Topic(s): Acoustic Wave Phenomena Research
IET Signal Processing, Volume 9, Issue 9, pp. 638-646. Research Article. First published: 01 December 2015. DOI: 10.1049/iet-spr.2014.0214

Rajesh Kumar Dubey (corresponding author, rajeshk_dubey@yahoo.com): Center for Applied Research in Electronics, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India; Jaypee Institute of Information Technology, Noida, India
Arun Kumar: Center for Applied Research in Electronics, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India

Abstract

A multi-resolution framework using an auditory perception-based wavelet packet transform is invoked in the multi-resolution auditory model (MRAM) and used for non-intrusive objective speech quality estimation. The MRAM provides more detailed time-frequency modelling of the human auditory system than earlier models used for non-intrusive speech quality estimation. The objective Mean Opinion Score (MOS) of a degraded narrowband speech utterance is estimated with a Gaussian Mixture Model (GMM) probabilistic approach using the MRAM-based feature vector. For comparison, features based on a recent auditory model (Lyon's auditory model), mel-frequency cepstral coefficients (MFCC), and line spectral frequencies (LSF) are also used independently. The combination of MFCC and LSF features with MRAM features for non-intrusive speech quality estimation using the GMM probabilistic approach is proposed and investigated. The performance of these feature vectors is evaluated and compared with ITU-T Recommendation P.563 and a recently published method by computing the correlation coefficient and root-mean-square error (RMSE) between the subjective MOS and the estimated objective MOS. The proposed method, which combines the MRAM, MFCC, and LSF feature vectors, is found to outperform both of the other algorithms.

1 Introduction

Modern telecommunication networks such as mobile communications, voice over internet protocol (VoIP), and conventional telecommunication networks require the estimation of speech quality at different nodes of the network for system design, development, monitoring, and maintenance of the Quality of Service (QoS).
The performance evaluation of speech processing algorithms, especially speech coders, automatic speech and speaker recognition systems, and text-to-speech synthesis systems, also requires the evaluation of speech quality. A direct method for speech quality evaluation is to perform subjective listening tests using the Absolute Category Rating (ACR) method in compliance with ITU-T Recommendation P.800 [1]. However, this approach is unsuitable for the automatic speech quality estimation that most of the above applications require, and it is also time consuming and expensive. Thus, objective speech quality evaluation methods are becoming more important, not only to supplement subjective tests but also to replace them as reliable alternatives. Objective speech quality evaluation algorithms are broadly classified into two categories: intrusive (or double-sided) and non-intrusive (or single-sided). The difference in principle between intrusive and non-intrusive algorithms, and their relation to subjective assessment, is depicted in Fig. 1. Intrusive algorithms use the original clean speech as a reference for comparison with the degraded speech when estimating its quality. Their applicability is therefore limited to cases where the original clean speech is available in addition to the degraded utterance requiring quality assessment. Non-intrusive algorithms, on the other hand, do not require the original clean speech utterance as a reference and depend only on the received (degraded) speech utterance to estimate its quality [2].

Fig. 1: Intrusive, non-intrusive, and subjective assessment of speech quality

ITU-T Recommendation P.563 of May 2004 is a standard non-intrusive speech quality assessment method [3]. In another method, auditory non-intrusive quality estimation (ANIQUE), an auditory model utilising the temporal envelope representation of speech is used; it is based on the functional roles of the human auditory system and the characteristics of the human articulation system [4]. A further non-intrusive method, which applies a Gaussian Mixture Model (GMM) to different features obtained from speech coders without assuming any degradation model, is described in [5]. The goal of objective speech quality estimation, whose output is referred to as the Mean Opinion Score-Listening Quality Objective (MOS-LQO) or objective MOS, is to produce a score that correlates well with the Mean Opinion Score-Listening Quality Subjective (MOS-LQS) or subjective MOS. This is typically done by explicit or implicit modelling of the processes in the human auditory system that lead the brain to assign an opinion score to a speech utterance. Classical auditory models such as Lyon's auditory model and the auditory image model, when used for speech quality estimation, take into account the effect of simultaneous masking in the human auditory system [6] by modelling its compression stage as four cascaded automatic gain control (AGC) stages [7]. The AGC compresses the input signal into the limited dynamic range of basilar membrane (BM) motion using short time constants.
However, these models do not incorporate the effect of time-domain temporal masking, i.e., forward and backward masking, where the masker and maskee are not simultaneously present [8]. These masking phenomena also play an important role in a subject's hearing and judgement when assigning a quality opinion score to degraded speech utterances [9]. This work explores the benefits of the time-frequency domain processing of the multi-resolution auditory model (MRAM). The MRAM resembles the human auditory system in that it accounts for both frequency-domain simultaneous masking and time-domain temporal masking [10]. It is hypothesised that temporal masking also influences a subject's judgement when assigning a quality opinion score to noisy speech utterances. The kinds of distortion considered in this work are: (1) codecs and transcoding, (2) MNRU, and (3) additive noise. Temporal masking is especially relevant for additive noise, since noise components can mask neighbouring speech segments in time; temporal masking effects are therefore likely to be most significant for speech utterances degraded by additive noise. The frequency dependence of temporal masking is incorporated in the MRAM features used in the proposed algorithm for non-intrusive estimation of MOS-LQO. The model is based on wavelet packet decomposition according to the critical bands, and it transforms the input speech stimulus into an internal representation of the human auditory system that encompasses perceptually relevant features for speech quality estimation [11]. The MRAM features have also been combined with mel-frequency cepstral coefficients (MFCC) and line spectral frequencies (LSF), which have been used in previous objective speech quality evaluation algorithms; this combination further improves the correlation between the subjective and objective MOS scores [12]. To approximate the cognitive mapping function of the human brain, probabilistic modelling (Gaussian Mixture Modelling) maps these auditory feature vectors to the objective MOS score of each speech utterance.

The rest of the paper is organised as follows. Section 2 describes the different feature vectors that are combined in the objective speech quality evaluation algorithm: the MRAM, MFCC, and LSF features. Section 3 describes the GMM probabilistic approach, which treats speech quality estimation through a conditional probability density function. Section 4 describes the databases, presents the performance evaluation results for the different feature vector combinations, and discusses the results. Section 5 concludes the paper.

2 Features for speech quality assessment

This section describes the different feature vectors used in the speech quality assessment algorithm and how they are combined [12] into a single feature vector to improve the correlation between the subjective MOS and the estimated objective MOS; the combination is shown in Fig. 2. These feature vectors and their combinations are used for non-intrusive speech quality assessment.
In the combination of feature vectors used, the MRAM features capture auditory masking characteristics, the MFCC features model perceptual-domain processing characteristics, and the LSF features model the vocal tract resonances of the speech production model. These feature vectors thus model different aspects of speech production and audition, but they are also mutually correlated. Because of these correlations, dimensionality reduction using principal component analysis (PCA) has been applied and found to be effective.

Fig. 2: Feature vector computation and their combination

2.1 MRAM features

The MRAM produces a multi-resolution auditory excitation pattern at different time-frequency resolutions matched to the perceptual frequency scale for each speech input, and this pattern is used as the feature vector for speech quality evaluation. The multi-resolution framework captures the important auditory phenomenon of frequency-dependent temporal masking, which is not incorporated in traditional bank-of-filters or spectral-domain auditory models. The auditory phenomena incorporated in the MRAM include non-linear transformation of the frequency and amplitude scales, the absolute hearing threshold (AHT), the perceptual frequency scale, spectral spreading and integration, the temporal smearing process of the ear, and the effects of frequency-domain simultaneous masking and temporal masking. The model performs outer and middle ear (OME) weighting, multi-resolution spectral spreading, and multi-resolution spectral smearing. Power-law compression is applied to incorporate the effect of subjective loudness. The wavelet for the wavelet packet decomposition is designed from the critical band structure [10], so that it transfers close to the true energy corresponding to each critical band channel of the input speech signal. The MRAM transforms the input speech stimulus into the auditory excitation pattern of the internal representation at the output of the human peripheral auditory system [11]. The energy of the input speech signal is decomposed into critical band channels using the wavelet packet transform (WPT) at several time-frequency resolutions, which helps capture perceptually important events in the speech signal such as short-duration transients [13].

The computation of MRAM features is depicted in Fig. 3. The degraded input speech utterance is first passed through a voice activity detection (VAD) algorithm [14] to obtain the active speech and remove the silence regions. The MRAM features are computed on a frame-by-frame basis [11] for the active part of the speech only. The active part of an utterance has the greatest influence on a listener's perception of quality, as the silence regions within an utterance are typically short and contribute little to the quality judgement.
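The VAD of [14] is not reproduced here. As a rough, hedged illustration of the active-speech selection step only, the Python sketch below uses a simple frame-energy threshold; the function name, frame length, and threshold are illustrative assumptions, not the algorithm of [14].

```python
import numpy as np

def simple_energy_vad(x, fs=8000, frame_ms=16, threshold_db=-40.0):
    """Crude energy-threshold VAD: a stand-in for the VAD of [14].

    Returns a boolean mask over frames and the active frames themselves.
    """
    frame_len = int(fs * frame_ms / 1000)            # 128 samples at 8 kHz
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)
    # Energy of each frame relative to the loudest frame, in dB.
    energy_db = 10.0 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    active = energy_db > threshold_db
    return active, frames[active]
```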
Each windowed frame of the active speech is passed to the wavelet packet decomposition block, and the resulting coefficients are squared to compute the energy. The next block applies the OME weighting, which incorporates the effect of the AHT. The energy spreading block then spreads the energy over a longer time duration for the low-frequency components and a shorter time duration for the high-frequency components, and the temporal smearing block captures the effect of temporal masking. Finally, the intensity of the speech sound is adjusted for subjective loudness to obtain the MRAM features, which are then used for non-intrusive speech quality assessment.

Fig. 3: Block diagram depicting the processing in MRAM

For narrowband speech sampled at 8 kHz, the total number of critical bands is 17, distributed into the three sets given in Table 1: the first set encompasses critical bands 1-8, the second set covers bands 9-14, and the third set spans bands 15-17. Thus, 17 MRAM features are obtained (one per critical band) for each active speech frame, and their mean, variance, skewness, and kurtosis over all active speech frames are computed to form a 68-dimensional MRAM feature vector for each degraded speech utterance. PCA reduces the MRAM feature vector dimensionality from 68 to 22 while preserving 98% of the total energy of the 68-dimensional feature vector.

Table 1. Critical band (CB) indexed WPT coefficients and their bandwidths

| CB no. | CB-indexed WPT coeffs. | DWT coeffs. for the CB | Lower freq., Hz | Higher freq., Hz | Central freq., Hz | Bandwidth, Hz |
|--------|------------------------|------------------------|-----------------|------------------|-------------------|---------------|
| 1  | b[1,k]  | CB1 = AAAAA5  | 0    | 125  | 62.5  | 125 |
| 2  | b[2,k]  | CB2 = DAAAA5  | 125  | 250  | 187.5 | 125 |
| 3  | b[3,k]  | CB3 = DDAAA5  | 250  | 375  | 312.5 | 125 |
| 4  | b[4,k]  | CB4 = ADAAA5  | 375  | 500  | 437.5 | 125 |
| 5  | b[5,k]  | CB5 = ADDAA5  | 500  | 625  | 562.5 | 125 |
| 6  | b[6,k]  | CB6 = DDDAA5  | 625  | 750  | 687.5 | 125 |
| 7  | b[7,k]  | CB7 = DADAA5  | 750  | 875  | 812.5 | 125 |
| 8  | b[8,k]  | CB8 = AADAA5  | 875  | 1000 | 937.5 | 125 |
| 9  | b[9,k]  | CB9 = ADDA4   | 1000 | 1250 | 1125  | 250 |
| 10 | b[10,k] | CB10 = DDDA4  | 1250 | 1500 | 1375  | 250 |
| 11 | b[11,k] | CB11 = DADA4  | 1500 | 1750 | 1625  | 250 |
| 12 | b[12,k] | CB12 = AADA4  | 1750 | 2000 | 1875  | 250 |
| 13 | b[13,k] | CB13 = ADAD4  | 2000 | 2250 | 2125  | 250 |
| 14 | b[14,k] | CB14 = ADDD4  | 2250 | 2500 | 2375  | 250 |
| 15 | b[15,k] | CB15 = ADD3   | 2500 | 3000 | 2750  | 500 |
| 16 | b[16,k] | CB16 = DAD3   | 3000 | 3500 | 3250  | 500 |
| 17 | b[17,k] | CB17 = AAD3   | 3500 | 4000 | 3750  | 500 |
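The sketch below shows, in hedged form, how the per-frame critical-band energies and the 68-dimensional statistics vector could be computed with an off-the-shelf wavelet toolbox. The paper uses a custom wavelet designed from the critical band structure [10]; this sketch substitutes the standard 'db4' wavelet, omits the OME weighting, spectral spreading, temporal smearing, and loudness compression stages, and assumes the node-path letters of Table 1 map directly onto PyWavelets paths.

```python
import numpy as np
import pywt
from scipy.stats import skew, kurtosis
from sklearn.decomposition import PCA

# WPT node paths for the 17 critical bands, transcribed from Table 1
# (levels 5, 4, and 3); the a/d letter ordering is an assumption.
CB_PATHS = ['aaaaa', 'daaaa', 'ddaaa', 'adaaa', 'addaa', 'dddaa',
            'dadaa', 'aadaa', 'adda', 'ddda', 'dada', 'aada',
            'adad', 'addd', 'add', 'dad', 'aad']

def mram_like_features(active_frames, wavelet='db4'):
    """68-dim MRAM-style statistics for one utterance (sketch only)."""
    band_energies = []
    for frame in active_frames:
        wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                                mode='symmetric', maxlevel=5)
        # Energy of the WPT coefficients in each critical-band node.
        band_energies.append([np.sum(wp[p].data ** 2) for p in CB_PATHS])
    E = np.asarray(band_energies)                    # (n_frames, 17)
    # Mean, variance, skewness, kurtosis per band -> 4 x 17 = 68 dims.
    return np.concatenate([E.mean(axis=0), E.var(axis=0),
                           skew(E, axis=0), kurtosis(E, axis=0)])

# Over a corpus, PCA reduces the 68-dim vectors to 22 dims (retaining
# 98% of the energy), e.g.:
#   pca = PCA(n_components=22).fit(np.vstack(per_utterance_vectors))
```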
2.2 Significance of MRAM analysis

Most natural speech sounds are characterised by modulations of acoustic energy in both the spectral and temporal domains. These modulations manifest at multiple scales in both domains and are critical stimulus dimensions for the processing of speech sounds in the human auditory cortex [15]. Joint spectro-temporal processing can be modelled as an array of spectro-temporal filters selective for combinations of spectral and temporal modulations, whereas an independent representation can be seen as a bank of filters selective for either temporal or spectral modulations alone. The MRAM matches the human auditory system better than classical fixed-resolution auditory models such as that of [4] and the models used in [12]. In particular, the fine time resolution at high frequencies and fine frequency resolution at low frequencies are better captured in the MRAM features. Furthermore, representations of sounds at multiple resolutions may provide the computational basis for binding acoustic elements in sound mixtures and for incorporating more complex auditory phenomena. These aspects are expected to give a perceptually better allocation of the distortions present in degraded speech signals. The MRAM can more accurately capture the changes in speech perception caused by additive quasi-stationary noise, short-duration transients, reverberation, and non-linear processing such as spectral subtraction, and it accounts better for intelligibility conditions with quasi-stationary noise, fluctuating interferers, and noisy speech distorted by reverberation or spectral subtraction [16].

2.3 MFCC features

A widely used feature representation of a speech frame is the MFCC, which captures the variation of the human ear's critical bandwidths with frequency [17]. It uses two types of filters: linearly spaced filters below 1000 Hz and logarithmically spaced filters above 1000 Hz. The MFCC has been shown to be an effective representation of the perceptual quality of a speech signal [18]. The active speech is segmented into frames of length 16 ms and each frame is windowed with a Hamming window. A 13-dimensional MFCC vector is then obtained for each active speech frame by taking the DCT of the log-mel spectrum coefficients computed for that frame. The global 13-dimensional feature vector for each degraded speech utterance is computed by averaging the MFCC vectors over all active speech frames [17].

2.4 Linear prediction coefficients (LPC) and LSF-based features

The LSF features offer an alternative, efficient representation of the spectral envelope of speech, as borne out by their extensive use in speech coding algorithms [19]. They carry intrinsic information about the formant structure, which is related to the resonance frequencies of the speaker's vocal tract during articulation. The active speech is segmented into 16 ms frames and each frame is windowed with a Hamming window. A tenth-order LPC analysis over each frame yields 10 LSFs per frame, and the global 10-dimensional LSF feature vector for the utterance is obtained by averaging over all active speech frames.

For comparison with a recent auditory model used in a speech quality estimation algorithm [12], 64-channel Lyon's auditory model features, which yield a 64-dimensional feature vector per frame, have also been used. The mean, variance, skewness, and kurtosis are computed for each channel output over all frames, producing a 256-dimensional Lyon's auditory model feature vector for each degraded speech utterance. PCA is applied to reduce the feature vector dimensionality to 14, which retains 98% of the total energy of the 256-dimensional feature vector.

The following sets of feature vectors have been used in the experiments to compare the performance of the MRAM features; a sketch of how they are computed and combined is given after this list:

1. MFCC feature vector of dimension 13;
2. LSF feature vector of dimension 10;
3. Lyon's auditory model feature vector of reduced dimension 14;
4. MRAM feature vector of reduced dimension 22;
5. the MRAM feature vector of reduced dimension 22, the MFCC feature vector of dimension 13, and the LSF feature vector of dimension 10, combined to give a 45-dimensional feature vector per utterance.
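The following hedged sketch makes the MFCC and LSF computation and the 45-dimensional combination concrete. It assumes librosa for the MFCC and LPC steps and implements the LPC-to-LSF conversion directly (NumPy/SciPy provide no standard routine for it); function names are illustrative, and the exact filterbanks and windowing of [17, 19] may differ.

```python
import numpy as np
import librosa

def lpc_to_lsf(a):
    """LSFs from an LPC polynomial a = [1, a1, ..., ap], via the roots
    of the symmetric/antisymmetric polynomials P(z) and Q(z)."""
    a = np.asarray(a)
    p = len(a) - 1
    P = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    Q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    ang = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    # Keep the p line spectral frequencies in (0, pi), sorted.
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])[:p]

def combined_features(active, fs=8000, mram_vec=None):
    """13 mean MFCCs + 10 mean LSFs for one utterance; prepending a
    22-dim MRAM vector gives the 45-dim combined feature."""
    n = int(0.016 * fs)                              # 16 ms frames
    # n_mels=26 keeps the mel filters non-empty for a 128-point FFT.
    mfcc = librosa.feature.mfcc(y=active, sr=fs, n_mfcc=13, n_fft=n,
                                hop_length=n, n_mels=26).mean(axis=1)
    frames = active[:len(active) // n * n].reshape(-1, n) * np.hamming(n)
    lsf = np.mean([lpc_to_lsf(librosa.lpc(f, order=10)) for f in frames],
                  axis=0)
    parts = ([mram_vec] if mram_vec is not None else []) + [mfcc, lsf]
    return np.concatenate(parts)                     # 22 + 13 + 10 = 45
```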
3 GMM probabilistic approach for objective MOS computation

The computed feature vectors and their combinations, described in the previous section, are mapped to the objective MOS using the GMM probabilistic approach. A GMM Π(μ(k), ω(k), Σ(k)) with k = 1, 2, ..., K mixture components is used, where μ(k), ω(k), and Σ(k) are the mean vector, mixture weight, and covariance matrix of the kth mixture component. The GMM is trained on large speech databases of utterances with subjective MOS scores. The subjective MOS score θj from the MOS-labelled database is appended to the feature vector Ψj computed for the jth training utterance, and the joint vectors [Ψj, θj], j = 1, 2, ..., J, are used to train a joint GMM with the Expectation-Maximisation algorithm [20], as shown in Fig. 4.

Fig. 4: Block diagram illustrating GMM training steps

The aim is then to obtain an objective MOS estimator for the quality of a speech utterance as a function of its reduced-size feature vector Ψ, given the trained joint GMM parameters Π(μ(k), ω(k), Σ(k)), as shown in Fig. 5. The objective MOS estimate is obtained using the MMSE criterion [5]

\hat{\theta} = E[\theta \mid \Psi]    (1)

Fig. 5: Block diagram illustrating GMM-based objective MOS estimation

The joint density of the feature vector variables and the subjective MOS score is modelled as the GMM

p(\Psi, \theta) = \sum_{k=1}^{K} \omega^{(k)} \, \mathcal{N}\left([\Psi; \theta]; \mu^{(k)}, \Sigma^{(k)}\right)    (2)

where \mathcal{N}(\cdot; \mu^{(k)}, \Sigma^{(k)}) is the multivariate Gaussian density with mean vector μ(k) and covariance matrix Σ(k) of the kth mixture component. The objective MOS estimator is then

\hat{\theta}(\Psi) = \sum_{k=1}^{K} h^{(k)}(\Psi) \, g^{(k)}(\Psi)    (3)

where

h^{(k)}(\Psi) = \frac{\omega^{(k)} \, \mathcal{N}(\Psi; \mu_{\Psi}^{(k)}, \Sigma_{\Psi\Psi}^{(k)})}{\sum_{m=1}^{K} \omega^{(m)} \, \mathcal{N}(\Psi; \mu_{\Psi}^{(m)}, \Sigma_{\Psi\Psi}^{(m)})}    (4)

and

g^{(k)}(\Psi) = \mu_{\theta}^{(k)} + \Sigma_{\theta\Psi}^{(k)} \left(\Sigma_{\Psi\Psi}^{(k)}\right)^{-1} (\Psi - \mu_{\Psi}^{(k)})    (5)

where \mu_{\Psi}^{(k)}, \mu_{\theta}^{(k)}, \Sigma_{\Psi\Psi}^{(k)}, and \Sigma_{\theta\Psi}^{(k)} are the means, covariance, and cross-covariance matrices of Ψ and θ. Experiments with K = 8, 12, and 16 mixture components showed that K = 12 gives the best correlation between the subjective MOS and the mapped objective MOS. Thus, in this work, K = 12 mixture components with full covariance matrices are used for modelling the probability density function.
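A hedged sketch of the training and estimation steps of (1)-(5) follows, using scikit-learn's GaussianMixture for the EM training of the joint density and implementing the conditional-mean estimator directly; the variable names and the use of scikit-learn are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(Psi, theta, K=12):
    """Fit the joint GMM of eq. (2) on [feature, subjective MOS] pairs.

    Psi: (J, d) reduced feature vectors; theta: (J,) subjective MOS.
    """
    joint = np.hstack([Psi, theta[:, None]])
    return GaussianMixture(n_components=K, covariance_type='full',
                           max_iter=200, random_state=0).fit(joint)

def estimate_mos(gmm, psi):
    """MMSE objective MOS E[theta | psi], i.e. eqs. (3)-(5)."""
    d = psi.size
    num, den = 0.0, 0.0
    for w, mu, S in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        lik = w * multivariate_normal.pdf(psi, mean=mu[:d], cov=S[:d, :d])
        # g^(k)(psi): conditional mean of theta given psi, eq. (5).
        g = mu[d] + S[d, :d] @ np.linalg.solve(S[:d, :d], psi - mu[:d])
        num += lik * g                   # numerator of eq. (3)
        den += lik                       # normaliser of eq. (4)
    return num / den
```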
4 Results and discussion

In this section, the databases are described and experimental results are presented in terms of the correlation between the subjective and the estimated objective MOS. Inferences are drawn from the computed results, and the results are compared with a recently published work [21].

4.1 Database description

Three databases, namely the ITU-T P. Supplement-23 database [22], the NOIZEUS-2240 database, and the NOIZEUS-960 database [23], are used in this study. The ITU-T P. Supplement-23 database contains 1328 speech utterances, each of 8 s duration and sampled at 8 kHz, covering 332 different degradation conditions for which ACR-labelled subjective MOS were available. The degradation conditions include various classes of coder/channel distortion and random/bursty frame erasures with noise (vehicle, street, and Hoth noise) at a 20 dB signal-to-noise ratio (SNR), for American English, French, Japanese, and Italian. The NOIZEUS-2240 database contains 2240 degraded speech utterances under 112 different degradation conditions: 20 clean speech utterances, each of 3 s duration and sampled at 8 kHz, are degraded by four types of noise (babble, car, street, and train) at two SNR levels (5 and 10 dB) and passed through 14 different speech enhancement and noise suppression algorithms, namely MMSE-STSA (six algorithms), spectral subtraction (three algorithms), subspace approaches (two algorithms), and Wiener filtering (three algorithms). The NOIZEUS-960 database contains 30 clean speech utterances, each of 3 s duration and sampled at 8 kHz. Each utterance is degraded at four SNR levels (0, 5, 10, and 15 dB) with eight types of noise (airport, babble, car, exhibition, restaurant, station, street, and suburban train), resulting in 960 degraded speech utterances under 32 different degradation conditions. Subjective listening tests on the 3200 utterances of the NOIZEUS-2240 and NOIZEUS-960 databases were conducted in our laboratory with 21 listeners, and their opinion scores were averaged per utterance to obtain the ACR-labelled subjective MOS.

The 1328 utterances of the ITU-T P. Supplement-23 database are first randomised and a leave-one-subset-out procedure is then used for GMM training and objective MOS computation. The database is divided into ten subsets: eight subsets of 133 utterances and two of 132. Nine subsets are used to train the GMM while the remaining subset is used for objective MOS computation; the procedure is repeated so that each of the ten subsets is used once for objective MOS computation, with the other nine used for training, thereby generating an objective MOS score for all 1328 utterances. The same approach is followed for the NOIZEUS-2240 and NOIZEUS-960 databases. Thus, training and testing are performed in a 9:1 ratio (tenfold cross-validation), repeated ten times over large databases.

4.2 Performance evaluation criteria

In the literature, the performance of speech quality estimation algorithms is generally assessed using Karl Pearson's correlation coefficient between the condition-averaged predicted objective MOS and the subjective MOS θ [2, 4, 5]. In this work, Pearson's correlation coefficient between the estimated objective MOS and the subjective MOS θ is computed both for the condition-averaged MOS and for the unconditioned MOS (without condition averaging). The root-mean-square error (RMSE) between the estimated objective MOS and the subjective MOS θ is also used as a figure of merit alongside the correlation coefficient, for both cases, in the performance evaluation and in the comparison with the speech quality algorithm of a recent work [21].
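A small hedged sketch of this evaluation is given below, computing Pearson's correlation coefficient and RMSE for both the unconditioned (per-utterance) scores and the condition averages discussed in the next subsection; the DataFrame column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def evaluate(df):
    """Pearson r and RMSE between subjective and estimated objective MOS.

    df is assumed to have columns 'mos_subj', 'mos_obj', and 'condition'.
    Returns {case: (r, rmse)} for the unconditioned and
    condition-averaged cases.
    """
    def metrics(s, o):
        return pearsonr(s, o)[0], float(np.sqrt(np.mean((s - o) ** 2)))

    # Condition averaging: mean subjective and objective MOS per condition.
    avg = df.groupby('condition')[['mos_subj', 'mos_obj']].mean()
    return {
        'unconditioned': metrics(df['mos_subj'], df['mos_obj']),
        'condition_averaged': metrics(avg['mos_subj'], avg['mos_obj']),
    }
```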
4.3 Condition-averaged MOS versus unconditioned MOS

In the condition-averaged MOS, the MOS scores are averaged over all speech utterances sharing the same 'condition' of degradation, where 'condition' refers to utterances subjected to the same degradation or passed through the same speech-processing algorithm. Most previous work in the literature [2, 4, 5] has compared results for condition-averaged MOS only, but this is not a very realistic measure, so we also propose the unconditioned case for evaluating subjective and objective MOS.

In the unconditioned MOS case, the estimated objective MOS and the listeners' subjective MOS for individual speech utterances are used to compute the correlation coefficient on a sentence-by-sentence basis within each degradation condition, instead of correlating the condition averages of the objective and subjective MOS. The averaging process suppresses the variation present within a degradation condition, whereas in practice one is typically interested in the accuracy of the estimated objective MOS for a single utterance; the correlation or RMSE computed on unconditioned MOS is therefore more realistic, and the unconditioned performance is included in this study for this reason. The results are expressed as the correlation coefficients between the condition-averaged estimated objective MOS and the condition-averaged subjective MOS in Table 2, and between the unconditioned estimated objective MOS and the unconditioned subjective MOS in Table 3. The standard deviation and confidence interval of all results are also computed and compared with ITU-T Rec. P.563 to show the efficacy of the method.

Table 2. Pearson's correlation coefficients between the condition-averaged subjective MOS and the condition-averaged estimated objective MOS obtained from different sets of feature vectors: (1) 13-dimensional MFCC, (2) 10-dimensional LSF, (3) 14-dimensional Lyon's auditory model features, (4) 22-dimensional MRAM features, (5) the combination of the 22-dimensional MRAM, 13-dimensional MFCC, and 10-dimensional LSF feature vectors

| Data of different expts. | No. of utterances | ITU-T Rec. P.563 | MFCC | LSF | Lyon | MRAM | MRAM+MFCC+LSF |
|---|---|---|---|---|---|---|---|
| 8 kbps ITU & ETSI standard CODECS interworking | | | | | | | |
| exp.1(A)-French | 176 | 0.885 | 0.885 | 0.919 | 0.849 | 0.86 | 0.938 |
| exp.1(D)-Japanese | 176 | 0.842 | 0.869 | 0.821 | 0.885 | 0.875 | 0.916 |
| exp.1(O)-A. English | 176 | 0.902 | 0.911 | 0.917 | 0.91 | 0.876 | 0.911 |
| Channel errors and background noise | | | | | | | |
| exp.3(A)-French | 200 | 0.867 | 0.754 | 0.588 | 0.783 | 0.815 | 0.868 |
| exp.3(C)-Italian | 200 | 0.854 | 0.817 | 0.738 | 0.814 | 0.803 | 0.851 |
| exp.3(D)-Japanese | 200 | 0.929 | 0.855 | 0.783 | 0