Speech enhancement methods differ according to the degree of degradation and the type of noise in the speech signal, so research in the field remains challenging, especially when dealing with highly transient residual and background noise. Numerous deep learning networks have been developed that provide promising results for improving the perceptual quality and intelligibility of noisy speech. The power of deep learning techniques has opened up innovation and research in speech enhancement, with implications across a wide range of real-time applications. This paper provides a comprehensive overview by reviewing the important datasets, feature extraction methods, deep learning models, training algorithms, and evaluation metrics for speech enhancement. We begin by tracing the evolution of speech enhancement research, from early approaches to recent advances in deep learning architectures. By analyzing and comparing the approaches to solving speech enhancement challenges, we categorize them according to their strengths and weaknesses. Moreover, we discuss the challenges and future directions of deep learning in speech enhancement, including the demand for parameter-efficient models. The purpose of this paper is to examine the development of the field, compare and contrast different approaches, and highlight future directions as well as challenges for further research.
In this paper, a speech enhancement method using noise classification and a Deep Neural Network (DNN) is proposed. A Gaussian mixture model (GMM) is employed to determine the noise type in speech-absent frames, and a DNN is used to model the relationship between the noisy observation and clean speech. Once the noise type is determined, the corresponding DNN model is applied to enhance the noisy speech. The GMM is trained on mel-frequency cepstral coefficients (MFCCs), with its parameters estimated by the iterative expectation-maximization (EM) algorithm, and the noise type is updated by spectrum-entropy-based voice activity detection (VAD). Experimental results demonstrate that the proposed method achieves better objective speech quality and less distortion under both stationary and non-stationary conditions.
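As a rough illustration of the noise-classification stage, the sketch below fits one GMM per noise type on MFCC features and picks the most likely type for speech-absent frames. The librosa-based MFCC extraction and the mixture sizes are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch: GMM-based noise-type classification on MFCCs.
# librosa and scikit-learn are assumed; mixture size is an arbitrary choice.
import librosa
from sklearn.mixture import GaussianMixture

def train_noise_gmms(noise_examples, sr=16000, n_components=8):
    """Fit one GMM (EM-trained, as in the paper) per noise type.
    noise_examples: dict mapping noise-type name -> 1-D noise waveform."""
    gmms = {}
    for noise_type, signal in noise_examples.items():
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T  # frames x coeffs
        gmms[noise_type] = GaussianMixture(n_components=n_components).fit(mfcc)
    return gmms

def classify_noise(gmms, speech_absent_audio, sr=16000):
    """Pick the noise type with the highest average log-likelihood."""
    mfcc = librosa.feature.mfcc(y=speech_absent_audio, sr=sr, n_mfcc=13).T
    return max(gmms, key=lambda t: gmms[t].score(mfcc))
```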
An important stage in speech enhancement is estimating the noise signal, which is a difficult task under non-stationary and low signal-to-noise-ratio conditions. This paper presents an iterative speech enhancement approach that requires no prior knowledge of the noise and is based on low-rank sparse matrix decomposition using a Gammatone filterbank and a convex distortion measure. To estimate the noise and speech, the noisy speech is decomposed into a low-rank noise part and a sparse speech part by enforcing sparsity regularization. Neither the exact distribution of the noise signal nor an explicit noise estimator is required. The experimental results demonstrate that our approach outperforms competing methods and yields better overall speech quality and intelligibility. Moreover, the composite objective measures confirm better performance in terms of residual noise and speech distortion under adverse noisy conditions, and time-varying spectral analysis validates a significant reduction of the background noise.
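The low-rank/sparse split can be illustrated with a generic robust-PCA-style decomposition of a magnitude spectrogram. The alternating singular-value/soft thresholding below is a textbook stand-in, not the paper's Gammatone-domain formulation with its convex distortion measure; the weights are standard RPCA heuristics.

```python
# Sketch: alternating proximal minimization of
#   0.5*||Y - L - S||_F^2 + mu*||L||_* + lam*mu*||S||_1
import numpy as np

def lowrank_sparse_split(Y, lam=None, mu=None, n_iter=100):
    """Split a (freq x time) spectrogram Y into low-rank noise L
    and sparse speech S."""
    m, n = Y.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))   # standard RPCA sparsity weight
    mu = mu or 0.25 * np.abs(Y).mean()      # shrinkage strength (heuristic)
    L = np.zeros_like(Y)
    S = np.zeros_like(Y)
    for _ in range(n_iter):
        # low-rank update: singular value thresholding of Y - S
        U, sig, Vt = np.linalg.svd(Y - S, full_matrices=False)
        L = U @ np.diag(np.maximum(sig - mu, 0.0)) @ Vt
        # sparse update: elementwise soft thresholding of Y - L
        R = Y - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam * mu, 0.0)
    return L, S
```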
A deep neural network (DNN) has recently been successfully adopted as a regression model in speech enhancement. Nonetheless, training a model to adapt to different noises is challenging, because every noise has its own characteristics that, combined with the speech utterance, produce huge variation over which the model must operate. Thus, a joint framework combining noise classification (NC) and speech enhancement using DNNs was proposed. We first determine the noise type of the contaminated speech using a voice activity detection DNN (VAD-DNN) and an NC-DNN. Then, based on the noise classification results, the corresponding SE-DNN model is applied to enhance the contaminated speech. In addition, to keep the method simple, the structures of the different DNNs are similar and the features are the same. Experimental results show that the proposed method effectively improves the performance of speech enhancement in complex noise environments, and that the classification accuracy has a great influence on the speech enhancement.
The a priori signal-to-noise ratio (SNR) plays an essential role in many speech enhancement systems. Most existing approaches to estimating the a priori SNR exploit only the amplitude spectra while neglecting the phase. Considering that incorporating phase information into a speech processing system can significantly improve speech quality, this paper proposes a phase-sensitive decision-directed (DD) approach to a priori SNR estimation. By representing the short-time discrete Fourier transform (STFT) signal spectra geometrically in the complex plane, the proposed approach estimates the a priori SNR using both magnitude and phase information while making no assumptions about the phase difference between the clean speech and noise spectra. Objective evaluations in terms of spectrograms, segmental SNR, log-spectral distance (LSD), and short-time objective intelligibility (STOI) measures are presented to demonstrate the superiority of the proposed approach over several competitive methods at different noise conditions and input SNR levels.
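For context, the classical magnitude-only DD estimator that this work extends smooths the previous frame's clean-speech estimate with the current instantaneous SNR. A minimal sketch follows, with the smoothing factor and noise PSD as assumed inputs; the paper's phase-sensitive geometry is not reproduced here.

```python
# Sketch of the classical decision-directed a priori SNR estimate:
#   xi(k,l) = alpha*|A(k,l-1)|^2/lambda_d(k) + (1-alpha)*max(gamma(k,l)-1, 0)
import numpy as np

def decision_directed_snr(noisy_mag, noise_psd, alpha=0.98):
    """noisy_mag: (freq x frames) STFT magnitudes; noise_psd: (freq,) estimate."""
    gamma = (noisy_mag ** 2) / noise_psd[:, None]          # a posteriori SNR
    xi = np.empty_like(gamma)
    prev_clean_power = np.zeros(noisy_mag.shape[0])
    for l in range(noisy_mag.shape[1]):
        xi[:, l] = alpha * prev_clean_power / noise_psd \
                   + (1.0 - alpha) * np.maximum(gamma[:, l] - 1.0, 0.0)
        gain = xi[:, l] / (1.0 + xi[:, l])                 # Wiener gain
        prev_clean_power = (gain * noisy_mag[:, l]) ** 2   # |A(k,l)|^2
    return xi
```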
To address the perceptual degradation caused by typical phase reconstruction methods, an improved method combining phase reconstruction and MMSE-LSA estimation is proposed. First, the geometric relationship between noisy speech and clean speech in unvoiced segments is used to estimate the phase of the clean speech. Second, considering the randomness of speech presence in real noise environments, a modified MMSE-LSA amplitude estimator is proposed based on a binary hypothesis model. Finally, the new phase reconstruction for voiced and unvoiced speech is combined with the modified MMSE-LSA. Simulation results show that the proposed algorithm outperforms the typical phase reconstruction method in terms of SegSNR and PESQ.
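The amplitude estimator being modified here is the standard Ephraim-Malah MMSE log-spectral amplitude gain. A minimal sketch, omitting the binary-hypothesis modification and the phase reconstruction:

```python
# Sketch of the standard MMSE-LSA gain:
#   G = xi/(1+xi) * exp(0.5 * E1(v)), with v = xi*gamma/(1+xi)
import numpy as np
from scipy.special import exp1  # exponential integral E1

def mmse_lsa_gain(xi, gamma):
    """xi: a priori SNR, gamma: a posteriori SNR (arrays of equal shape)."""
    v = np.maximum(xi * gamma / (1.0 + xi), 1e-10)  # avoid E1(0) divergence
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))
```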
In the present work, we investigate the performance of a number of traditional and recent speech enhancement algorithms under the adverse non-stationary conditions characteristic of motorcycles on the move. The algorithms are ranked by the improvement they contribute to speech recognition accuracy relative to the baseline performance, i.e. without speech enhancement. Experiments on the MoveOn motorcycle speech and noise database indicate that there is no equivalence between rankings based on human perception of speech quality and those based on speech recognition performance. The multi-band spectral subtraction method was observed to yield the highest speech recognition performance.
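Multi-band spectral subtraction, the best performer here, subtracts an over-subtracted noise power estimate independently in each of several frequency bands, so that bands with poor local SNR are attenuated more. The sketch below is a generic single-frame version; the band split, band weights, and spectral floor are illustrative assumptions.

```python
# Sketch: per-band over-subtraction of noise power (Kamath/Loizou style).
import numpy as np

def multiband_spectral_subtraction(noisy_power, noise_power, n_bands=4,
                                   delta=(1.0, 2.5, 2.5, 1.5), beta=0.002):
    """noisy_power, noise_power: (freq,) power spectra of one frame."""
    clean_power = np.empty_like(noisy_power)
    edges = np.linspace(0, len(noisy_power), n_bands + 1, dtype=int)
    for i in range(n_bands):
        band = slice(edges[i], edges[i + 1])
        seg_snr = 10 * np.log10(noisy_power[band].sum() / noise_power[band].sum())
        # SNR-dependent over-subtraction factor (Berouti-style heuristic)
        alpha = np.clip(4.0 - seg_snr * 3.0 / 20.0, 1.0, 4.75)
        sub = noisy_power[band] - alpha * delta[i] * noise_power[band]
        clean_power[band] = np.maximum(sub, beta * noisy_power[band])  # floor
    return clean_power
```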
In this paper, a performance evaluation of the Modified Cascaded Median (MCM)-based noise estimation method for speech enhancement is carried out. Although reported earlier, the MCM-based method had not been extensively evaluated; in particular, its real-time performance had not been considered. In the present study, the MCM-based noise estimation method is compared with methods based on Dynamic Quantile Tracking (DQT) and the Cascaded Median (CM), through simulation as well as real-time implementation on a TMS320C6416T DSK. All comparisons were made for speech quality (subjectively, via mean opinion score, and objectively, via PESQ score, log-likelihood ratio, weighted spectral slope distance, segmental signal-to-noise ratio, and the composite measures for signal distortion CSIG, background intrusiveness CBAK, and overall distortion COVL) at the 95% confidence level. Real-time parameters, namely memory consumption and execution time, were measured for the real-time implementations of the three methods. The results for speech degraded at different SNRs show that MCM-based noise estimation is the best in terms of PESQ score, CSIG, CBAK, COVL, and mean opinion score, while for speech corrupted by different noise types it performs well compared to the original CM. Memory consumption and average execution time for the MCM-based noise estimation lie between those of the DQT- and CM-based methods.
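A cascaded median tracker estimates the noise spectrum by median-filtering short blocks of frames and then median-filtering those block medians again, approximating a long median at much lower memory cost. The sketch below is a generic two-stage version for intuition only; the paper's MCM modification is not reproduced.

```python
# Sketch: generic two-stage cascaded-median noise-PSD estimate.
import numpy as np

def cascaded_median_noise(noisy_power, stage_len=8):
    """noisy_power: (freq x frames) power spectrogram; returns (freq,) PSD."""
    freq, frames = noisy_power.shape
    n_blocks = frames // stage_len
    # stage 1: median over each short block of frames
    stage1 = np.array([
        np.median(noisy_power[:, b * stage_len:(b + 1) * stage_len], axis=1)
        for b in range(n_blocks)
    ]).T                                  # (freq x blocks)
    # stage 2: median over the block medians
    return np.median(stage1, axis=1)      # (freq,)
```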
In general, background noise degrades speech quality. The intelligibility of speech can therefore be enhanced by mitigating the effects of background noise and by echo suppression, so speech enhancement can be viewed as an optimization problem. In this work, the directed search optimization (DSO) method is used to enhance degraded speech. The performance of the DSO-based speech enhancement method is compared with particle swarm optimization (PSO)- and least mean square (LMS)-based methods in terms of output average segmental SNR (ASSNR) and speech quality. The experimental results show that the output spectrogram, output ASSNR, and speech quality of the DSO algorithm are far better than those of the PSO- and LMS-based methods. Moreover, the DSO-based method is computationally less complex than the PSO-based method.
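Of the baselines named, LMS is the simplest: an adaptive filter shapes a noise reference so that subtracting it from the primary channel cancels the correlated noise. A minimal sketch, assuming a separate noise-reference input is available:

```python
# Sketch: classic LMS adaptive noise canceller.
import numpy as np

def lms_noise_canceller(primary, reference, order=32, mu=0.01):
    """primary: speech + noise; reference: correlated noise-only channel."""
    w = np.zeros(order)
    out = np.zeros_like(primary)
    for n in range(order, len(primary)):
        x = reference[n - order:n][::-1]   # reference tap vector
        e = primary[n] - w @ x             # error = enhanced speech sample
        w += 2 * mu * e * x                # LMS weight update
        out[n] = e
    return out
```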
In this paper, the performance of a compressive sensing (CS)-based technique for speech enhancement is studied, and the results are analyzed by comparing the performance of several greedy recovery algorithms: matching pursuit, orthogonal matching pursuit, stagewise orthogonal matching pursuit, compressive sampling matching pursuit, and generalized orthogonal matching pursuit. The evaluation has been carried out using objective measures (perceptual evaluation of speech quality, log-likelihood ratio, weighted spectral slope distance, and segmental signal-to-noise ratio), simulation time, and composite objective measures (signal distortion CSIG, background intrusiveness CBAK, and overall quality COVL). The results show that the CS-based technique using the generalized orthogonal matching pursuit algorithm yields better performance than the other recovery algorithms in terms of speech quality and distortion.
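Orthogonal matching pursuit, one of the recovery algorithms compared, greedily selects the dictionary atom most correlated with the current residual and re-solves a least-squares fit over the selected support. A minimal sketch of the plain OMP variant:

```python
# Sketch: orthogonal matching pursuit for y ~= A @ x with k-sparse x.
import numpy as np

def omp(A, y, k):
    """A: (m x n) sensing/dictionary matrix; y: (m,) measurements."""
    residual, support = y.copy(), []
    x = np.zeros(A.shape[1])
    for _ in range(k):
        support.append(int(np.argmax(np.abs(A.T @ residual))))   # best atom
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)  # LS re-fit
        residual = y - A[:, support] @ coef
    x[support] = coef
    return x
```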
Speech processing is an important application area of digital signal processing that helps examine and analyze the speech signal. Within it, speech enhancement is essential because it improves the quality of the signal and helps resolve communication challenges. Various speech enhancement algorithms are used in the field, but limited processing capability, maximum microphone distance, and voice-first I/O interfaces raise the computational complexity. In this paper, speech enhancement is performed in two stages. In the first stage, the spectral subtraction method is applied to the LJ Speech dataset: the noise spectrum is estimated during pauses and subtracted from the noisy speech signal to obtain the clean speech signal. However, spectral subtraction still introduces artificial and narrow-band noise into the spectrum. Hence, artificial bandwidth expansion with a deep shallow convolutional neural network (ABE-DSCNN) is implemented as the second stage. The developed system is compared with conventional enhancement approaches such as a deep neural network (DNN), neural beamforming (NB), and a generative adversarial network (GAN). The experimental results show that the ABE-DSCNN provides a 4% increase in PESQ, with the error rate improved by 40% to 56% with respect to the other existing algorithms for 1000 speech samples. Hence, the paper concludes that the ABE-DSCNN approach effectively improves speech quality.
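The first stage is classical spectral subtraction: a noise magnitude estimated from pauses is subtracted from each frame's magnitude, with a floor to limit musical noise. A minimal sketch using SciPy's STFT; the frame length, floor, and sampling rate are illustrative choices, not the paper's settings.

```python
# Sketch: magnitude spectral subtraction with a pause-estimated noise spectrum.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, pause_audio, sr=22050, floor=0.02):
    """noisy: full noisy signal; pause_audio: speech-free segment."""
    _, _, Y = stft(noisy, fs=sr, nperseg=512)
    _, _, N = stft(pause_audio, fs=sr, nperseg=512)
    noise_mag = np.abs(N).mean(axis=1, keepdims=True)       # avg noise spectrum
    mag = np.maximum(np.abs(Y) - noise_mag, floor * np.abs(Y))
    _, clean = istft(mag * np.exp(1j * np.angle(Y)), fs=sr, nperseg=512)
    return clean
```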
Estimating noise-related parameters in unsupervised speech enhancement (SE) techniques is challenging in low-SNR and non-stationary noise environments. In recent SE approaches, the best results are achieved by partitioning noisy speech spectrograms into low-rank noise and sparse speech parts. However, several limitations reduce the performance of these SE methods: the overlap-add in the STFT process, the use of the noisy phase, inaccurate estimation of the low rank in nuclear norm minimization, and the Euclidean distance measure in the cost function. These aspects can cause a loss of information in the reconstructed signal compared to clean speech. To solve this, we propose a novel wavelet-based weighted low-rank sparse decomposition model for enhancing speech, incorporating a gammatone filterbank and the Kullback–Leibler divergence. The proposed framework differs from other strategies in that the SE is carried out entirely in the time domain without the need for noise estimation. Further, to assess the word error rate, the algorithms were trained and tested on a typical automatic speech recognition module. The experimental findings indicate that the proposed cascaded model shows significant improvement under low-SNR conditions over individual and traditional methods with regard to SDR, PESQ, STOI, SIG, BAK, and OVL.
In today's scientific epoch, speech is an important means of communication, and speech enhancement is necessary to increase the quality of speech that has been corrupted by noise. This work therefore proposes a new speech enhancement framework comprising (a) a training phase and (b) a testing phase. During the training phase, the input signal is passed to an STFT-based noise estimator and an NMF-based spectral estimator to compute the noise spectra and signal spectra, respectively. The obtained signal and noise spectra are then Wiener-filtered, after which empirical mode decomposition (EMD) is applied. Because the tuning factor of the Wiener filter is so important, it is computed for each signal by training a fuzzy wavelet neural network (FW-NN): a bark-frequency representation is computed from the denoised signal and fed to the FW-NN to identify the suitable tuning factor for each input signal. For optimal tuning of η, this work deploys the fitness-oriented elephant herding optimization (FO-EHO) algorithm. Additionally, an adaptive Wiener filter supplies EMD with the ideal tuning factor from the FW-NN, producing an improved speech signal. Finally, the superiority of the presented approach is demonstrated on various metrics.
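The role of the tuning factor can be seen in a parametric Wiener gain, where η trades noise suppression against speech distortion. Below is a minimal sketch of that trade-off; the FW-NN and FO-EHO machinery that selects η per signal is not reproduced, and the gain form is a common textbook parameterization rather than the paper's exact filter.

```python
# Sketch: parametric Wiener gain G = S / (S + eta * N).
# Larger eta suppresses more noise at the cost of more speech distortion.
import numpy as np

def parametric_wiener_gain(signal_psd, noise_psd, eta=1.0):
    """signal_psd, noise_psd: per-bin power spectral density estimates."""
    return signal_psd / (signal_psd + eta * noise_psd)
```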
In recent years, the field of speech enhancement has greatly benefited from the rapid development of neural networks. However, the requirement for large amounts of paired noisy and clean speech for training limits the widespread use of these models; wavelet-network-based speech enhancement typically relies on clean speech signals as the training target. This paper presents a new method that combines a neural network with wavelet theory for speech enhancement without needing clean speech signals as targets during training. Five widely used evaluation criteria, namely short-time objective intelligibility (STOI), signal-to-noise ratio (SNR), segmental signal-to-noise ratio (SNRseg), weighted spectral slope (WSS), and logarithmic spectral distance (LSD), are used to confirm the effectiveness of the proposed method. The results show that the proposed method performs similarly to a wavelet neural network (WNN) trained with clean signals, and is even superior to the clean-target-based strategies in some cases.
In this study, we introduce a novel approach to speech enhancement through the design of a complex temporal convolutional network (Complex-TCN). This model leverages the power of complex networks, enabling the simultaneous capture of both magnitude and phase information inherent in speech signals. By employing a temporal convolutional network, the Complex-TCN excels at extracting contextual information within the time domain of speech. Our findings underscore the substantial performance improvements achieved through the synergistic use of the temporal convolutional network and the incorporation of complex representations.
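A complex convolution of the kind a Complex-TCN builds on can be realized with two real convolutions following (a+ib)(c+id) = (ac-bd) + i(ad+bc). A minimal PyTorch sketch of such a layer; the layer sizes are illustrative, and the full dilated TCN stack of the paper is not reproduced.

```python
# Sketch: complex 1-D convolution via two real convolutions.
import torch.nn as nn

class ComplexConv1d(nn.Module):
    """(x_r + i x_i) * (W_r + i W_i)
       = (x_r*W_r - x_i*W_i) + i(x_r*W_i + x_i*W_r)"""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.real = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.imag = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x_r, x_i):
        # combine the four real convolutions into a complex output pair
        return (self.real(x_r) - self.imag(x_i),
                self.imag(x_r) + self.real(x_i))
```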
Speech signals are often corrupted by different noises, such as airport, station, and street noise. These noises degrade the quality of the speech signal, particularly in voice communication, automatic speech recognition, and speaker identification, so automatic speech enhancement is necessary. In this research work, a novel speech signal enhancement model is introduced with the assistance of deep learning. The proposed model includes three major phases: (a) pre-processing, (b) feature extraction, and (c) speech enhancement. In the pre-processing phase, framing is carried out using a Hanning window, whereby the input speech signals are decomposed into a series of overlapping frames. From these individual frames, multiple features, namely improved Mel-frequency cepstral coefficients (IMFCCs), fractional delta AMS, and the modified STFT (M-STFT), are extracted. Subsequently, in the speech enhancement phase, the noise is first estimated and removed. The denoised frames are used to determine the optimal mask for every frame of the noisy speech signal, and this mask is employed to train a Deep Convolutional Neural Network (DCNN). The reconstructed outputs from the DCNN form the enhanced speech signal. Finally, the proposed work (multi-features + DCNN-based speech enhancement) is validated against existing models on several measures, demonstrating its superiority.
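Mask-based training targets of this kind can be illustrated by the widely used ideal ratio mask, computed per time-frequency bin from clean-speech and noise spectra. The paper's "optimal mask" construction may differ, so treat this as a generic stand-in:

```python
# Sketch: ideal ratio mask, IRM = sqrt(|S|^2 / (|S|^2 + |N|^2)).
import numpy as np

def ideal_ratio_mask(speech_power, noise_power):
    """speech_power, noise_power: (freq x frames) power spectrograms."""
    return np.sqrt(speech_power / (speech_power + noise_power + 1e-12))
```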
An auditory model has been developed for an intelligent speech information acquisition system in real-world noisy environments. The mathematical model of the human auditory pathway consists of three components: nonlinear feature extraction from the cochlea to the auditory cortex, binaural processing at the superior olivary complex, and top-down attention from the higher brain to the cochlea. The feature extraction is based on information-theoretic sparse coding throughout the auditory pathway, and time-frequency masking is incorporated as a model of lateral inhibition in both the time and frequency domains. The binaural processing is modeled as blind signal separation and adaptive noise canceling based on independent component analysis with hundreds of time delays for noisy reverberated signals. Top-down (TD) attention derives from the familiarity and/or importance of the sensory information, i.e. the sound, and a simple but efficient TD attention model was developed based on the error backpropagation algorithm. The binaural processing and top-down attention are also combined for speech signals with heavy noise. Because this auditory model requires extensive computing, special hardware was developed for real-time applications. Experimental results demonstrate much better recognition performance in real-world noisy environments.
A particular quintet singular value decomposition (Quintet-SVD) is introduced in this paper via empirical mode decompositions (EMDs). The Quintet-SVD yields four specific orthogonal matrices together with a diagonal matrix of singular values. Furthermore, this paper shows the relationships between the Quintet-SVD and the traditional SVD, generalized low-rank approximations of matrices (GLRAM) of a single matrix, and EMDs. An application of the Quintet-SVD to speech enhancement is presented and compared with an application of the traditional SVD.
In this paper, we present a modified crosstalk-resistant adaptive noise canceller. The method proceeds in two steps. First, the signal-to-crosstalk ratio (SCR) is estimated, and when the estimated SCR is less than 6 dB, we reconstruct a new reference signal from the two original microphone signals. In the second step, the basic crosstalk-resistant adaptive noise canceller (CTRANC) is used to suppress noise and enhance speech. A comparative study with the basic CTRANC shows the superior performance of the new method.
A new algorithm for speech enhancement based on the wavelet shrinkage method is presented in this paper. First, the noisy speech is decomposed by a Bark-scaled wavelet packet decomposition (BS-WPD) to simulate human auditory characteristics. Then, a new thresholding algorithm is proposed that has many advantages over the soft and hard thresholding put forward by D.L. Donoho and I.M. Johnstone. Simulation results indicate that the new method is very useful and efficient for reducing white noise in speech, and that the new thresholding algorithm gives better SNR improvement than traditional thresholding algorithms.
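For reference, classical Donoho-Johnstone wavelet shrinkage soft-thresholds the detail coefficients with the universal threshold. The sketch below uses PyWavelets with a plain discrete wavelet transform, not the Bark-scaled packet decomposition or the paper's new thresholding rule; the wavelet and depth are arbitrary choices.

```python
# Sketch: wavelet shrinkage denoising with the universal threshold
#   sigma * sqrt(2 ln N), sigma estimated from the finest scale via MAD.
import numpy as np
import pywt

def wavelet_shrinkage_denoise(noisy, wavelet="db4", level=4):
    coeffs = pywt.wavedec(noisy, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # noise std estimate
    thresh = sigma * np.sqrt(2 * np.log(len(noisy)))
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)
```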