End-to-End Monaural Speech Separation with a Deep Complex U-Shaped Network

    https://doi.org/10.1142/S0218126622500281

    Conventional time–frequency (TF) domain source separation methods mainly focus on predicting TF masks or speech spectra, among which the complex ideal ratio mask (cIRM) is an effective target for speech enhancement and separation. However, some recent studies employ a real-valued network, such as a standard convolutional neural network (CNN) or recurrent neural network (RNN), to predict a complex-valued mask or spectrogram target, which leads to unbalanced training of the real and imaginary parts. In this paper, to estimate the complex-valued target more accurately, we propose a novel U-shaped complex network for complex signal approximation (uCSA). uCSA is a time-domain separation method with an adaptive front-end, and it tackles the monaural source separation problem in three ways. First, we design and implement a complex U-shaped network architecture comprising well-defined complex-valued encoder and decoder blocks, together with complex-valued bidirectional long short-term memory (BLSTM) layers, to carry out complex-valued operations. Second, the cIRM is the training target of our uCSA method, optimized via signal approximation (SA), which exploits both the real and imaginary components of the complex-valued spectrum. Third, we reformulate the STFT and inverse STFT as differentiable operations, and the model is trained with the scale-invariant source-to-noise ratio (SI-SNR) loss, enabling end-to-end training of the speech source separation task. The proposed uCSA models are evaluated on the WSJ0-2mix dataset, a standard corpus used by many supervised speech separation methods. Extensive experimental results indicate that our proposed method achieves state-of-the-art performance in terms of the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) metrics.
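
    As a concrete illustration of two ingredients the abstract mentions, the sketch below shows (i) how a complex-valued layer can be realized with real-valued arithmetic, following the standard rule (W_r + jW_i)(x_r + jx_i) = (W_r x_r - W_i x_i) + j(W_r x_i + W_i x_r) from deep complex networks, and (ii) the SI-SNR metric whose negative is commonly used as the training loss. This is a minimal NumPy sketch based on those published definitions, not the authors' implementation; all function and variable names are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of two building blocks
# referenced in the abstract: a complex-valued linear operation expressed
# with real-valued matrix multiplications, and the SI-SNR objective.
import numpy as np

def complex_linear(x_r, x_i, W_r, W_i):
    """(W_r + jW_i)(x_r + jx_i): real and imaginary parts via real matmuls."""
    out_r = x_r @ W_r - x_i @ W_i   # real part
    out_i = x_r @ W_i + x_i @ W_r   # imaginary part
    return out_r, out_i

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant source-to-noise ratio in dB for 1-D waveforms."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

rng = np.random.default_rng(0)

# Example: one complex-valued "layer" applied to a batch of two feature vectors.
x_r, x_i = rng.standard_normal((2, 4)), rng.standard_normal((2, 4))
W_r, W_i = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
y_r, y_i = complex_linear(x_r, x_i, W_r, W_i)

# Example: SI-SNR of a noisy estimate against a clean reference waveform.
ref = rng.standard_normal(16000)               # 1 s of "speech" at 16 kHz
est = ref + 0.1 * rng.standard_normal(16000)   # estimate with additive noise
print(f"SI-SNR: {si_snr(est, ref):.2f} dB")    # higher is better; loss = -SI-SNR
```

    In an end-to-end setup of the kind the abstract describes, the negative of this SI-SNR would be back-propagated through a differentiable inverse STFT into the complex-valued network; the sketch only fixes the two definitions, not that training pipeline.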

    This paper was recommended by Regional Editor Tongquan Wei.
