End-to-End Monaural Speech Separation with a Deep Complex U-Shaped Network
Abstract
Conventional time–frequency (TF) domain source separation methods mainly focus on predicting TF masks or speech spectra, among which the complex ideal ratio mask (cIRM) is an effective target for speech enhancement and separation. However, many recent studies employ real-valued networks, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to predict a complex-valued mask or spectrogram target, which leads to imbalanced training of the real and imaginary parts. In this paper, to estimate the complex-valued target more accurately, we propose a novel U-shaped complex network for complex signal approximation (uCSA). The uCSA is an adaptive front-end, time-domain separation method that tackles the monaural source separation problem in three ways. First, we design and implement a complex U-shaped network architecture comprising well-defined complex-valued encoder and decoder blocks, together with complex-valued bidirectional Long Short-Term Memory (BLSTM) layers, to perform complex-valued operations natively. Second, the cIRM serves as the training target of uCSA and is optimized via signal approximation (SA), which exploits both the real and imaginary components of the complex-valued spectrum. Third, we reformulate the STFT and inverse STFT as differentiable operations and train the model with the scale-invariant source-to-noise ratio (SI-SNR) loss, enabling end-to-end training of the speech source separation task. The proposed uCSA models are evaluated on the WSJ0-2mix dataset, a benchmark corpus widely used by supervised speech separation methods. Extensive experimental results indicate that the proposed method achieves state-of-the-art performance in terms of the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) metrics.
This paper was recommended by Regional Editor Tongquan Wei.
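The SI-SNR objective mentioned in the abstract projects the estimated waveform onto the reference signal and measures the energy ratio between that projection and the residual, making the loss invariant to the overall scale of the estimate. The following is a minimal NumPy sketch of the standard SI-SNR formulation (the paper's exact implementation details, e.g. epsilon handling and batching, are not given in the abstract and are assumed here):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant source-to-noise ratio in dB (higher is better).

    Standard formulation: zero-mean both signals, project the estimate
    onto the reference, and compare target vs. residual energy.
    The eps terms guarding divisions are an implementation assumption.
    """
    # Remove DC offset so the measure ignores constant bias.
    est = est - est.mean()
    ref = ref - ref.mean()
    # Scale-invariant target: orthogonal projection of est onto ref.
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    # Everything not explained by the reference counts as noise.
    e_noise = est - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection step, rescaling the estimate (e.g. halving its amplitude) leaves the SI-SNR unchanged, which is why the metric is preferred over plain SNR for end-to-end separation training; the negative SI-SNR is typically used as the loss.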