End-to-End Monaural Speech Separation with a Deep Complex U-Shaped Network
Abstract
Conventional time–frequency (TF) domain source separation methods mainly focus on predicting TF masks or speech spectra, among which the complex ideal ratio mask (cIRM) is an effective target for speech enhancement and separation. However, many recent studies employ real-valued networks, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to predict a complex-valued mask or spectrogram target, which leads to imbalanced training of the real and imaginary parts. In this paper, to estimate the complex-valued target more accurately, we propose a novel U-shaped complex network for complex signal approximation (uCSA). The uCSA is an adaptive front-end, time-domain separation method that tackles the monaural source separation problem in three ways. First, we design and implement a complex U-shaped network architecture comprising well-defined complex-valued encoder and decoder blocks, together with complex-valued bidirectional Long Short-Term Memory (BLSTM) layers, to perform complex-valued operations natively. Second, the cIRM serves as the training target of uCSA and is optimized via signal approximation (SA), which exploits both the real and imaginary components of the complex-valued spectrum. Third, we reformulate the STFT and inverse STFT as differentiable operations and train the model with the scale-invariant source-to-noise ratio (SI-SNR) loss, enabling end-to-end training of the speech source separation task. The proposed uCSA models are evaluated on the WSJ0-2mix dataset, a benchmark corpus widely used by supervised speech separation methods. Extensive experimental results indicate that the proposed method achieves state-of-the-art performance in terms of the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) metrics.
This paper was recommended by Regional Editor Tongquan Wei.
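The SI-SNR objective mentioned in the abstract projects the estimated waveform onto the reference signal and measures the energy ratio between that projection and the residual, making the loss invariant to the overall scale of the estimate. The following is a minimal NumPy sketch of the standard SI-SNR formulation (the paper's exact implementation details, e.g. epsilon handling and batching, are not given in the abstract and are assumed here):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant source-to-noise ratio in dB (higher is better).

    Standard formulation: zero-mean both signals, project the estimate
    onto the reference, and compare target vs. residual energy.
    The eps terms guarding divisions are an implementation assumption.
    """
    # Remove DC offset so the measure ignores constant bias.
    est = est - est.mean()
    ref = ref - ref.mean()
    # Scale-invariant target: orthogonal projection of est onto ref.
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    # Everything not explained by the reference counts as noise.
    e_noise = est - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection step, rescaling the estimate (e.g. halving its amplitude) leaves the SI-SNR unchanged, which is why the metric is preferred over plain SNR for end-to-end separation training; the negative SI-SNR is typically used as the loss.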