Convolutional recurrent neural networks with multi-sized convolution filters for sound-event recognition

Feizhen Huang

http://orcid.org/0000-0002-2742-6664

School of Physics and Optoelectronics, Xiang Tan University, Xiangtan, Hunan Province 411105, China

Jinfang Zeng

School of Physics and Optoelectronics, Xiang Tan University, Xiangtan, Hunan Province 411105, China

E-mail Address: jinfangzengx@126.com

Corresponding author.

Yu Zhang

School of Physics and Optoelectronics, Xiang Tan University, Xiangtan, Hunan Province 411105, China

Wentao Xu

School of Physics and Optoelectronics, Xiang Tan University, Xiangtan, Hunan Province 411105, China

https://doi.org/10.1142/S0217984920502358

Abstract

Sound-event recognition often uses time-frequency analysis to produce an image-like spectrogram, which provides a rich visual representation of the original signal in time and frequency. Convolutional neural networks (CNNs), which can learn discriminative spectrogram patterns, are well suited to sound-event recognition. However, relatively little work has addressed how a CNN can make full use of the important temporal information. In this paper, we propose MCRNN, a convolutional recurrent neural network (CRNN) architecture for sound-event recognition; the letter “M” in the name denotes the multi-sized convolution filters. Richer features are extracted by using several different convolution filter sizes at the last convolution layer. In addition, cochleagram images, rather than the traditional spectrogram images of a sound signal, are used as the network input. Experiments on the RWCP dataset show that the proposed method achieves a recognition rate of 98.4% in clean conditions and robustly outperforms existing methods: the recognition rate increases by 0.9%, 1.9% and 10.3% at signal-to-noise ratios (SNR) of 20 dB, 10 dB and 0 dB, respectively.
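The idea of applying several filter sizes to the same feature map and handing the result to a recurrent layer can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the filter sizes, the number of filters, the layer layout or the recurrent unit, so the kernel sizes (1, 3, 5), the averaging weights, the toy input and the helper names below are all assumptions chosen only for illustration. Plain Python lists are used so that the example runs without any third-party library.

```python
def conv2d_same(image, kernel):
    """Cross-correlate a 2-D list with an odd-sized square kernel.

    Zero padding keeps the output the same shape as the input, so maps
    produced by different kernel sizes can be stacked afterwards.
    """
    k = len(kernel)
    pad = k // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def multi_size_features(image, kernels):
    """Apply every kernel to the same input and stack the maps as channels."""
    return [conv2d_same(image, kernel) for kernel in kernels]


def to_time_sequence(feature_maps):
    """Turn (channel, frequency, time) maps into one feature vector per time step.

    A recurrent layer would consume this sequence step by step.
    Treating the second image axis as time is an assumption of this sketch.
    """
    n_freq = len(feature_maps[0])
    n_time = len(feature_maps[0][0])
    return [
        [fmap[f][t] for fmap in feature_maps for f in range(n_freq)]
        for t in range(n_time)
    ]


def mean_kernel(size):
    """Hypothetical fixed kernel; a trained network would learn these weights."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


# Toy "cochleagram": 4 frequency bands by 6 time frames (made-up values).
image = [[float(f + t) for t in range(6)] for f in range(4)]

# Assumed filter sizes 1x1, 3x3 and 5x5; the paper's sizes are not given here.
kernels = [mean_kernel(1), mean_kernel(3), mean_kernel(5)]

maps = multi_size_features(image, kernels)
sequence = to_time_sequence(maps)

print(len(maps), len(maps[0]), len(maps[0][0]))   # channels, bands, frames
print(len(sequence), len(sequence[0]))            # time steps, features per step
```

Each kernel size sees a different amount of time-frequency context around the same position, which is one plausible reading of "richer features"; the recurrent part is left out because the abstract gives no detail about it.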
