Toward Domain Adaptive Learning-Based Variation Autoencoder Emotional Analysis in English Teaching

Henan Finance University, Zhengzhou 450000, Henan, P. R. China

https://doi.org/10.1142/S0218213024400062

This article is part of the issue:

Special Issue on Heterogeneous Domain Adaptation Learning and its Applications in Intelligent System
Guest Editors: Fa Zhu, Xiaochun Cheng and Sotiris Kotsiantis

Abstract

Speech emotion analysis plays an important role in English teaching: by analyzing students' reading state, it lets teachers dynamically adjust teaching content according to students' emotional feedback and improve teaching quality. However, unstable student emotions and background noise limit the accuracy of speech emotion recognition. Although multimodal data can compensate for the weaknesses of a single modality, collecting and annotating multimodal samples is resource-intensive. To address this issue, this paper proposes a multimodal sentiment analysis framework based on domain adaptive learning to assist English teaching. We construct a multi-task variational autoencoder framework that performs reconstruction and classification simultaneously. To improve speech emotion recognition performance, we introduce domain adaptive learning based on the Wasserstein distance between the variational hidden layers of the video domain (source domain) and the speech domain (target domain). To validate the proposed model, we conducted extensive comparative experiments on two public datasets and a self-built English oral dataset. The experimental results indicate that the domain adaptation mechanism effectively improves recognition performance in the target domain. On the self-built English teaching dataset, the proposed model outperforms the other deep models compared.
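
The abstract does not give the loss function, so the following is only a minimal sketch of how a multi-task variational autoencoder objective with a Wasserstein alignment term might be assembled. It assumes diagonal Gaussian posteriors, paired video (source) and speech (target) samples, squared-error reconstruction, and the closed-form squared 2-Wasserstein distance between diagonal Gaussians. All function names and the weights `beta`, `gamma`, and `lam` are hypothetical and are not taken from the paper.

```python
import numpy as np


def w2_squared_diag_gaussians(mu_s, logvar_s, mu_t, logvar_t):
    """Squared 2-Wasserstein distance between diagonal Gaussians.

    For N(mu_s, diag(sigma_s^2)) and N(mu_t, diag(sigma_t^2)) the closed form
    is ||mu_s - mu_t||^2 + ||sigma_s - sigma_t||^2 (computed per sample).
    """
    sigma_s = np.exp(0.5 * logvar_s)
    sigma_t = np.exp(0.5 * logvar_t)
    mean_term = np.sum((mu_s - mu_t) ** 2, axis=-1)
    scale_term = np.sum((sigma_s - sigma_t) ** 2, axis=-1)
    return mean_term + scale_term


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), the usual VAE regularizer."""
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=-1)


def cross_entropy(logits, labels):
    """Per-sample softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]


def multitask_loss(x, x_recon, mu_t, logvar_t, logits, labels,
                   mu_s, logvar_s, beta=1.0, gamma=1.0, lam=0.1):
    """Hypothetical combined objective for the target (speech) branch.

    reconstruction + beta * KL + gamma * classification + lam * alignment,
    where the alignment term pulls the speech posterior toward the video
    posterior of the paired sample. The weights are illustrative assumptions.
    """
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    kl = kl_to_standard_normal(mu_t, logvar_t)
    ce = cross_entropy(logits, labels)
    w2 = w2_squared_diag_gaussians(mu_s, logvar_s, mu_t, logvar_t)
    return float(np.mean(recon + beta * kl + gamma * ce + lam * w2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, feat_dim, latent_dim, n_classes = 8, 40, 16, 4
    x = rng.normal(size=(batch, feat_dim))              # speech features
    x_recon = x + 0.1 * rng.normal(size=x.shape)        # stand-in decoder output
    mu_t, logvar_t = rng.normal(size=(2, batch, latent_dim))
    mu_s, logvar_s = rng.normal(size=(2, batch, latent_dim))
    logits = rng.normal(size=(batch, n_classes))
    labels = rng.integers(0, n_classes, size=batch)
    print(multitask_loss(x, x_recon, mu_t, logvar_t, logits, labels,
                         mu_s, logvar_s))
```

In a real model the means, log-variances, reconstructions, and logits would come from trained encoder, decoder, and classifier networks; here they are random arrays so the sketch runs on its own. The closed-form distance is one convenient choice for Gaussian latents, and the paper may estimate the Wasserstein distance differently.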
