Biometrics

Head Pose Estimation Based on Multi-Level Feature Fusion

School of Physics and Electronic, Northwest Normal University, Lanzhou 730070, P. R. China

Engineering Research Center of Gansu Province for Intelligent Information Technology and Application, Lanzhou 730070, P. R. China

E-mail Address: yancm2022@163.com

Corresponding author.


and

Xiao Zhang

https://orcid.org/0009-0007-9982-1886

School of Physics and Electronic, Northwest Normal University, Lanzhou 730070, P. R. China

E-mail Address: 2022222483@nwnu.edu.cn


https://doi.org/10.1142/S0218001424560020

Abstract

Head Pose Estimation (HPE) has a wide range of applications in computer vision but still faces several challenges: (1) existing studies commonly use Euler angles or quaternions as pose labels, which can lead to discontinuity problems; (2) existing HPE methods do not effectively address regression via rotation matrices; and (3) recognition rates are low in complex scenes while computational requirements are high. This paper presents an improved unconstrained HPE model to address these challenges. First, a rotation matrix form is introduced to resolve the ambiguity of rotation labels. Second, a continuous 6D rotation matrix representation is used for efficient and robust direct regression. The lightweight RepVGG-A2 framework is used for feature extraction, and a multi-level feature fusion module and a coordinate attention mechanism with a residual connection are added to improve the network’s ability to perceive contextual information and attend to relevant features. The model’s accuracy is further improved by replacing the network activation function and refining the loss function. In experiments on the BIWI dataset with a 7:3 split between training and test sets, the proposed network achieves a mean absolute error of 2.41. When trained on the 300W_LP dataset and tested on the AFLW2000 and BIWI datasets, it achieves mean absolute errors of 4.34 and 3.93, respectively. The experimental results demonstrate that the improved network has better HPE performance.
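To make the continuous 6D rotation representation mentioned in the abstract concrete, the following is a minimal sketch of the commonly used mapping from a 6D vector to a rotation matrix by Gram-Schmidt orthogonalization. It is not the authors' implementation: the function names `rotation_from_6d` and `sixd_from_rotation` are hypothetical, NumPy is an assumed dependency, and the convention of storing the first two matrix columns as the 6D vector is one common choice among several.

```python
import numpy as np


def rotation_from_6d(x):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt.

    Assumption: the 6D vector holds two (not necessarily orthonormal)
    3D vectors that become the first two columns of the rotation.
    """
    x = np.asarray(x, dtype=float)
    a1, a2 = x[:3], x[3:]
    # First basis vector: normalize a1.
    b1 = a1 / np.linalg.norm(a1)
    # Second basis vector: remove the component of a2 along b1, then normalize.
    u2 = a2 - np.dot(b1, a2) * b1
    b2 = u2 / np.linalg.norm(u2)
    # Third basis vector: the cross product completes a right-handed frame.
    b3 = np.cross(b1, b2)
    # Stack the basis vectors as columns of the rotation matrix.
    return np.stack([b1, b2, b3], axis=1)


def sixd_from_rotation(R):
    """Recover the 6D vector by concatenating the first two columns of R."""
    R = np.asarray(R, dtype=float)
    return np.concatenate([R[:, 0], R[:, 1]])


if __name__ == "__main__":
    # Any 6D vector with linearly independent halves yields a valid rotation.
    R = rotation_from_6d([1.0, 2.0, 0.5, -0.3, 0.7, 1.1])
    print(np.round(R.T @ R, 6))  # numerically the identity matrix
    print(np.linalg.det(R))      # numerically 1
```

Because every raw 6D network output with linearly independent halves maps to a valid rotation, a network can regress the six values directly without the wrap-around discontinuities that affect Euler angles.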

