Special Issue: Selected Papers from the 19th IEEE International Symposium on Multimedia (ISM 2017); Guest Editors: Han C. W. Hsiao and Robert Mertens

Convolutional Nonlinear Differential Recurrent Neural Networks for Crowd Scene Understanding

Naifan Zhuang

Department of Computer Science, University of Central Florida, 4000 Central Florida Blvd., Orlando, FL 32816, USA

E-mail Address: zhuangnaifan@knights.ucf.edu

Corresponding author.


The Duc Kieu

Department of Computing and Information Technology, University of the West Indies, St. Augustine Campus, St. Augustine, Trinidad and Tobago, West Indies

E-mail Address: ktduc0323@yahoo.com.au


Jun Ye

Microsoft Corporation, 1 Microsoft Way, Redmond, WA 98052, USA

E-mail Address: yeju@microsoft.com


Kien A. Hua

Department of Computer Science, University of Central Florida, 4000 Central Florida Blvd., Orlando, FL 32816, USA

E-mail Address: kienhua@cs.ucf.edu


https://doi.org/10.1142/S1793351X18400196

Abstract

As crowd phenomena grow more common in the real world, crowd scene understanding is becoming an important task in anomaly detection and public security. Visual ambiguities and occlusions, high density, low mobility, and scene semantics, however, make the problem very challenging. In this paper, we propose an end-to-end deep architecture, convolutional nonlinear differential recurrent neural networks (CNDRNNs), for crowd scene understanding. A CNDRNN consists of a GoogLeNet Inception V3 convolutional neural network (CNN) and a nonlinear differential recurrent neural network (RNN). Unlike traditional non-end-to-end solutions, which separate feature extraction from parameter learning, CNDRNN uses a unified deep model to optimize the CNN and RNN parameters jointly, and thus has the potential to produce a more harmonious model. The proposed architecture takes sequential raw image data as input and does not rely on tracklet or trajectory detection. It therefore has clear advantages over traditional flow-based and trajectory-based methods, especially in challenging crowd scenarios of high density and low mobility. By combining a CNN and an RNN, CNDRNN can effectively analyze crowd semantics: the CNN models the semantic crowd scene information, while the nonlinear differential RNN models the motion information. The individual and increasing orders of derivative of states (DoS) in the differential RNN can progressively build up the ability of the long short-term memory (LSTM) gates to detect different levels of salient dynamical patterns, with deeper stacked layers modeling higher orders of DoS.
Lastly, existing LSTM-based crowd scene solutions explore deep temporal information and are described as “deep in time.” The proposed CNDRNN, by contrast, models spatial and temporal information in a unified architecture and is thus “deep in space and time.” Extensive performance studies on the Violent-Flows, CUHK Crowd, and NUS-HGA datasets show that the proposed technique significantly outperforms state-of-the-art methods.
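
As a rough illustration of the "increasing orders of DoS" idea, the sketch below approximates derivatives of a state sequence by repeated backward differences between consecutive time steps. This discretization is an assumption made here for illustration; the abstract does not specify how DoS is computed or how it enters the LSTM gates. The function name `derivative_of_states` and the toy `states` sequence are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def derivative_of_states(states: Sequence[Sequence[float]], order: int) -> List[Vector]:
    """Approximate the order-th derivative of states by backward differences.

    states[t] is a (hypothetical) internal state vector at time step t.
    Each differencing pass shortens the sequence by one time step.
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    current = [list(s) for s in states]
    for _ in range(order):
        # Difference every pair of consecutive state vectors, element-wise.
        current = [
            [b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(current, current[1:])
        ]
    return current


# Toy sequence: the first unit changes at a constant rate,
# the second unit changes at an increasing rate.
states = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [3.0, 9.0]]
first = derivative_of_states(states, 1)   # first-order DoS (rate of change)
second = derivative_of_states(states, 2)  # second-order DoS (change of the rate)
```

In this toy case the first-order signal is constant for the first unit and growing for the second, while the second-order signal is zero only for the first unit. That is the kind of distinction between levels of dynamics that higher-order DoS is meant to expose to deeper layers.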

Keywords:

