High Performance Kernel Architecture for Convolutional Neural Network Acceleration

Department of Electronics and Communications, Indian Institute of Information Technology Guwahati, Guwahati 781015, Assam, India

E-mail Address: anakhi22@gmail.com

Soumyajit Poddar

https://orcid.org/0000-0002-3476-2199

Department of Electronics and Communications, Indian Institute of Information Technology Guwahati, Guwahati 781015, Assam, India

E-mail Address: poddar18@gmail.com

Corresponding author.

Hafizur Rahaman

Indian Institute of Engineering Science and Technology, Shibpur 711103, West Bengal, India

E-mail Address: rahaman_h@yahoo.co.in

https://doi.org/10.1142/S0218126621502662

Cited by: 3 (Source: Crossref)

Abstract

Convolutional neural networks (CNNs) have emerged as a prominent choice in artificial intelligence tasks. Recent advancements in CNN designs have greatly improved the performance and energy efficiency of several computation-intensive applications. However, in real-time applications, greater CNN accuracy is attained at the expense of very high computational cost and complexity. Further, implementing real-time CNNs on embedded platforms is highly challenging due to resource and power constraints. This paper addresses this computational complexity and presents an accelerator architecture accompanied by a novel kernel design to improve overall CNN performance. The proposed kernel design introduces a computing mechanism that reduces the data movement cost, measured in computational cycle count (latency), by parallelizing the convolution processing elements. The architecture takes advantage of the overlap of spatially adjacent data. Its performance is also analyzed for multiple hyper-parameter configurations. The proposed accelerator achieves an average $16 \times$ reduction in execution time compared with the conventional computing architecture. To analyze the proposed architecture's performance, we validate it with the AlexNet and VGG-16 CNN models. The proposed accelerator architecture achieves an average $1.7 \times$ throughput improvement over state-of-the-art accelerators.
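To make the remark about overlapping, spatially adjacent data concrete, the following minimal sketch counts how many input elements a single 2-D convolution pass reads with and without reusing the columns shared by neighbouring windows. It is an illustration only and not the kernel design proposed in the paper: the function name `window_reads`, the row-wise reuse policy, and the no-padding assumption are all hypothetical choices made for this example.

```python
def window_reads(height, width, k, stride=1):
    """Count input-element reads for one 2-D convolution pass.

    Assumptions (illustrative, not taken from the paper): a single
    height x width feature map, a k x k kernel, no padding, and reuse
    only between horizontally adjacent windows in the same output row.
    Returns (naive_reads, reads_with_reuse).
    """
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1

    # Naive scheme: every output position refetches its full k x k window.
    naive = out_h * out_w * k * k

    # Reuse scheme: the first window of each output row loads k*k elements;
    # each later window loads only the columns it does not share with its
    # left neighbour (min(stride, k) columns of k elements each).
    new_cols = min(stride, k)
    reuse = out_h * (k * k + (out_w - 1) * k * new_cols)
    return naive, reuse


if __name__ == "__main__":
    # 5x5 input, 3x3 kernel, stride 1: adjacent windows share two columns.
    print(window_reads(5, 5, 3))  # (81, 45)
```

When the stride equals the kernel size, adjacent windows do not overlap and both counts coincide, so any saving from this kind of reuse comes entirely from the overlap between neighbouring windows.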

This paper was recommended by Regional Editor Emre Salman.
