CNN Accelerator Using Proposed Diagonal Cyclic Array for Minimizing Memory Accesses

Son, Hyun-Wook; Al-Hamid, Ali A.; Na, Yong-Seok; Lee, Dong-Yeong; Kim, Hyung-Won

doi:10.32604/cmc.2023.038760

by Hyun-Wook Son¹, Ali A. Al-Hamid¹,², Yong-Seok Na¹, Dong-Yeong Lee¹, Hyung-Won Kim¹,*

1 Department of Electronics, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju, 28644, Korea
2 Department of Electrical Engineering, College of Engineering, Al-Azhar University, Cairo, 11651, Egypt

* Corresponding Author: Hyung-Won Kim. Email: email

Computers, Materials & Continua 2023, 76(2), 1665-1687. https://doi.org/10.32604/cmc.2023.038760

Received 28 December 2022; Accepted 12 April 2023; Issue published 30 August 2023

Abstract

This paper presents the architecture of a Convolutional Neural Network (CNN) accelerator based on a new processing element (PE) array called a diagonal cyclic array (DCA). The DCA significantly reduces the burden of repeated memory accesses for the feature data and weight parameters of CNN models, which maximizes the data reuse rate and improves the computation speed. Furthermore, an integrated computation architecture has been implemented for max-pooling and the activation function after the convolution calculation, reducing the hardware resources. To evaluate the effectiveness of the proposed architecture, a CNN accelerator has been implemented for You Only Look Once version 2 (YOLOv2)-Tiny, which consists of 9 layers. A methodology to optimize the local buffer size with little sacrifice of inference speed is also presented. We implemented the proposed CNN accelerator using a Xilinx Zynq ZCU102 Ultrascale+ Field Programmable Gate Array (FPGA) and ISE Design Suite. The FPGA implementation uses 34,336 Look Up Tables (LUTs), 576 Digital Signal Processing (DSP) blocks, and an on-chip memory of only 58 KB. Using quantized 16-bit and 8-bit full-integer data, it achieves mean Average Precision at an intersection over union threshold of 0.5 (mAP@0.5) of 57.92% and 56.42%, respectively, with a reported loss of only 0.68% for the 8-bit version, and computation times of 137.9 and 69 ms per input image at a clock speed of 200 MHz. These speeds are expected to improve about fivefold at a clock speed of 1 GHz if the design is implemented in a silicon System on Chip (SoC) using a sub-micron process.
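The clock-scaling estimate at the end of the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes that the cycle count per image stays the same when the clock rises from 200 MHz to 1 GHz (that is, memory bandwidth keeps pace with the clock), and the function and variable names are hypothetical.

```python
# Rough projection of per-image latency under a faster clock.
# Assumption (not from the paper): the number of cycles per image is
# unchanged, so latency scales inversely with clock frequency.

def scaled_latency_ms(latency_ms: float, f_base_mhz: float, f_target_mhz: float) -> float:
    """Return the latency after a clock change, assuming a fixed cycle count."""
    return latency_ms * f_base_mhz / f_target_mhz

FPGA_CLOCK_MHZ = 200.0   # clock reported for the FPGA implementation
SOC_CLOCK_MHZ = 1000.0   # clock projected for a silicon SoC

# Per-image computation times reported in the abstract (milliseconds).
REPORTED_MS = {"16-bit": 137.9, "8-bit": 69.0}

projected = {
    name: scaled_latency_ms(ms, FPGA_CLOCK_MHZ, SOC_CLOCK_MHZ)
    for name, ms in REPORTED_MS.items()
}

for name, ms in projected.items():
    print(f"{name}: {ms:.2f} ms per image, about {1000.0 / ms:.1f} images/s")
```

Under this assumption the speedup equals the clock ratio, 1000 / 200 = 5. A real SoC could fall short of this figure if external memory bandwidth does not scale with the clock.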

Keywords

CNN; accelerator; systolic array; memory optimization; YOLOv2-tiny; mAP@0.5

Cite This Article

APA Style

Son, H., Al-Hamid, A.A., Na, Y., Lee, D., Kim, H. (2023). CNN accelerator using proposed diagonal cyclic array for minimizing memory accesses. Computers, Materials & Continua, 76(2), 1665-1687. https://doi.org/10.32604/cmc.2023.038760

Vancouver Style

Son H, Al-Hamid AA, Na Y, Lee D, Kim H. CNN accelerator using proposed diagonal cyclic array for minimizing memory accesses. Comput Mater Contin. 2023;76(2):1665-1687. https://doi.org/10.32604/cmc.2023.038760

IEEE Style

H. Son, A. A. Al-Hamid, Y. Na, D. Lee, and H. Kim, “CNN Accelerator Using Proposed Diagonal Cyclic Array for Minimizing Memory Accesses,” Comput. Mater. Contin., vol. 76, no. 2, pp. 1665-1687, 2023. https://doi.org/10.32604/cmc.2023.038760

Copyright © 2023 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
