The paper emphasizes methods, architectures, and components for system-on-chip design. It describes the basic knowledge and skills needed to design high-performance, low-power embedded devices, whose complexity increases exponentially, as does the effort of designing them. These complexities can be mastered by relying on an appropriate design methodology that concentrates on reuse, executable specifications, and early error detection. The paper bundles these topics in order to provide a good understanding of all the problems involved. It shows how to go from description and verification to implementation and testing, presenting three systems-on-chip for three different wireless applications based on configurable processors and custom hardware accelerators.
A factor graph (FG) can be considered as a unified model combining a Bayesian network (BN) and a Markov random field (MRF). The inference mechanism of an FG can be used to perform reasoning under incompleteness and uncertainty, which is a challenging task in many intelligent systems and robotics. Unfortunately, a complete inference mechanism requires intensive computations that introduce a long delay before the reasoning process completes. Furthermore, an energy-constrained system such as a mobile robot requires a very efficient inference process. In this paper, we present an embedded FG inference engine that employs a neural-inspired discretization mechanism. The engine runs on a system-on-chip (SoC) and is accelerated by its FPGA. We optimized our design to balance the trade-off between speed and hardware resource utilization. Our fully optimized design runs the inference process eight times faster than normal execution, which is twice the speed-up gain achieved by a parallelized FG running on a PC. The experiments demonstrate that our design can be extended into an efficient reconfigurable computing machine.
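To make the inference step concrete, here is a minimal sketch of sum-product message passing on a toy factor graph with two binary variables. The factor tables, variable names, and function names are hypothetical illustrations, and the sketch does not model the neural-inspired discretization or the FPGA acceleration described in the abstract.

```python
from itertools import product

# Hypothetical factor tables for two binary variables A and B (illustrative values only).
F_A = [0.6, 0.4]                 # unary factor on A
F_B = [0.5, 0.5]                 # unary factor on B
F_AB = [[0.9, 0.1], [0.2, 0.8]]  # pairwise factor, indexed as F_AB[a][b]


def normalize(values):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def marginal_b_sum_product():
    """Marginal of B obtained by passing messages A -> F_AB -> B."""
    msg_a_to_fab = F_A  # A is a leaf, so it forwards its unary factor unchanged
    msg_fab_to_b = [
        sum(F_AB[a][b] * msg_a_to_fab[a] for a in range(2)) for b in range(2)
    ]
    return normalize([F_B[b] * msg_fab_to_b[b] for b in range(2)])


def marginal_b_brute_force():
    """Reference result: sum the full joint distribution over A."""
    scores = [0.0, 0.0]
    for a, b in product(range(2), repeat=2):
        scores[b] += F_A[a] * F_B[b] * F_AB[a][b]
    return normalize(scores)
```

On a tree-structured graph such as this one, the message-passing result matches the brute-force marginal while avoiding enumeration of the full joint, which is where the computational savings come from as the graph grows.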
Extracted features are widely used for image processing. Many research endeavors have been undertaken to extract significant features from fast-moving images. Appropriate algorithm processing is necessary to extract features and provide them to other modules in real time on low-cost embedded systems. The features from accelerated segment test (FAST) algorithm is renowned for feature extraction. FAST is composed of simple arithmetic operators. In this study, FAST is employed to implement the hardware accelerator in a field-programmable gate array for small embedded systems. However, the threshold value in FAST affects the number of extracted features and the execution time. The resulting unpredictable execution time makes it difficult for the system to schedule the timing of system functions and thus degrades the performance. An appropriate method is necessary to stabilize the execution time. A dynamic threshold controller in a FAST hardware accelerator is thus proposed to enable a stable execution time. A proportional integral controller composed of an adder, subtractor, and shifter is applied for low design implementation costs. The proposed approach occupies 2,263 slice flip-flops, 3,498 look-up tables, and 17 block RAMs in a Xilinx Virtex 5 FX field-programmable gate array. It requires 3.87 ms for continuous 800×480 images from the KITTI benchmark.
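A proportional integral controller built only from an adder, a subtractor, and a shifter can be sketched as follows. This is an illustration under stated assumptions, not the authors' design: it assumes the controlled quantity is the per-frame feature count (as a proxy for execution time), that gains are powers of two realized as right shifts, and that the threshold is clamped to an 8-bit range. The names `target`, `kp_shift`, and `ki_shift` are hypothetical.

```python
def make_pi_threshold_controller(target, kp_shift=4, ki_shift=6, t_min=1, t_max=255):
    """Return a step function that updates the FAST threshold once per frame.

    Gains are 2**-kp_shift and 2**-ki_shift (assumed values), so no multiplier is needed.
    """
    state = {"integral": 0}

    def step(threshold, feature_count):
        error = feature_count - target   # subtractor
        state["integral"] += error       # adder used as an accumulator
        # Shifters apply the gains; Python's >> is an arithmetic shift for negatives.
        delta = (error >> kp_shift) + (state["integral"] >> ki_shift)
        # Too many features raises the threshold, too few lowers it.
        return max(t_min, min(t_max, threshold + delta))

    return step
```

A higher FAST threshold yields fewer detected corners, so driving the feature count toward a fixed target tends to keep the per-frame workload steady.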
In wearable devices, power consumption is a serious issue since the devices must remain powered on at all times. In healthcare systems, a variety of signal processing operations occupy a large portion of the overall workload because they are periodic and computationally heavy. In this paper, we propose a low-power System on Chip (SoC) architecture for wearable healthcare devices. In order to reduce the power consumption of the processor, we design a hardware accelerator that handles signal processing and provides computation offloading. Furthermore, to minimize the area and maximize the performance of the accelerator, we optimize the operation bit-width by analyzing the frequency response. The low-power healthcare SoC was fabricated in a 0.11 μm CMOS process. Finally, we measured the power consumption of our chip and verified the applicability of the digital filter accelerator, which reduces the energy consumption of the embedded processor.
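One way to choose a bit-width from the frequency response is to quantize the filter coefficients at increasing precision and stop at the first width whose magnitude response stays within a tolerance of the full-precision response. The sketch below assumes an FIR filter and fixed-point coefficients with a given number of fractional bits; the abstract states neither, and the function names and the tolerance criterion are hypothetical.

```python
import cmath
import math


def magnitude_response(coeffs, num_points=64):
    """Magnitude of the FIR frequency response sampled on [0, pi)."""
    return [
        abs(sum(c * cmath.exp(-1j * math.pi * k / num_points * n)
                for n, c in enumerate(coeffs)))
        for k in range(num_points)
    ]


def quantize(coeffs, frac_bits):
    """Round each coefficient to a fixed-point value with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return [round(c * scale) / scale for c in coeffs]


def max_response_error(coeffs, frac_bits, num_points=64):
    """Largest magnitude deviation caused by coefficient quantization."""
    ref = magnitude_response(coeffs, num_points)
    quant = magnitude_response(quantize(coeffs, frac_bits), num_points)
    return max(abs(r - q) for r, q in zip(ref, quant))


def smallest_fractional_bits(coeffs, tolerance, max_bits=24):
    """First bit-width whose response error is within tolerance, or None."""
    for bits in range(1, max_bits + 1):
        if max_response_error(coeffs, bits) <= tolerance:
            return bits
    return None
```

For example, the three-tap filter `[0.25, 0.5, 0.25]` is exactly representable with two fractional bits, so `smallest_fractional_bits` returns 2 for any small tolerance.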
Convolutional neural networks (CNNs) have emerged as a prominent choice in artificial intelligence tasks. Recent advancements in CNN designs have greatly improved the performance and energy efficiency of several computation-intensive applications. However, in real-time applications, greater CNN accuracy is attained at the expense of very high computational cost and complexity. Further, the implementation of real-time CNNs on embedded platforms is highly challenging due to resource and power constraints. This paper addresses this computational complexity and presents an accelerator architecture accompanied by a novel kernel design to improve overall CNN performance. The proposed kernel design introduces a computing mechanism that reduces the data movement cost in terms of computational cycle count (latency) by parallelizing the convolution processing elements. This architecture takes advantage of the overlap of spatially adjacent data. The performance of the proposed architecture is also analyzed for multiple hyper-parameter configurations. The proposed accelerator reduces execution time by an average of 16× compared with the conventional computing architecture. To analyze the proposed architecture’s performance, we validate the architecture with AlexNet and VGG-16 CNN models. The proposed accelerator architecture achieves an average of 1.7× throughput improvement over state-of-the-art accelerators.
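The benefit of exploiting the overlap between spatially adjacent windows can be illustrated by counting input fetches. The sketch below is a software model under simplifying assumptions (single channel, stride 1, reuse only along a row); it is not the paper's kernel design, and the function names are hypothetical.

```python
from collections import deque


def conv_naive(image, kernel):
    """Valid 2D convolution (CNN-style, no kernel flip); each window is re-read in full."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    loads, out = 0, []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
                    loads += 1  # one input fetch per multiply
            row.append(acc)
        out.append(row)
    return out, loads


def conv_with_column_reuse(image, kernel):
    """Same result, but horizontally adjacent windows share k - 1 buffered columns."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    loads, out = 0, []
    for r in range(out_h):
        window = deque()  # holds k columns of k pixels each
        for j in range(k):
            window.append([image[r + i][j] for i in range(k)])
            loads += k
        row = []
        for c in range(out_w):
            if c > 0:
                window.popleft()  # drop the column that left the window
                window.append([image[r + i][c + k - 1] for i in range(k)])
                loads += k  # only the newly exposed column is fetched
            row.append(sum(window[j][i] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out, loads
```

For a 5×5 input and a 3×3 kernel, the naive version performs 81 input fetches while the column-reuse version performs 45 and produces the same output.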
Noncoding RNAs (ncRNAs) have important functional roles in biological processes and have become a central research interest in modern molecular biology. However, how to find ncRNAs attracts considerable attention, since ncRNA gene sequences, unlike protein-coding genes, do not have strong statistical signals. QRNA is a powerful program that is currently widely used as an efficient analysis tool to detect ncRNA genes. Unfortunately, the O(L^3) computing requirements and complicated data dependencies greatly limit the usefulness of the QRNA package given the explosive growth of gene databases. In this paper, we present a fine-grained parallel QRNA prototype system, FPQRNA, for accelerating the ncRNA gene detection application on an FPGA chip. We propose a systolic-like array architecture with multiple PEs (Processing Elements). We partition the tasks by columns and assign tasks to PEs for load balance. We exploit data reuse schemes to reduce the need to load matrices from external memory. The experimental results show a speedup factor of more than 18× over the QRNA 2.0.3c software running on a PC platform with an AMD Phenom 9650 Quad CPU for pairwise sequence alignment with 996 residues, while the power consumption of our FPGA accelerator is only about 30% of that of the general-purpose microprocessor.
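Column-wise partitioning for load balance can be illustrated with a small model. The abstract does not say how columns are mapped to PEs, so the sketch below makes two assumptions: the dynamic-programming matrix is upper triangular with cell (i, j) costing j - i + 1 operations, and columns are dealt to PEs cyclically. A contiguous block assignment is included for comparison. All names are hypothetical.

```python
def column_work(length):
    """Assumed cost of column j: sum of (j - i + 1) over rows i = 0..j."""
    return [(j + 1) * (j + 2) // 2 for j in range(length)]


def assign_cyclic(length, num_pes):
    """Deal columns to PEs round-robin: PE p gets columns p, p + P, p + 2P, ..."""
    return [list(range(pe, length, num_pes)) for pe in range(num_pes)]


def assign_blocked(length, num_pes):
    """Give each PE one contiguous block of columns."""
    size = -(-length // num_pes)  # ceiling division
    return [list(range(pe * size, min((pe + 1) * size, length)))
            for pe in range(num_pes)]


def load_per_pe(assignment, work):
    """Total assumed work assigned to each PE."""
    return [sum(work[j] for j in cols) for cols in assignment]
```

With 8 columns and 4 PEs under this cost model, the cyclic mapping gives per-PE loads of 16, 24, 34, and 46, while the blocked mapping gives 4, 16, 36, and 64, so the most heavily loaded PE does noticeably less work under the cyclic scheme.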