High-performance computing (HPC) architectures are now designed to solve a wide range of sophisticated scientific and engineering problems across a growing number of HPC and professional workloads. The key trigonometric functions sine and cosine are applied and computed in all spheres of daily life, yet computing them is a fairly time-consuming task in high-performance numerical simulations. In this paper, we present a detailed discussion of how the micro-architecture of the single-core Itanium® and Alpha 21264/21364 processors, together with manual optimization techniques, improves the computing performance of several mathematical functions. By describing the algorithm in detail along with its execution pattern on the processor, we confirm that the processor micro-architecture combined with manual optimization techniques improves computing performance significantly compared with not only the standard math library's built-in functions under compiler optimization options but also the highly optimized mathematical functions of the Intel® Itanium® library.
Advanced Encryption Standard (AES) is the most popular symmetric encryption method; it encrypts streams of data using symmetric keys. Currently preferred AES architectures employ effective methods to achieve two important goals: protection against power analysis attacks and high throughput. Taking a different architectural point of view, we implement a parallel architecture targeting the latter goal that enables more efficient pipelining on a field-programmable gate array (FPGA). To this end, all intermediate registers used for unrolling the main loop are removed. Instead of unrolling the main loop of the AES algorithm, we implement a pipelined structure by replicating nonpipelined AES architectures and using an auto-assigner mechanism for each AES block. The new pipelined architecture offers two valuable advantages: (a) it solves the single-point-of-failure problem when one of the replicated parts is faulty and (b) it allows the proposed design to be deployed as a fault-tolerant AES architecture. In addition, we emphasize area optimization for all four main AES functions to reduce the overhead associated with AES block replication. The simulation results show that the maximum frequency of our proposed AES architecture is 675.62 MHz, and for AES-128 the throughput is 86.5 Gbps, which is 30.9% better than its closest existing competitor.
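The two reported figures can be related with a short calculation. The sketch below assumes that the design as a whole completes one 128-bit block per clock cycle at the maximum frequency; the abstract does not state this, but the numbers are consistent with it. The function name and its parameters are hypothetical and used only for illustration.

```python
def throughput_gbps(f_max_mhz, block_bits=128, blocks_per_cycle=1.0):
    """Estimate throughput in Gbps from clock frequency in MHz.

    Assumption (not stated in the abstract): the architecture completes
    `blocks_per_cycle` AES blocks of `block_bits` bits every clock cycle.
    """
    bits_per_second = f_max_mhz * 1e6 * block_bits * blocks_per_cycle
    return bits_per_second / 1e9


# Reported maximum frequency from the abstract: 675.62 MHz.
estimate = throughput_gbps(675.62)
print(f"{estimate:.2f} Gbps")
```

Under that assumption the estimate rounds to the 86.5 Gbps reported for AES-128.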
Generalized parallel counters (GPCs) are frequently used to construct high-speed compressor trees on field-programmable gate arrays (FPGAs). The introduction of fast carry-chains in FPGAs has greatly improved the performance of these elements. Consequently, a large number of GPCs that use a combination of look-up tables (LUTs) and carry-chains have been proposed in the literature. In this paper, we take an alternative approach and try to eliminate the carry-chain from the GPC structure. We present a heuristic that aims at synthesizing GPCs on FPGAs using only the general LUT fabric. The resultant GPCs are then easily pipelined by placing registers at the output node of each LUT. We have applied our heuristic to various GPCs reported in prior work. It successfully eliminates the carry-chain from the GPC structure, at the cost of an increased LUT count in some GPCs. Experiments on Xilinx FPGAs show that filter systems constructed using our GPCs achieve better speed and power performance and comparable area performance.
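To illustrate why a GPC can in principle be built from LUTs alone, the sketch below treats a GPC as a pure Boolean function: it sums input bits weighted by their column and emits the binary result, so each output bit has a truth table over the inputs. This is not the heuristic proposed in the paper. The function names and the convention of listing column heights from the least significant column upward are assumptions made for this example.

```python
from itertools import product


def gpc_value(bits, shape):
    """Weighted sum of input bits.

    `shape` lists column heights starting at the least significant column
    (hypothetical convention): [3, 2] means three bits of weight 1 and
    two bits of weight 2.
    """
    total = 0
    idx = 0
    for weight, height in enumerate(shape):
        for _ in range(height):
            total += bits[idx] << weight
            idx += 1
    return total


def gpc_truth_tables(shape):
    """Return one truth table per output bit, least significant bit first."""
    n_in = sum(shape)
    max_val = sum(height << weight for weight, height in enumerate(shape))
    n_out = max_val.bit_length()
    tables = [[] for _ in range(n_out)]
    for bits in product((0, 1), repeat=n_in):
        value = gpc_value(bits, shape)
        for k in range(n_out):
            tables[k].append((value >> k) & 1)
    return tables


# A (6;3) counter: six bits of equal weight, three output bits.
tables = gpc_truth_tables([6])
print(len(tables), len(tables[0]))
```

For the (6;3) counter, each of the three output bits is a function of six inputs with 64 truth-table entries. Wider GPCs need their output functions decomposed across several LUTs, which is the kind of problem a synthesis heuristic has to address.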
The seminal work of Viola and Jones on automatic face detection is widely used in many human–computer interaction and computer vision applications. On analyzing existing face detection architectures, we observed that integral image calculation, feature computation in the cascaded classifier, and recursive scanning of the image with a sliding window at multiple scales are the major factors that increase the memory and time complexity of the algorithm. Therefore, in this paper, we propose a hardware–software co-design of the Viola–Jones face detector for System-on-Chip (SoC). In the proposed architecture, the integral image computation and cascaded classifier sub-modules are implemented in hardware on the Programmable Logic FPGA (PL-FPGA), while the image scaling and nonmaximum suppression sub-modules are implemented in software on the Processing System ARM (PS-ARM). Pipelining, folding, and parallel processing are used to produce an optimized design architecture. The proposed architecture has been tested on a PYNQ-Z1 board. The implementation achieves a processing speed of 95 fps with PL and PS clocks of 100 MHz and 650 MHz, respectively, for an image of QVGA resolution. Analysis of the results shows that the proposed architecture has the lowest resource requirements among the state-of-the-art implementations compared, which facilitates and promotes the use of resource-constrained, low-cost ZYNQ SoCs for face detection.
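The integral image mentioned above can be sketched in software as follows. This is a minimal reference version of the standard technique, not the hardware design described in the abstract. The function names, the list-of-lists image representation, and the zero-padded first row and column are assumptions made for this example.

```python
def integral_image(img):
    """Build a summed-area table with a zero-padded first row and column.

    ii[y][x] holds the sum of all pixels img[r][c] with r < y and c < x.
    """
    height = len(img)
    width = len(img[0]) if height else 0
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))
```

Once the table is built, any rectangle sum needs only four lookups regardless of its size, which is what makes rectangle-based feature evaluation cheap at every window position and scale.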