
  • Article (No Access)

    A Deep Learning Network-on-Chip (NoC)-Based Switch-Router to Enhance Information Security in Resource-Constrained Devices

    In the resource-constrained environments of the 21st century, hardware-based reconfigurable systems such as Field Programmable Gate Arrays (FPGAs) are considered an effective way to enhance information security. Compared with traditional custom circuitry, which offers no flexibility, reconfigurable hardware shows excellent potential for cyber security by increasing hardware speed and flexibility. In the quest to integrate multi-core systems, the Network-on-Chip (NoC) has therefore become one of the most widespread techniques for maximizing router security. Due to the significant chip-area overhead and power consumption of its routers, however, an NoC is substantially more expensive to construct than a bus-based system. A control component (CC) interacts with each network, injecting packets based on router switching and activity; these control components are coupled with each network to produce a system of controlled networks. The system is further linked to a Centralized Fabric Manager (CFM), which serves as the network’s focal point and runs the algorithm at regular intervals. The analyzed parameters comprise flip-flops, power, latency, number of lookup tables (LUTs), and throughput. In the proposed method, the LUT area is 0.35mm², the flip-flop area is 3.5mm², the power is 3.4μW, the latency is 5941ns, and the achieved throughput is 0.56 flits/cycle. Results indicate that the crossbar switch reduces errors and minimizes delay at the architecture’s output level, improving the performance, power, throughput, and area-delay parameters. The findings can help enhance information security among lightweight devices while minimizing the chances of network attacks in today’s dynamic and complex cyberspace.
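    The crossbar switch's role in avoiding output contention can be illustrated with a small behavioral sketch. The Python below models a round-robin crossbar arbiter; the function name and rotating-priority scheme are our illustrative assumptions, not the paper's design. Each input port requests one output port, and each output is granted to at most one input per cycle, with a rotating start index so no port is starved.

    ```python
    def arbitrate(requests, priority):
        """Grant each output port to at most one requesting input port.

        requests: dict mapping input port -> requested output port
                  (ports are assumed to be numbered 0..n-1)
        priority: rotating start index used to break conflicts fairly
        Returns a dict mapping input port -> granted output port.
        """
        n = len(requests)
        granted_outputs = set()
        grants = {}
        # Visit inputs in rotating order so arbitration stays fair.
        for offset in range(n):
            inp = (priority + offset) % n
            out = requests.get(inp)
            if out is not None and out not in granted_outputs:
                grants[inp] = out
                granted_outputs.add(out)
        return grants
    ```

    With `priority=0`, input 0 wins the contested output 2; advancing the priority on the next cycle lets input 1 win instead.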

  • Article (No Access)

    Design and Implementation of Face Detection Architecture for Heterogeneous System-on-Chip

    The seminal work of Viola and Jones on automatic face detection is widely used in many human–computer interaction and computer vision applications. On analyzing existing face detection architectures, we observed that integral image calculation, feature computation in the cascaded classifier, and recursive scanning of the image with a sliding window at multiple scales are the major contributors to the memory and time complexity of the algorithm. Therefore, in this paper, we propose a hardware–software co-design of the Viola–Jones face detector for System-on-Chip (SoC). In the proposed architecture, the integral image computation and cascaded classifier sub-modules are implemented in hardware on the Programmable Logic FPGA (PL-FPGA), while the image scaling and non-maximum suppression sub-modules are implemented in software on the Processing System ARM (PS-ARM). Concepts of pipelining, folding, and parallel processing are effectively utilized to produce an optimal design. The proposed architecture has been tested on the PYNQ-Z1 board and achieves a processing speed of 95 fps with PL and PS clocks of 100MHz and 650MHz, respectively, for an image of QVGA resolution. Analysis of the results demonstrates that the proposed architecture has the lowest resource requirement compared to state-of-the-art implementations, which facilitates and promotes the use of the resource-constrained, low-cost ZYNQ SoC for face detection.
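    The integral image sub-module at the heart of the Viola–Jones detector is easy to sketch in software. The Python below is a minimal illustration, not the paper's PL-FPGA implementation: it builds the summed-area table in one pass and then evaluates any rectangle sum with four table lookups, which is exactly what makes Haar-like feature computation cheap.

    ```python
    def integral_image(img):
        """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
        h, w = len(img), len(img[0])
        ii = [[0] * w for _ in range(h)]
        for y in range(h):
            row_sum = 0
            for x in range(w):
                row_sum += img[y][x]
                ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
        return ii

    def box_sum(ii, x0, y0, x1, y1):
        """Sum over the inclusive rectangle using at most four lookups."""
        total = ii[y1][x1]
        if x0 > 0:
            total -= ii[y1][x0 - 1]
        if y0 > 0:
            total -= ii[y0 - 1][x1]
        if x0 > 0 and y0 > 0:
            total += ii[y0 - 1][x0 - 1]
        return total
    ```

    Once the table exists, every rectangular feature costs constant time regardless of its size, which is why the hardware pipeline streams the table once and reuses it for all classifier stages.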

  • Article (No Access)

    Improved Synthesis of Generalized Parallel Counters on FPGAs Using Only LUTs

    Generalized parallel counters (GPCs) are frequently used to construct high-speed compressor trees on field-programmable gate arrays (FPGAs). The introduction of the fast carry-chain in FPGAs has greatly improved the performance of these elements, and a large number of GPCs that use a combination of look-up tables (LUTs) and carry-chains have consequently been proposed in the literature. In this paper, we take an alternate approach and eliminate the carry-chain from the GPC structure. We present a heuristic that synthesizes GPCs on FPGAs using only the general LUT fabric. The resulting GPCs are then easily pipelined by placing registers at the output node of each LUT. We have applied our heuristic to various GPCs reported in prior work; it successfully eliminates the carry-chain from the GPC structure, with an increase in LUT count for some GPCs. Experimentation using Xilinx FPGAs shows that filter systems constructed with our GPCs achieve improved speed and power performance with comparable area.
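    The idea of a LUT-only GPC can be sketched behaviorally. The Python below models a (6;3) GPC, a common counter that compresses six weight-1 bits into a 3-bit sum, by enumerating the 64-entry truth table that one 6-input LUT per output bit would hold; the specific counter and names are our illustrative choices, not taken from the paper.

    ```python
    from itertools import product

    def popcount3(bits):
        """Reference behavior: 3-bit count (MSB first) of the set bits."""
        c = sum(bits)
        return ((c >> 2) & 1, (c >> 1) & 1, c & 1)

    # One 6-input LUT per output bit: enumerate all 64 input patterns once,
    # much as FPGA synthesis would fill the three LUT init masks.
    LUT_S2, LUT_S1, LUT_S0 = {}, {}, {}
    for pattern in product((0, 1), repeat=6):
        s2, s1, s0 = popcount3(pattern)
        LUT_S2[pattern], LUT_S1[pattern], LUT_S0[pattern] = s2, s1, s0

    def gpc_6_3(bits):
        """Evaluate the (6;3) GPC purely by LUT lookup (no carry-chain)."""
        key = tuple(bits)
        return (LUT_S2[key], LUT_S1[key], LUT_S0[key])
    ```

    Because each output bit is an independent lookup, a register placed at each LUT output pipelines the counter with no carry-chain crossing between bits.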

  • Article (No Access)

    Evaluating Multiple Streams on Heterogeneous Platforms

    Using multiple streams can improve overall system performance by mitigating data transfer overhead on heterogeneous systems. Prior work has focused largely on GPUs, and little is known about the performance impact on the (Intel Xeon) Phi. In this work, we apply multiple streams to six real-world applications on the Phi and systematically evaluate the performance benefits at two levels: microbenchmarks and real-world applications. Our microbenchmark results show that data transfers and kernel execution can be overlapped on the Phi, while data transfers in the two directions are serialized with respect to each other. At the application level, we show that both overlappable and non-overlappable applications can benefit from using multiple streams, with a performance improvement of up to 24%. We also quantify how task granularity and resource granularity impact overall performance, and we present a set of heuristics that reduce the search space when determining a proper task granularity and resource granularity. Our evaluation provides useful insights for runtime and architecture designers using multiple streams on the Phi.
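    The benefit of multiple streams comes from overlapping data transfer with computation. The Python sketch below emulates this with a two-deep staging queue and a producer thread standing in for a copy stream, so chunk i+1 can be "transferred" while chunk i is computed; it is an illustrative model under our own naming, not the offload runtime evaluated in the paper.

    ```python
    import threading
    import queue

    def pipelined_offload(chunks, transfer, compute):
        """Overlap 'transfer' of chunk i+1 with 'compute' of chunk i,
        mimicking one stream copying while another executes."""
        staged = queue.Queue(maxsize=2)  # double buffer between the streams

        def producer():
            # Plays the role of the copy stream: stage each chunk in order.
            for c in chunks:
                staged.put(transfer(c))
            staged.put(None)  # sentinel: no more chunks

        t = threading.Thread(target=producer)
        t.start()
        results = []
        # Plays the role of the compute stream: consume as chunks arrive.
        while (item := staged.get()) is not None:
            results.append(compute(item))
        t.join()
        return results
    ```

    The `maxsize=2` bound mirrors the task-granularity trade-off the paper studies: a deeper buffer hides more transfer latency but costs more staging memory.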

  • Article (No Access)

    High-Performance Mathematical Functions for Single-Core Architectures

    Modern high-performance computing (HPC) architectures are designed to solve assorted sophisticated scientific and engineering problems across an ever-growing number of HPC and professional workloads. The key trigonometric functions sine and cosine are applied in all spheres of daily life, yet their computation remains a fairly time-consuming task in high-performance numerical simulations. In this paper, we deliver a detailed discussion of how the micro-architecture of single-core Itanium® and Alpha 21264/21364 processors, together with manual optimization techniques, improves the computing performance of several mathematical functions. By describing the detailed algorithm and its execution pattern on the processor, we confirm that the processor micro-architecture, alongside manual optimization, improves computing performance significantly compared not only to the standard math library's built-in functions with compiler optimization options but also to the Intel® Itanium® library's highly optimized mathematical functions.
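    A typical manual optimization for sine on such processors is to evaluate a short odd polynomial with Horner's rule, which exposes a single multiply-add chain to the deep floating-point pipeline. The Python sketch below uses plain Taylor coefficients on [-π/2, π/2] for simplicity; production libraries use minimax coefficients and full argument reduction, so this is our illustration rather than the paper's implementation.

    ```python
    import math

    # Degree-11 odd polynomial for sin(x) on [-pi/2, pi/2].
    # Coefficients of x, x^3, ..., x^11 (Taylor series, for clarity).
    C = [1.0, -1.0 / 6, 1.0 / 120, -1.0 / 5040, 1.0 / 362880, -1.0 / 39916800]

    def fast_sin(x):
        """Evaluate sin(x) with Horner's rule in x*x, so the whole
        polynomial is one fused multiply-add chain plus a final multiply."""
        x2 = x * x
        p = C[-1]
        for c in reversed(C[:-1]):
            p = p * x2 + c  # one FMA per coefficient on real hardware
        return x * p
    ```

    On in-order machines like the Itanium, scheduling this chain by hand (and interleaving independent evaluations) is exactly the kind of manual optimization the paper measures.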

  • Article (No Access)

    Efficient Audio Filter Using Folded Pipelining Architecture Based on Retiming Using Evolutionary Computation

    In digital signal processing (DSP) architectures, it is important to minimize the silicon area of the integrated circuits. This can be achieved by reducing the number of functional units such as adders and multipliers. In the literature, the folding technique is used to reduce functional units by executing multiple algorithm operations on a single functional unit, and register minimization techniques reduce the number of registers in a folded architecture. Retiming is a transformation that must be performed before applying folding. In this paper, retiming is performed with a nature-inspired evolutionary computation method, which generates a database of solutions from which the best solution can be selected for subsequent folding.

    As part of this work, an efficient folded noise-removal audio filter prototype is designed as an application example using evolutionary computation-based retiming and folding with register minimization. The folding technique, however, increases the number of registers while multiplexing datapath adder and multiplier elements, so register minimization is applied after folding to reduce the register count. After the retimed, folded filter architecture is obtained, low-level synthesis is performed, which maps the datapath adder and multiplier blocks to actual hardware. Various adder and multiplier architectures are compared in the area-power-performance space, and depending on the user-defined constraint, a folded architecture with a specific combination of datapath elements is mapped onto hardware. A framework is designed in this paper to automate the entire process, reducing design-cycle time. All the designed filters target ASIC implementation, and comparative simulation results are provided.
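    Folding can be pictured as time-multiplexing one multiply-accumulate unit across all filter taps. The Python model below is our sketch, not the paper's synthesized design: each output sample takes N "clock cycles" on a single MAC, with the delay line standing in for the registers that survive register minimization.

    ```python
    def folded_fir(samples, coeffs):
        """FIR filter in which one multiplier and one adder are
        time-multiplexed across all taps, as a folded architecture
        would schedule them."""
        n = len(coeffs)
        delay_line = [0.0] * n  # tap registers kept after minimization
        out = []
        for s in samples:
            # Shift the new sample into the delay line.
            delay_line = [s] + delay_line[:-1]
            acc = 0.0
            for k in range(n):  # n cycles on the single MAC unit
                acc += coeffs[k] * delay_line[k]
            out.append(acc)
        return out
    ```

    An unfolded design would instantiate `n` multipliers and an adder tree; folding trades that area for an `n`-cycle initiation interval per sample, which is acceptable at audio rates.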

  • Article (No Access)

    Pipelined-Scheduling of Multiple Embedded Applications on a Multi-Processor-SoC

    Due to clock and power constraints, it is hard to extract more performance from single-core architectures, so multi-core systems are now the architecture of choice to provide the needed computing power. In embedded systems, the multi-processor system-on-chip (MPSoC) is widely used to run complex embedded applications effectively. To fully utilize an MPSoC, however, tools that generate optimized schedules are highly needed. In this paper, we design an integrated approach to task scheduling and memory partitioning for multiple applications utilizing the MPSoC simultaneously, in contrast to the traditional decoupled approach that treats task scheduling and memory partitioning as two separate problems. Our framework is also based on pipelined scheduling to increase the throughput of the system. Results on different benchmarks show the effectiveness of our techniques.
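    The payoff of pipelined scheduling is that throughput is set by the most-loaded processor, not by the end-to-end latency of all stages. The Python sketch below uses a simple longest-processing-time greedy assignment (an illustrative heuristic under our own naming, not the paper's integrated scheduler) and reports the resulting initiation interval.

    ```python
    def pipeline_schedule(task_times, n_procs):
        """Greedily assign pipeline stages to processors, longest first.

        Returns (assignment, initiation_interval): with pipelining, a new
        result is produced every initiation_interval time units, i.e. the
        load of the busiest processor, not the sum of all stage times.
        """
        loads = [0.0] * n_procs
        assignment = {}
        for task, t in sorted(task_times.items(), key=lambda kv: -kv[1]):
            p = min(range(n_procs), key=loads.__getitem__)  # least loaded
            loads[p] += t
            assignment[task] = p
        return assignment, max(loads)
    ```

    For four stages of cost 4, 3, 2, 1 on two processors, the total work is 10 but the pipeline emits one result every 5 time units, doubling throughput over sequential execution.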

  • Article (No Access)

    A Fault Tolerant Parallelism Approach for Implementing High-Throughput Pipelined Advanced Encryption Standard

    Advanced Encryption Standard (AES) is the most popular symmetric encryption method, encrypting streams of data using symmetric keys. Current preferred AES architectures employ effective methods to achieve two important goals: protection against power-analysis attacks and high throughput. Taking a different architectural point of view, we implement a particular parallel architecture for the latter goal, which enables more efficient pipelining in a field-programmable gate array (FPGA). In this design, all intermediate registers that serve to unroll the main loop are removed; instead of unrolling the main loop of the AES algorithm, we implement the pipelining structure by replicating nonpipelined AES architectures and using an auto-assigner mechanism for each AES block. The new pipelined architecture yields two valuable advantages: (a) it solves the single point of failure problem when one of the replicated parts is faulty, and (b) it allows the proposed design to be deployed as a fault-tolerant AES architecture. In addition, we emphasize area optimization for all four AES main functions to reduce the overhead associated with AES block replication. Simulation results show that the maximum frequency of the proposed architecture is 675.62MHz, and for AES-128 the throughput is 86.5Gbps, which is 30.9% better than its closest existing competitor.
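    The auto-assigner and its fault-tolerance property can be sketched abstractly. In the Python below, the replicated "AES units" are plain callables (a toy XOR cipher stands in for real AES), and the names and failure model are our assumptions: the dispatcher rotates over the replicas round-robin and permanently retires any unit that faults, so a single failure never stalls the stream.

    ```python
    def make_dispatcher(units):
        """Round-robin auto-assigner over replicated (nonpipelined)
        encryption units; 'units' is a list of callables standing in
        for hardware cores. A faulting unit is taken out of rotation."""
        state = {"next": 0}
        faulty = set()

        def dispatch(block):
            n = len(units)
            for _ in range(n):
                i = state["next"]
                state["next"] = (i + 1) % n
                if i in faulty:
                    continue
                try:
                    return units[i](block)
                except Exception:
                    faulty.add(i)  # retire the broken replica, try the next
            raise RuntimeError("all replicated units are faulty")

        return dispatch
    ```

    Successive blocks land on successive healthy units, so with k replicas the aggregate throughput approaches k times a single nonpipelined core while tolerating unit failures.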