

    ROW ORIENTED GAUSS ELIMINATION ON DISTRIBUTED MEMORY MULTIPROCESSORS

    This paper deals with Gauss elimination for solving general dense linear systems on distributed memory multiprocessors. A row-oriented parallel algorithm is proposed and implemented on the NCUBE distributed memory multiprocessor. We study in some detail various implementation choices that are important factors affecting the algorithm's performance: the mapping of rows onto processors, the pivoting implementation, message passing, communication granularity, and pipelining. Experiments with these choices on the NCUBE are reported, illustrating the algorithm's performance characteristics. Both standard and memory-scaled speedups are measured and show that the algorithm is very efficient. Comparisons with previous approaches that mix row and column operations show a clear gain in efficiency.
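
    A minimal serial sketch of the row-oriented scheme, assuming a row-cyclic ("wrap") mapping and partial pivoting; the mapping function and all names are illustrative, not the paper's code:

        # Row-cyclic sketch: row i lives on processor i % p.  In the parallel
        # version, the owner of pivot row k broadcasts it, and every processor
        # then updates only the rows it owns.
        def owner(i, p):
            return i % p

        def gauss_eliminate(A, b, p=4):
            n = len(A)
            for k in range(n):
                # Partial pivoting: largest |A[i][k]| at or below row k.
                piv = max(range(k, n), key=lambda i: abs(A[i][k]))
                A[k], A[piv] = A[piv], A[k]
                b[k], b[piv] = b[piv], b[k]
                # owner(k, p) would broadcast (A[k], b[k]) here.
                for i in range(k + 1, n):
                    # This update is performed locally by owner(i, p).
                    m = A[i][k] / A[k][k]
                    for j in range(k, n):
                        A[i][j] -= m * A[k][j]
                    b[i] -= m * b[k]
            return A, b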


    PARALLEL AND PIPELINED PARALLEL CONSECUTIVE SUMS ON A HYPERCUBE WITH APPLICATION TO RAY CASTING

    The communication penalty of parallel computation is related to two factors: message startup time and the speed of data transmission between the host and the processing elements (PEs). We propose two algorithms in this paper to show that the first factor can be alleviated by reducing the number of messages, and the second by making host-PE communication concurrent with computation on the PE array.

    The algorithms perform 2^n consecutive sums of 2^n numbers each on a hypercube of degree n. The first algorithm leaves one sum on each processor; it takes n steps to complete the sums and reduces the number of messages generated by a PE from 2^n to n. The second algorithm sends all the sums back to the host as they are generated one by one; it takes 2^n + n - 1 steps to complete the sums in a pipeline, so that one sum is completed in every step after the initial n - 1 steps.

    We apply the second algorithm to front-to-back composition for ray casting. For a large number of rays, the efficiency and speedup of our algorithm are close to the theoretically optimal values.
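
    The first algorithm has the shape of a recursive-halving reduce-scatter: each PE starts with a vector of 2^n numbers, exchanges half of its live slice along each hypercube dimension, and ends holding exactly one completed sum. A small simulation of that pattern (my reconstruction, not the authors' code):

        # Simulated recursive-halving on a 2**n-node hypercube: PE r starts
        # with a full vector; after n pairwise exchanges it holds the single
        # sum of all PEs' entries at index r (one message per dimension).
        def hypercube_consecutive_sums(vectors):
            P = len(vectors)                 # P = 2**n processors
            n = P.bit_length() - 1
            data = [list(v) for v in vectors]
            lo = [0] * P                     # global offset of each live slice
            size = P
            for d in range(n - 1, -1, -1):   # one exchange step per dimension
                size //= 2
                nxt, nlo = [], []
                for r in range(P):
                    partner = r ^ (1 << d)
                    s = ((r >> d) & 1) * size        # keep the half containing r
                    mine = data[r][s:s + size]
                    theirs = data[partner][s:s + size]   # the one message
                    nxt.append([a + b for a, b in zip(mine, theirs)])
                    nlo.append(lo[r] + s)
                data, lo = nxt, nlo
            return [v[0] for v in data]      # PE r ends with sum number r

        print(hypercube_consecutive_sums([[1, 2, 3, 4]] * 4))   # [4, 8, 12, 16]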


    MODIFYING THE SINGULAR VALUE DECOMPOSITION ON THE CONNECTION MACHINE

    A fully parallel algorithm for updating and downdating the singular value decomposition (SVD) of an m-by-n (m ≥ n) matrix A is described. The algorithm uses chasing techniques similar to those for modifying the SVD described in [3], but requires fewer plane rotations and can be implemented almost identically for both updating and downdating. Both cyclic and consecutive storage schemes are considered for the parallel implementation, and we show that the latter outperforms the former on a distributed memory MIMD multiprocessor. We present experimental results on the 32-node Connection Machine CM-5.
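
    For orientation, the plane rotations being counted are Givens rotations; a generic sketch of the standard construction (textbook formulas, not the paper's kernels):

        import math

        def givens(a, b):
            """Return (c, s) with  c*a + s*b = r  and  -s*a + c*b = 0."""
            if b == 0.0:
                return 1.0, 0.0
            r = math.hypot(a, b)
            return a / r, b / r

        def rotate_rows(A, i, j, c, s):
            """Apply the plane rotation to rows i and j of A in place."""
            for k in range(len(A[0])):
                ai, aj = A[i][k], A[j][k]
                A[i][k] = c * ai + s * aj
                A[j][k] = -s * ai + c * aj

        A = [[3.0, 1.0], [4.0, 2.0]]
        c, s = givens(A[0][0], A[1][0])
        rotate_rows(A, 0, 1, c, s)
        print(A)   # A[1][0] is now zero; chasing repeats this pattern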


    Evaluating Multiple Streams on Heterogeneous Platforms

    Using multiple streams can improve overall system performance by mitigating the data transfer overhead on heterogeneous systems. Prior work focuses largely on GPUs, and little is known about the performance impact on the Intel Xeon Phi. In this work, we apply multiple streams to six real-world applications on the Phi and systematically evaluate the performance benefits at two levels: the microbenchmark level and the real-world application level. Our microbenchmark results show that data transfers and kernel execution can be overlapped on the Phi, while data transfers in the two directions are performed serially. At the application level, we show that both overlappable and non-overlappable applications can benefit from using multiple streams, with a performance improvement of up to 24%. We also quantify how task granularity and resource granularity impact overall performance, and present a set of heuristics to reduce the search space when determining a proper task and resource granularity. Overall, our evaluation provides many insights for runtime and architecture designers using multiple streams on the Phi.
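
    A minimal double-buffering sketch of the overlap being measured: while chunk i is being "computed", chunk i+1 is being "transferred". The transfer()/compute() bodies are stand-ins for the real offload API, not actual Phi calls:

        from concurrent.futures import ThreadPoolExecutor
        import time

        def transfer(chunk):
            time.sleep(0.01)      # stand-in for a host-to-device copy
            return chunk

        def compute(chunk):
            time.sleep(0.02)      # stand-in for kernel execution
            return sum(chunk)

        def pipelined(chunks):
            results = []
            with ThreadPoolExecutor(max_workers=2) as pool:
                pending = pool.submit(transfer, chunks[0])
                for nxt in chunks[1:]:
                    ready = pending.result()
                    pending = pool.submit(transfer, nxt)   # overlap next copy...
                    results.append(compute(ready))         # ...with this compute
                results.append(compute(pending.result()))
            return results

        print(pipelined([list(range(1000))] * 8))   # eight chunks, overlapped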


    PERFORMANCE STUDY OF LU FACTORIZATION WITH LOW COMMUNICATION OVERHEAD ON MULTIPROCESSORS

    In this paper, we make efficient use of asynchronous communications in the LU decomposition algorithm with pivoting and a column-scattered data decomposition to derive precise computational complexities. We then compare these results with experiments on the Intel iPSC/860 and Paragon machines, and show that very good performance can be obtained on a ring with asynchronous communications.
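
    A serial sketch of the column-scattered ("wrap") layout the analysis assumes: column j lives on processor j % p, and at step k the owner of column k would send the multipliers onward asynchronously while continuing its own work. Names are illustrative, not the paper's code:

        # Column-scattered LU sketch with partial pivoting; p is the number
        # of processors in the wrap mapping owner(j) = j % p.
        def lu_column_scattered(A, p=4):
            n = len(A)
            piv = list(range(n))
            for k in range(n - 1):
                # owner(k) = k % p selects the pivot in its own column.
                r = max(range(k, n), key=lambda i: abs(A[i][k]))
                A[k], A[r] = A[r], A[k]
                piv[k], piv[r] = piv[r], piv[k]
                mult = [A[i][k] / A[k][k] for i in range(k + 1, n)]
                # owner(k) sends `mult` asynchronously in the parallel code;
                # each processor then updates only the columns j it owns.
                for j in range(k + 1, n):
                    for i in range(k + 1, n):
                        A[i][j] -= mult[i - k - 1] * A[k][j]
                for i in range(k + 1, n):
                    A[i][k] = mult[i - k - 1]   # store L below the diagonal
            return A, piv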


    A PIPELINED BROADCAST FOR MULTIDIMENSIONAL MESHES

    We address the problem of performing a pipelined broadcast on a mesh architecture. Meshes require a different approach than other topologies, and their very nature puts a tighter bound on the performance that one can hope to achieve. By using the appropriate techniques, however, one can obtain excellent performance for sufficiently long messages. The resulting algorithm will work on meshes of any dimension with any number of nodes. Our model assumes that the mesh is a torus and/or that it has bidirectional links and uses wormhole routing. Performance data from the Cray T3D are included.
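
    The trade-off for long messages can be seen in the usual pipelined-broadcast cost model (a generic textbook model, not the paper's analysis): cutting an m-byte message into k segments over a d-hop path costs T(k) = (d - 1 + k)(alpha + beta*m/k), where alpha is the per-message startup and beta the per-byte cost.

        import math

        def broadcast_time(m, d, k, alpha, beta):
            """Pipelined broadcast of m bytes in k segments over d hops."""
            return (d - 1 + k) * (alpha + beta * m / k)

        def best_segments(m, d, alpha, beta):
            """Minimizer of T(k): k* = sqrt((d - 1) * m * beta / alpha)."""
            return max(1, round(math.sqrt((d - 1) * m * beta / alpha)))

        # e.g. 1 MB over a 16-hop path, alpha = 50 us, beta = 0.01 us/byte
        k = best_segments(1 << 20, 16, 50.0, 0.01)
        print(k, broadcast_time(1 << 20, 16, k, 50.0, 0.01))

    The model shows why sufficiently long messages pipeline well: as m grows, the startup term is amortized over more segments and the time per node approaches the pure bandwidth cost.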


    HIGH-PERFORMANCE MATHEMATICAL FUNCTIONS FOR SINGLE-CORE ARCHITECTURES

    Modern high-performance computing (HPC) architectures are designed to solve a wide range of sophisticated scientific and engineering problems across an ever-growing number of HPC and professional workloads. The key trigonometric functions sine and cosine are applied in all spheres of daily life, yet remain fairly time-consuming in high-performance numerical simulations. In this paper, we give a detailed account of how the micro-architecture of single-core Itanium® and Alpha 21264/21364 processors, together with manual optimization techniques, improves the computing performance of several mathematical functions. After describing the algorithms in detail and their execution patterns on the processors, we confirm that the processor micro-architecture, combined with manual optimization, improves computing performance significantly compared not only with the standard math library's built-in functions under compiler optimization options, but also with the highly optimized mathematical functions of the Intel® Itanium® library.
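
    The standard recipe behind fast sine/cosine routines is argument reduction followed by a short polynomial evaluated with Horner's rule; a sketch with illustrative Taylor coefficients (production libraries use minimax coefficients and much more careful reduction):

        import math

        def fast_sin(x):
            # 1. Argument reduction: x = k*pi + r with |r| <= pi/2.
            k = round(x / math.pi)
            r = x - k * math.pi
            # 2. Horner evaluation: sin r ~ r - r^3/6 + r^5/120 - r^7/5040.
            r2 = r * r
            p = r * (1.0 + r2 * (-1.0/6 + r2 * (1.0/120 + r2 * (-1.0/5040))))
            # 3. Undo the reduction: sin(k*pi + r) = (-1)^k * sin(r).
            return -p if k & 1 else p

        print(fast_sin(1.0), math.sin(1.0))   # agree to ~1e-4 with this toy degree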


    Efficient Audio Filter Using Folded Pipelining Architecture Based on Retiming Using Evolutionary Computation

    It is important in digital signal processing (DSP) architectures to minimize the silicon area of the integrated circuits. This can be achieved by reducing the number of functional units, such as adders and multipliers. In the literature, the folding technique is used to reduce the number of functional units by executing multiple algorithm operations on a single functional unit, and register minimization techniques are used to reduce the number of registers in the folded architecture. Retiming is a transformation that must be applied before folding. In this paper, retiming is performed with a nature-inspired evolutionary computation method, which generates a database of solutions from which the best one can be picked for subsequent folding.

    As part of this work, an efficient folded noise-removal audio filter prototype is designed as an application example, using evolutionary-computation-based retiming and folding with register minimization. Folding, however, increases the number of registers while multiplexing the datapath adder and multiplier elements, so register minimization is applied after folding to reduce the register count. Once the retimed, folded filter architecture is obtained, low-level synthesis is performed, mapping the datapath adder and multiplier blocks to actual hardware. Various adder and multiplier architectures are compared in the area-power-performance space and, depending on the user-defined constraint, the folded architecture with a specific combination of datapath elements is mapped onto hardware. A framework is designed to automate the entire process, which reduces the design cycle time. All the designed filters are targeted for ASIC implementation, and the comparisons are provided as part of the simulation results.
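
    A behavioral sketch of the folding idea itself: an N-tap filter that would need N multipliers in a fully parallel datapath is folded onto one multiply-accumulate unit, time-multiplexed over N cycles per output sample. This is a generic illustration, not the paper's architecture:

        def folded_fir(samples, coeffs):
            N = len(coeffs)
            delay = [0.0] * N          # the register chain (after minimization)
            out = []
            for x in samples:
                delay = [x] + delay[:-1]
                acc = 0.0
                for c, d in zip(coeffs, delay):   # N cycles on the single MAC
                    acc += c * d                  # one multiply-accumulate/cycle
                out.append(acc)
            return out

        # e.g. a 4-tap moving-average noise smoother
        print(folded_fir([1, 2, 3, 4, 5], [0.25, 0.25, 0.25, 0.25]))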


    A Fault Tolerant Parallelism Approach for Implementing High-Throughput Pipelined Advanced Encryption Standard

    Advanced Encryption Standard (AES) is the most popular symmetric encryption method; it encrypts streams of data using symmetric keys. Current AES architectures employ effective methods to pursue two important goals: protection against power-analysis attacks and high throughput. Taking a different architectural point of view, we implement a particular parallel architecture for the latter goal, one capable of more efficient pipelining in a field-programmable gate array (FPGA). In this design, all the intermediate registers that serve to unroll the main loop are removed: instead of unrolling the main loop of the AES algorithm, we build the pipelining structure by replicating nonpipelined AES architectures and using an auto-assigner mechanism for each AES block. The new pipelined architecture brings two valuable advantages: (a) it removes the single point of failure when one of the replicated parts is faulty, and (b) it can be deployed as a fault-tolerant AES architecture. In addition, we emphasize area optimization of all four AES main functions to reduce the overhead associated with AES block replication. Simulation results show that the maximum frequency of our proposed AES architecture is 675.62 MHz, and that for AES-128 the throughput is 86.5 Gbps, which is 30.9% better than its closest existing competitor.
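
    A behavioral sketch of the replication-plus-auto-assigner idea: independent nonpipelined engines behind a round-robin dispatcher that skips units marked faulty. aes_encrypt_block is a placeholder, not a real AES kernel, and all names are illustrative:

        def aes_encrypt_block(block, key):
            return bytes(b ^ k for b, k in zip(block, key))  # placeholder only

        class ReplicatedAES:
            def __init__(self, n_units, key):
                self.key = key
                self.healthy = [True] * n_units   # fault status per engine
                self.nxt = 0

            def encrypt(self, block):
                n = len(self.healthy)
                for _ in range(n):                # find the next healthy engine
                    u = self.nxt
                    self.nxt = (self.nxt + 1) % n
                    if self.healthy[u]:
                        return u, aes_encrypt_block(block, self.key)
                raise RuntimeError("all AES units faulty")

        pool = ReplicatedAES(4, bytes(16))
        pool.healthy[2] = False                   # unit 2 fails; throughput drops,
        print(pool.encrypt(bytes(16)))            # but the stream keeps flowing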


    Pipelined-Scheduling of Multiple Embedded Applications on a Multi-Processor-SoC

    Due to clock and power constraints, it is hard to extract more performance out of single-core architectures, so multi-core systems are now the architecture of choice for providing the needed computing power. In embedded systems, the multi-processor system-on-a-chip (MPSoC) is widely used to run complex embedded applications effectively. To utilize an MPSoC well, however, tools that generate optimized schedules are badly needed. In this paper, we design an integrated approach to task scheduling and memory partitioning for multiple applications utilizing the MPSoC simultaneously, in contrast to the traditional decoupled approach that treats task scheduling and memory partitioning as two separate problems. Our framework is also based on pipelined scheduling to increase the throughput of the system. Results on different benchmarks show the effectiveness of our techniques.
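
    The throughput argument for pipelined scheduling in one line (a generic model, not the paper's scheduler): once tasks are grouped into stages, one per processor, successive input frames overlap, so the steady-state rate is set by the slowest stage rather than by the sum of all stages.

        def pipeline_metrics(stage_times):
            latency = sum(stage_times)   # time for one frame end-to-end
            period = max(stage_times)    # steady-state initiation interval
            return latency, period

        # e.g. four tasks mapped onto a 4-core MPSoC
        lat, per = pipeline_metrics([3.0, 5.0, 2.0, 4.0])
        print(f"latency={lat}, one result every {per} time units")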


    Improved Synthesis of Generalized Parallel Counters on FPGAs Using Only LUTs

    Generalized parallel counters (GPCs) are frequently used to construct high-speed compressor trees on field-programmable gate arrays (FPGAs). The introduction of the fast carry-chain in FPGAs has greatly improved the performance of these elements, and a large number of GPCs that use a combination of look-up tables (LUTs) and carry-chains have been proposed in the literature. In this paper, we take an alternate approach and eliminate the carry-chain from the GPC structure. We present a heuristic that synthesizes GPCs on FPGAs using only the general LUT fabric; the resulting GPCs are easily pipelined by placing registers at the output node of each LUT. We have applied the heuristic to various GPCs reported in prior work. It successfully eliminates the carry-chain from the GPC structure, at the cost of an increased LUT count for some GPCs. Experiments on Xilinx FPGAs show that filter systems constructed from our GPCs improve in speed and power while remaining comparable in area.
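
    For concreteness, the simplest GPC is a (6:3) counter: six input bits in, their population count out on three bits. In a LUT-only synthesis each output bit is a pure 6-input Boolean function, which is exactly what one 6-input LUT implements; registering the outputs gives the pipelining described above. A functional sketch:

        def gpc_6_3(bits):
            assert len(bits) == 6
            s = sum(bits)                 # 0..6 fits in three output bits
            return (s >> 2) & 1, (s >> 1) & 1, s & 1

        # Each output is a 6-input Boolean function of the inputs,
        # i.e. one 6-input LUT per output bit.
        print(gpc_6_3([1, 0, 1, 1, 0, 1]))   # four ones -> (1, 0, 0)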


    Design and Implementation of Face Detection Architecture for Heterogeneous System-on-Chip

    The seminal work of Viola and Jones on automatic face detection is widely used in many human–computer interaction and computer vision applications. On analyzing existing face detection architectures, we observed that integral image calculation, feature computation in the cascaded classifier, and recursive scanning of the image with a sliding window at multiple scales are the main contributors to the memory and time complexity of the algorithm. In this paper, we therefore propose a hardware–software co-design of the Viola–Jones face detector for System-on-Chip (SoC). In the proposed architecture, the integral image computation and cascaded classifier sub-modules are implemented in hardware, on the Programmable Logic FPGA (PL-FPGA), while the image scaling and non-maximum suppression sub-modules are implemented in software, on the Processing System ARM (PS-ARM). Pipelining, folding, and parallel processing are used to produce an optimized design. The architecture has been tested on a PYNQ-Z1 board and achieves a processing speed of 95 fps, with PL and PS clocks of 100 MHz and 650 MHz, respectively, for an image of QVGA resolution. Analysis of the results demonstrates that the proposed architecture has the lowest resource requirements compared to state-of-the-art implementations, which facilitates and promotes the use of the resource-constrained, low-cost ZYNQ SoC for face detection.
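
    The integral-image step moved into the PL is the classic one: ii[y][x] holds the sum of all pixels above and to the left of (x, y), so any box sum afterwards costs four lookups. A reference sketch of that computation (generic, not the paper's RTL):

        def integral_image(img):
            h, w = len(img), len(img[0])
            ii = [[0] * (w + 1) for _ in range(h + 1)]
            for y in range(h):
                for x in range(w):
                    ii[y+1][x+1] = img[y][x] + ii[y][x+1] + ii[y+1][x] - ii[y][x]
            return ii

        def box_sum(ii, x0, y0, x1, y1):
            """Sum of img[y0:y1][x0:x1] in O(1) via four corner lookups."""
            return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

        img = [[1, 2], [3, 4]]
        ii = integral_image(img)
        print(box_sum(ii, 0, 0, 2, 2))   # 10: the whole image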


    A Deep Learning Network-on-Chip (NoC)-Based Switch-Router to Enhance Information Security in Resource-Constrained Devices

    In the resource-constrained environment of the 21st century, hardware-based reconfigurable systems such as Field Programmable Gate Arrays (FPGAs) are considered an effective way to enhance information security. Compared with traditional custom circuitry, which offers no flexibility, reconfigurable hardware shows excellent potential for cyber security by increasing hardware speed and flexibility. In the quest to integrate multi-core systems, the Network-on-Chip (NoC) has therefore become one of the most widespread techniques for maximizing router security, although the significant chip-space overhead and power consumption of the routers make it substantially more expensive to construct than a bus-based system. The control component (CC) interacts with the networks, which inject packets based on router switching and activity; a control component is coupled with each network to produce a system of controlled networks. The system is further linked with a Centralized Fabric Manager (CFM), which serves as the network's focal point and runs the algorithm regularly. The analyzed parameters comprise flip-flops, power, latency, number of look-up tables (LUTs), and throughput. In the proposed method, the LUT area is 0.35 mm², the flip-flop area is 3.5 mm², the power is 3.4 μW, the latency is 5941 ns, and the achieved throughput is 0.56 flits/cycle. Results indicate that the crossbar switch reduces errors and minimizes delay at the architecture's output, improving the performance, power, throughput, and area-delay parameters. These findings can help enhance information security in lightweight devices and reduce the chance of network attacks in today's dynamic and complex cyberspace.


    EFFICIENCY THROUGH REDUCED COMMUNICATION IN MESSAGE PASSING SIMULATION OF NEURAL NETWORKS

    Neural algorithms require massive computation and very high communication bandwidth, and are naturally expressed at a level of granularity finer than parallel systems can exploit efficiently. Mapping neural networks onto parallel computers has traditionally implied some form of clustering of neurons and weights to increase the granularity. SIMD simulations may exceed a million connections per second using thousands of processors, but are often tailored to particular networks and learning algorithms. MIMD simulations require an even larger granularity to run efficiently and often trade flexibility for speed. An alternative technique based on pipelining fewer but larger messages through parallel “broadcast/accumulate trees” is explored. “Lazy” allocation of messages reduces communication and memory requirements, curbing excess parallelism at run time. The mapping is flexible to changes in network architecture and learning algorithm and is suited to a variety of computer configurations. The method pushes the limits of parallelizing backpropagation and other feed-forward algorithms. Results already exceed a million connections per second on 30 processors and are up to ten times better than previous results on similar hardware. The implementation techniques can also be applied in conjunction with others, including systolic and VLSI approaches.
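
    A serial simulation of the accumulate-tree pattern: partial weight-update vectors are summed pairwise up a binary tree, so each level halves the number of live messages, and a combined message is only formed once both children are present (the "lazy" part). This is a reconstruction of the pattern, not the authors' code:

        def tree_accumulate(partials):
            level = [list(p) for p in partials]
            while len(level) > 1:
                nxt = []
                for i in range(0, len(level) - 1, 2):
                    a, b = level[i], level[i + 1]
                    nxt.append([x + y for x, y in zip(a, b)])  # one fused message
                if len(level) % 2:
                    nxt.append(level[-1])   # odd node passes through unchanged
                level = nxt
            return level[0]

        # 30 "processors", each holding a partial update vector of length 4
        print(tree_accumulate([[1.0] * 4 for _ in range(30)]))   # [30.0, 30.0, 30.0, 30.0]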