Loading [MathJax]/jax/output/CommonHTML/jax.js
Skip main navigation

Cookies Notification

We use cookies on this site to enhance your user experience. By continuing to browse the site, you consent to the use of our cookies. Learn More
×

System Upgrade on Tue, May 28th, 2024 at 2am (EDT)

Existing users will be able to log into the site and access content. However, E-commerce and registration of new users may not be available for up to 12 hours.
For online purchase, please visit us again. Contact us at customercare@wspc.com for any enquiries.

SEARCH GUIDE  Download Search Tip PDF File

  • articleNo Access

    DATA MOVEMENT TECHNIQUES ON RECONFIGURABLE MESHES, WITH APPLICATIONS

    Data movement operations are central to many efficient algorithms for parallel machines constructed as interconnection networks of processors. The purpose of this work is to present a number of basic data movement techniques for reconfigurable meshes. These include computing the prefix maxima of a sequence of real numbers and computing the prefix sums of a binary sequence. The data movement techniques presented in this paper have far-reaching applications. Among them, we present a new sorting algorithm and an efficient algorithm to find all maximal elements of a planar set of points.

  • articleNo Access

    AN IMPROVED PARALLEL PREFIX ALGORITHM ON OTIS-MESH

    Wang and Sahni [4] reported two parallel algorithms for N-point prefix computation on an N-processor OTIS-Mesh optoelectronic computer. The overall time complexity for both SIMS and MIMD models of their first algorithm was shown to be (8 N1/4 - 1) electronic moves and 2 OTIS moves. This was further reduced to (7 N1/4 - 1) electronic moves and 2 OTIS moves in their second algorithm. We present here an improved parallel algorithm for N-point prefix computation on an N-processor OTIS-Mesh, which needs (5.5 N1/4 + 3) electronic moves and 2 OTIS moves. Our algorithm is based on the general theme of parallel prefix algorithm proposed in [4] but following the data distribution and local prefix computation similar to that of [1].

  • articleNo Access

    OPTIMIZATION OF LINKED LIST PREFIX COMPUTATIONS ON MULTITHREADED GPUS USING CUDA

    We present a number of optimization techniques to compute prefix sums on linked lists and implement them on the multithreaded GPUs Tesla C1060, Tesla C2050, and GTX480 using CUDA. Prefix computations on linked structures involve in general highly irregular fine grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M nodes to 256M nodes, and significantly improve on the recently published parallel implementations of list ranking, including implementations on the Cell Processor, the MTA-8, and the NVIDIA GT200 and Fermi series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060 and GTX480.

  • articleNo Access

    Fast Parallel Algorithm for Prefix Computation in Multi-Mesh Architecture

    A parallel algorithm for prefix computation on N data elements mapped on a Multi Mesh (MM) network of N=n4 processing elements is presented here. The time required by the proposed algorithm is significantly less than that by any of the existing algorithms for prefix computation on mesh-like architectures due to the specific interconnection pattern used in the MM network. The proposed technique requires O(N1/4) time for data communication and O(logN1/4) time for computation, when mapped on a MM network constituted by N1/2 meshes, each of size N1/4×N1/4. The data communication time in the proposed algorithm is less than the prefix sum algorithm proposed in extended Multi Mesh. To be precise, instead of (13N1/45)τ communication time the proposed algorithm requires a data communication time of 7.5N1/4τ only. Moreover, the proposed parallel algorithm does not need any extra inter block links as used in the extended Multi Mesh.