Please login to be able to save your searches and receive alerts for new content matching your search criteria.
Data movement operations are central to many efficient algorithms for parallel machines constructed as interconnection networks of processors. The purpose of this work is to present a number of basic data movement techniques for reconfigurable meshes. These include computing the prefix maxima of a sequence of real numbers and computing the prefix sums of a binary sequence. The data movement techniques presented in this paper have far-reaching applications. Among them, we present a new sorting algorithm and an efficient algorithm to find all maximal elements of a planar set of points.
Wang and Sahni [4] reported two parallel algorithms for N-point prefix computation on an N-processor OTIS-Mesh optoelectronic computer. The overall time complexity for both SIMS and MIMD models of their first algorithm was shown to be (8 N1/4 - 1) electronic moves and 2 OTIS moves. This was further reduced to (7 N1/4 - 1) electronic moves and 2 OTIS moves in their second algorithm. We present here an improved parallel algorithm for N-point prefix computation on an N-processor OTIS-Mesh, which needs (5.5 N1/4 + 3) electronic moves and 2 OTIS moves. Our algorithm is based on the general theme of parallel prefix algorithm proposed in [4] but following the data distribution and local prefix computation similar to that of [1].
We present a number of optimization techniques to compute prefix sums on linked lists and implement them on the multithreaded GPUs Tesla C1060, Tesla C2050, and GTX480 using CUDA. Prefix computations on linked structures involve in general highly irregular fine grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M nodes to 256M nodes, and significantly improve on the recently published parallel implementations of list ranking, including implementations on the Cell Processor, the MTA-8, and the NVIDIA GT200 and Fermi series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060 and GTX480.
A parallel algorithm for prefix computation on N data elements mapped on a Multi Mesh (MM) network of N=n4 processing elements is presented here. The time required by the proposed algorithm is significantly less than that by any of the existing algorithms for prefix computation on mesh-like architectures due to the specific interconnection pattern used in the MM network. The proposed technique requires O(N1/4) time for data communication and O(logN1/4) time for computation, when mapped on a MM network constituted by N1/2 meshes, each of size N1/4×N1/4. The data communication time in the proposed algorithm is less than the prefix sum algorithm proposed in extended Multi Mesh. To be precise, instead of (13N1/4−5)τ communication time the proposed algorithm requires a data communication time of 7.5N1/4τ only. Moreover, the proposed parallel algorithm does not need any extra inter block links as used in the extended Multi Mesh.