The Fast Fourier Transform is a mainstay of certain numerical techniques for solving fluid dynamics problems. We investigate the design of high-performance multidimensional SIMD parallel FFT algorithms targeting the Connection Machine CM-2. Critical algorithm design issues are discussed, the necessary machine performance measurements are identified and made, and the performance of the resulting FFT programs is measured. Our FFT programs are compared with CFFT2, currently the best Cray-2 FFT library program.
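The CM-2 kernels themselves are not reproduced in this abstract. For orientation, here is a minimal serial radix-2 decimation-in-time FFT in C; the butterfly stages in the second loop are the units of work that a SIMD design of this kind distributes across processors. The function name and in-place layout are illustrative choices, not the authors' code.

```c
#include <complex.h>
#include <math.h>
#include <stddef.h>

/* Minimal serial radix-2 decimation-in-time FFT, in place.
   n must be a power of two. Illustrative sketch only. */
void fft(double complex *x, size_t n)
{
    /* Bit-reversal permutation of the input. */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j |= bit;
        if (i < j) {
            double complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
    /* Butterfly stages: log2(n) passes with doubling span. */
    const double pi = acos(-1.0);
    for (size_t len = 2; len <= n; len <<= 1) {
        double complex w = cexp(-2.0 * I * pi / (double)len);
        for (size_t i = 0; i < n; i += len) {
            double complex wk = 1.0;
            for (size_t k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = wk * x[i + k + len / 2];
                x[i + k]           = u + v;  /* top of butterfly */
                x[i + k + len / 2] = u - v;  /* bottom of butterfly */
                wk *= w;
            }
        }
    }
}
```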
Supercomputers are coming into wider use for generating realistic imagery for commercial animation, special effects, and scientific simulation. The Connection Machine requires a more radical rethinking of rendering algorithms than previous supercomputers since it is not intended to function as a scalar processor. A fascinating mix of changes from conventional approaches is emerging. Some procedures can run virtually unchanged while others must be turned completely inside out. We have confidence in the viability of the Connection Machine as an architecture for high-end computer graphics. For complex scenes resulting in at least tens of thousands of polygons per frame, most steps of the rendering pipeline can make effective use of the massive number of processors available.
Early approaches to massively parallel graphics systems have focused on processor-per-pixel organizations. We show that a dynamic mix of organizations, including processor per pixel, processor per vertex, and processor per polygon, is necessary. Additionally, we note that an apparent consequence of the style of algorithm enforced by the Connection Machine is an enormously increased appetite for memory. We explore standard algorithms for image generation and note the differences that arise in an implementation for the Connection Machine. We conclude by comparing the viability of alternative computing environments for our application.
We develop an analytical model of grid communication on the Connection Machine for arbitrary grid geometries over exact power-of-two distances. There is excellent agreement between the model and the measured performance of the CM-2. We propose some conceptually simple enhancements to the virtual processor mechanism and to the grid communication microcode which could lead to substantial performance improvements for several important numerical algorithms.
This paper presents an outline of the Japanese national research project "High speed computing system for scientific and technological uses", not only in terms of its technological aspects but also in terms of its social significance and role.
Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort is designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message-passing overhead. Performance results are given for the implementation of queue-sort on a Connection Machine CM-2 and of barrel-sort on a 128-processor iPSC/860. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray Y-MP.
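Neither algorithm is given in pseudocode here. As a point of reference, the serial bucket (counting) sort below shows the histogram, prefix-sum, and scatter pattern that queue-sort parallelizes by queueing messages to destination buckets; the function name and the fixed key range are illustrative assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Serial bucket (counting) sort over keys in [0, range).
   Queue-sort can be read as a parallel version of the scatter
   phase, with messages queued to each destination bucket. */
void bucket_sort(int *keys, size_t n, int range)
{
    size_t *count = calloc((size_t)range + 1, sizeof *count);
    int *out = malloc(n * sizeof *out);
    if (!count || !out) { free(count); free(out); return; }

    for (size_t i = 0; i < n; i++)      /* histogram the keys */
        count[keys[i] + 1]++;
    for (int k = 0; k < range; k++)     /* prefix sum: bucket offsets */
        count[k + 1] += count[k];
    for (size_t i = 0; i < n; i++)      /* scatter into buckets */
        out[count[keys[i]]++] = keys[i];

    memcpy(keys, out, n * sizeof *keys);
    free(count);
    free(out);
}
```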
The CM-2 is an example of a connection machine. The strengths and problems of this implementation are considered. Important issues in the architecture and programming environment of connection machines in general are discussed. These are contrasted with the same issues in MIMD multiprocessors and multicomputers.
The communication penalty in parallel computation is related to the message startup time and the speed of data transmission between the host and the processing elements (PEs). We propose two algorithms in this paper to show that the first factor can be alleviated by reducing the number of messages, and the second by overlapping host-PE communication with computation on the PE array.
The algorithms perform 2^n consecutive sums of 2^n numbers each on a hypercube of degree n. The first algorithm leaves one sum on each processor. It takes n steps to complete the sums and reduces the number of messages generated by a PE from 2^n to n. The second algorithm sends all the sums back to the host as the sums are generated one by one. It takes 2^n + n − 1 steps to complete the sums in a pipeline, so that one sum is completed every step after the initial (n − 1) steps.
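As a rough illustration of the first algorithm (not the paper's implementation), the C sketch below simulates a recursive-halving reduce-scatter on a degree-n hypercube: each simulated PE starts with one addend of every sum, exchanges half of its partial-sum vector along each dimension in turn, and after n steps holds exactly one completed sum, matching the n-step, n-message behavior described above. Array indexing stands in for message passing.

```c
#include <stdio.h>
#include <string.h>

enum { DIM = 3, P = 1 << DIM };   /* hypercube degree n, P = 2^n PEs */

double pe[P][P];                  /* pe[p][j]: PE p's partial value of sum j */

/* Simulated recursive-halving reduce-scatter: PE p starts with one
   addend of every sum; after DIM steps it holds complete sum p.
   One (halved) message per PE per step, i.e. n messages in total. */
void reduce_scatter(void)
{
    int lo[P], hi[P];
    for (int p = 0; p < P; p++) { lo[p] = 0; hi[p] = P; }

    for (int d = DIM - 1; d >= 0; d--) {
        double snap[P][P];
        memcpy(snap, pe, sizeof pe);          /* pre-step snapshot */
        for (int p = 0; p < P; p++) {
            int partner = p ^ (1 << d);       /* neighbor along dim d */
            int mid = (lo[p] + hi[p]) / 2;
            if (p & (1 << d)) lo[p] = mid; else hi[p] = mid;
            /* "Receive" the partner's copy of the half we keep. */
            for (int j = lo[p]; j < hi[p]; j++)
                pe[p][j] += snap[partner][j];
        }
    }
}

int main(void)
{
    for (int p = 0; p < P; p++)
        for (int j = 0; j < P; j++)
            pe[p][j] = 1.0;                   /* every sum totals P */
    reduce_scatter();
    for (int p = 0; p < P; p++)
        printf("PE %d holds sum %d = %g\n", p, p, pe[p][p]);
    return 0;
}
```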
We apply our second algorithm to front-to-back composition in ray casting. For a large number of rays, the efficiency and speedup of our algorithm are close to the theoretically optimal values.
This paper describes the architecture of the General Purpose with Floating Point support (GPFP) processing element, which exploits the extra circuitry made available by VLSI advances to provide on-chip memory and cost-effective additional functionality. A major goal was to accelerate floating-point arithmetic. This was combined with the architectural aims of cost-effectiveness, achieving the floating-point capability with general-purpose units, and retaining the 1-bit manipulations available in the earlier generation.
With a 50 MHz clock, each PE is capable of 2.5 MegaFlops. Normalized to the same clock rate, the GPFP PE far exceeds the first-generation PEs, outperforming the DAP by a factor of 50 and the MPP by a factor of 20, and it also outperforms the recent MasPar design by a factor of four. A 32×32 GPFP array is capable of up to 2.5 GigaFlops and, on 32-bit additions, 6500 MIPS. These speedups are obtained by architectural features rather than increased data-path width, and they are combined with a parsimonious use of circuitry compatible with massively parallel fabrication.
The GPFP also incorporates Reconfigurable Local Control (RLC), a technique that combines a considerable degree of local autonomy within the PEs with microcode flexibility, giving the machine improved general-purpose programmability in addition to its floating-point numerical performance.
We present a vector-friendly blocked computing strategy for the lattice Boltzmann method (LBM). This strategy, along with a recently developed data structure, Structure of Arrays of Structures (SoAoS), is implemented for the multi-relaxation-time lattice Boltzmann (LB) scheme. The proposed methodology enables optimal memory-bandwidth utilization in the advection step and high compute efficiency in the collision step of the LB implementation. In a dense computing environment, the resulting performance optimization framework for the LBM achieves high single-core efficiency.
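The abstract does not define the SoAoS layout precisely; the sketch below shows one plausible reading for a D2Q9 lattice, in which the grid is tiled into fixed-size blocks and, within each block, each of the 9 distributions is stored as a contiguous, vector-width-friendly array. The block size, type names, and the placeholder relaxation are illustrative assumptions.

```c
enum {
    Q = 9,       /* D2Q9: distribution directions per cell        */
    BLK = 16     /* cells per block; a multiple of the SIMD width */
};

/* Structure of Arrays of Structures (SoAoS), one plausible reading:
   an array of blocks, where each block stores each distribution q
   for BLK cells contiguously, so block-local loops vectorize. */
typedef struct {
    double f[Q][BLK];    /* f[q][c]: distribution q of local cell c */
} Block;

/* Collision step over one block: every inner loop runs over BLK
   contiguous doubles, the access pattern SIMD units want. The
   relaxation is a single-parameter BGK-style placeholder; a real
   multi-relaxation-time collision would work in moment space. */
void collide_block(Block *b, const double *feq, double omega)
{
    for (int q = 0; q < Q; q++)
        for (int c = 0; c < BLK; c++)   /* unit stride, vectorizable */
            b->f[q][c] += omega * (feq[q] - b->f[q][c]);
}
```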
Much work has been done to optimize wavelet transforms for the SIMD extensions of modern CPUs. However, these approaches are mostly restricted to the vertical part of 2-D transforms with line-wise memory layouts, because that case leads to a rather straightforward SIMD implementation. Using a common wavelet filter as an example, this work presents new approaches to applying SIMD operations to 1-D transforms that achieve reasonable speedups. As a result, the performance of algorithms that use wavelet transforms, such as JPEG2000, can be increased significantly. Various parallelization variants are presented and compared, and their advantages and disadvantages for general filters are discussed.
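The paper's filter and SIMD mapping are not reproduced here. As a hedged illustration of the underlying difficulty and one remedy, the 5/3 (LeGall) lifting pass below first splits a row into even and odd samples so that both lifting loops become unit-stride and auto-vectorizable, instead of the stride-2 access of an interleaved in-place transform. The boundary handling is simplified to sample replication.

```c
#include <stddef.h>

/* One 1-D forward 5/3 (LeGall) integer lifting pass, illustrative.
   n must be even; low[] and high[] each receive n/2 results. */
void dwt53_row(const int *x, size_t n, int *low, int *high)
{
    size_t h = n / 2;

    /* Split: the deinterleave is the only non-contiguous access. */
    for (size_t i = 0; i < h; i++) {
        low[i]  = x[2 * i];       /* even samples */
        high[i] = x[2 * i + 1];   /* odd samples  */
    }
    /* Predict step: unit stride over both arrays. */
    for (size_t i = 0; i < h; i++) {
        int right = (i + 1 < h) ? low[i + 1] : low[i];  /* simplified edge */
        high[i] -= (low[i] + right) >> 1;
    }
    /* Update step: unit stride over both arrays. */
    for (size_t i = 0; i < h; i++) {
        int left = (i > 0) ? high[i - 1] : high[i];     /* simplified edge */
        low[i] += (left + high[i] + 2) >> 2;
    }
}
```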
The latency of broadcast/reduction operations has a significant impact on the performance of SIMD processors. This is especially true for associative programs, which make extensive use of global search operations. Previously, we developed a prototype associative SIMD processor that uses hardware multithreading to overcome the broadcast/reduction latency. In this paper we show, through simulations of the processor running an associative program, that hardware multithreading is able to improve performance by increasing system utilization, even for processors with hundreds or thousands of processing elements. However, the choice of thread scheduling policy used by the hardware is critical in determining the actual utilization achieved. We consider three thread scheduling policies and show that a thread scheduler that avoids issuing threads that will stall due to pipeline dependencies or thread synchronization operations is able to maintain system utilization independent of the number of threads.
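The three scheduling policies are not specified in the abstract. The sketch below illustrates the kind of policy the results favor: a round-robin issue stage that skips any thread whose next instruction would stall on a pending broadcast/reduction, a synchronization, or a pipeline dependence. All type and field names are hypothetical.

```c
#include <stdbool.h>

enum { NTHREADS = 8 };

/* Hypothetical per-thread state visible to the issue stage. */
typedef struct {
    bool waiting_reduce;   /* blocked on an in-flight broadcast/reduction */
    bool waiting_sync;     /* blocked on a thread-synchronization op      */
    bool has_hazard;       /* next instruction has a pipeline dependence  */
} Thread;

/* Stall-avoiding round-robin issue: starting after the thread issued
   last cycle, pick the first thread that will not stall. Returns the
   thread id to issue, or -1 if every thread is blocked (a bubble). */
int pick_thread(const Thread t[NTHREADS], int last)
{
    for (int i = 1; i <= NTHREADS; i++) {
        int id = (last + i) % NTHREADS;
        if (!t[id].waiting_reduce && !t[id].waiting_sync && !t[id].has_hazard)
            return id;     /* ready: issue without stalling the pipeline */
    }
    return -1;             /* all threads blocked: utilization drops */
}
```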
The polyhedral model for loop parallelization has proved to be an effective tool for advanced optimization and automatic parallelization of programs written in higher-level languages. Yet, to integrate such optimizations seamlessly into production compilers, they must be performed on the compiler's internal, low-level intermediate representation (IR). With Polly, we present an infrastructure for polyhedral optimizations on such an IR. We describe the detection of program parts amenable to polyhedral optimization (so-called static control parts), their translation to a Z-polyhedral representation, optimizations on this representation, and the generation of optimized IR code. Furthermore, we define an interface for connecting external optimizers and present a novel way of using the parallelism they introduce to generate SIMD and OpenMP code. To evaluate Polly, we compile the PolyBench 2.0 benchmarks fully automatically with PLuTo as the external optimizer and parallelizer, and we report significant speedups.
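As an illustration, the loop nest below is the kind of static control part Polly can detect: loop bounds and array subscripts are affine in the surrounding induction variables and parameters. The invocation in the comment reflects the commonly documented way to enable Polly in Clang; exact flag spellings vary across LLVM versions, so treat them as an assumption.

```c
/* A static control part (SCoP): affine bounds and subscripts only,
   so Polly can model it as a Z-polyhedron, tile it, and parallelize it.
   Typical invocation (flag names may vary by LLVM/Polly version):
     clang -O3 -mllvm -polly -mllvm -polly-parallel -lgomp matmul.c */
void matmul(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];   /* affine accesses */
        }
}
```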
This paper deals with the problem of finding the K smallest elements of a totally ordered but unsorted set of N elements. This problem, called K-selection, arises often in statistics, image processing, and distributed computing. We propose two algorithms to solve this problem on hypercubes. Our first algorithm is asymptotically optimal when K = O((log N)^β) for any constant β. The second enlarges the range of optimality to K = N^ε, ε < 1, using a recursive strategy. These are major improvements on previous results.
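The hypercube algorithms' internals are not given in the abstract. As a serial baseline for the same problem, the quickselect sketch below rearranges an array so that its first K entries are the K smallest; the paper's contribution is to achieve this selection with optimal parallel cost on hypercubes, which this sketch does not attempt.

```c
#include <stddef.h>

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Serial quickselect baseline: rearrange v[0..n-1] so the k smallest
   elements end up in v[0..k-1] (unordered). Expected O(N) time. */
void k_select(int *v, size_t n, size_t k)
{
    size_t lo = 0, hi = n;               /* active range [lo, hi) */
    while (hi - lo > 1) {
        /* Lomuto partition around the last element as pivot. */
        int pivot = v[hi - 1];
        size_t store = lo;
        for (size_t i = lo; i + 1 < hi; i++)
            if (v[i] < pivot)
                swap(&v[i], &v[store++]);
        swap(&v[store], &v[hi - 1]);     /* pivot to its final slot */

        if (k <= store) hi = store;      /* k smallest lie left of pivot */
        else if (k == store + 1) return; /* pivot completes the k set */
        else lo = store + 1;             /* need more from the right */
    }
}
```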
This paper describes the implementation of a parallel image-processing algorithm whose aim is to provide good contrast enhancement in real time, especially on the boundaries of an object of interest defined by grey-level homogeneity (for example, an object of medical interest with functional or morphological homogeneity, such as a bone or a tumor). The neural-network algorithm that performs this contrast enhancement has been implemented on a massively parallel SIMD machine (a MasPar with 8192 processors), and the communication between its processors has been optimized.
Strategies for computing the continuous wavelet transform on massively parallel SIMD arrays are introduced and discussed. The different approaches are theoretically assessed and the results of implementations on a MasPar MP-2 are compared.
An approach for designing a hybrid parallel system that can adaptively perform different levels of parallelism is presented. An adaptive parallel computer vision system (APVIS) is proposed to attain this goal. The APVIS is constructed by tightly integrating two different types of parallel architectures, i.e. a multiprocessor-based system (MBS) and a memory-based processor array (MPA), into a single machine. One important feature of the APVIS is that the programming interface for executing data-parallel code on the MPA is the same as the usual subroutine-calling mechanism. Thus the existence of the MPA is transparent to programmers. The goal of this research is to design an underlying base architecture on which a broad range of vision tasks can be executed optimally. A performance model is provided to show the effectiveness of the APVIS. It turns out that the proposed APVIS can provide significant performance improvement and cost effectiveness for highly parallel applications having a mixed set of parallelisms. In addition, an example application composed of a series of vision algorithms, spanning low-level and medium-level processing steps, is mapped onto the MPA. Consequently, an APVIS with a few or tens of MPA modules can perform the chosen example application in real time when multiple images arrive successively with an inter-arrival time of a few seconds.
Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for novel, much more efficient algorithms remains high. The interaction sorting (IS) algorithm, introduced in 2007, therefore attracted considerable interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the modification provides an additional performance gain of 9% to 45% over the original IS method.
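The IS data layout is not spelled out in the abstract. The sketch below conveys the general principle behind making such a method SIMD-friendly: once interactions are sorted so that each particle's candidate partners occupy contiguous coordinate arrays, the inner distance loop becomes unit-stride and vectorizable. All names, the placeholder force law, and the branch-free cutoff are illustrative assumptions, not the published algorithm.

```c
#include <stddef.h>

/* After an interaction-sorting pass, particle i's candidate partner
   coordinates are assumed gathered into contiguous arrays px/py/pz.
   The inner loop is then unit-stride arithmetic with no indirection,
   which a compiler can map onto SIMD instructions. Illustrative only. */
void accumulate_forces(double xi, double yi, double zi,
                       const double *px, const double *py, const double *pz,
                       size_t npart, double cutoff2,
                       double *fx, double *fy, double *fz)
{
    double ax = 0.0, ay = 0.0, az = 0.0;
    for (size_t j = 0; j < npart; j++) {       /* vectorizable */
        double dx = xi - px[j];
        double dy = yi - py[j];
        double dz = zi - pz[j];
        double r2 = dx * dx + dy * dy + dz * dz;
        /* Select-style cutoff: pairs beyond the cutoff contribute 0,
           keeping all SIMD lanes active. Placeholder 1/r^2 force law. */
        double w = (r2 < cutoff2 && r2 > 0.0) ? 1.0 / r2 : 0.0;
        ax += w * dx;
        ay += w * dy;
        az += w * dz;
    }
    *fx += ax; *fy += ay; *fz += az;
}
```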