The Fast Fourier Transform is a mainstay of certain numerical techniques for solving fluid dynamics problems. We investigate the design of high-performance multidimensional SIMD parallel FFT algorithms targeting the Connection Machine CM-2. Critical algorithm design issues are discussed, the necessary machine performance measurements are identified and made, and the performance of the resulting FFT programs is measured. Our FFT programs are compared with CFFT2, currently the best Cray-2 FFT library program.
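The CM-2 kernels themselves are not reproduced in this abstract. For orientation, here is a minimal serial radix-2 decimation-in-time FFT in C; the butterfly stages in the second loop are the units of work that a SIMD design of this kind distributes across processors. The function name and in-place layout are illustrative choices, not the authors' code.

```c
#include <complex.h>
#include <math.h>
#include <stddef.h>

/* Minimal serial radix-2 decimation-in-time FFT, in place.
   n must be a power of two. Illustrative sketch only. */
void fft(double complex *x, size_t n)
{
    /* Bit-reversal permutation of the input. */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j |= bit;
        if (i < j) {
            double complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
    /* Butterfly stages: log2(n) passes with doubling span. */
    const double pi = acos(-1.0);
    for (size_t len = 2; len <= n; len <<= 1) {
        double complex w = cexp(-2.0 * I * pi / (double)len);
        for (size_t i = 0; i < n; i += len) {
            double complex wk = 1.0;
            for (size_t k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = wk * x[i + k + len / 2];
                x[i + k]           = u + v;  /* top of butterfly */
                x[i + k + len / 2] = u - v;  /* bottom of butterfly */
                wk *= w;
            }
        }
    }
}
```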
Supercomputers are coming into wider use for generating realistic imagery for commercial animation, special effects, and scientific simulation. The Connection Machine requires a more radical rethinking of rendering algorithms than previous supercomputers since it is not intended to function as a scalar processor. A fascinating mix of changes from conventional approaches is emerging. Some procedures can run virtually unchanged while others must be turned completely inside out. We have confidence in the viability of the Connection Machine as an architecture for high-end computer graphics. For complex scenes resulting in at least tens of thousands of polygons per frame, most steps of the rendering pipeline can make effective use of the massive number of processors available.
Early approaches to massively parallel graphics systems have focused on processor-per-pixel organizations. We show that a dynamic mix of organizations, including processor per pixel, processor per vertex, and processor per polygon, is necessary. Additionally, we note that an apparent consequence of the style of algorithm enforced by the Connection Machine is an enormously increased appetite for memory. We explore standard algorithms for image generation and note the differences that arise in an implementation for the Connection Machine. We conclude by comparing the viability of alternative computing environments for our application.
We develop an analytical model of grid communication on the Connection Machine for arbitrary grid geometries over exact power-of-two distances. There is excellent agreement between the model and the measured performance of the CM-2. We propose some conceptually simple enhancements to the virtual processor mechanism and to the grid communication microcode which could lead to substantial performance improvements for several important numerical algorithms.
This paper presents an outline of the Japanese national research project "High speed computing system for scientific and technological uses", not only in terms of its technological aspects but also in terms of its social significance and role.
Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort is designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message-passing overhead. Performance results are given for the implementation of queue-sort on a Connection Machine CM-2 and of barrel-sort on a 128-processor iPSC/860. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray Y-MP.
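Neither algorithm is given in pseudocode here. As a point of reference, the serial bucket (counting) sort below shows the histogram, prefix-sum, and scatter pattern that queue-sort parallelizes by queueing messages to destination buckets; the function name and the fixed key range are illustrative assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Serial bucket (counting) sort over keys in [0, range).
   Queue-sort can be read as a parallel version of the scatter
   phase, with messages queued to each destination bucket. */
void bucket_sort(int *keys, size_t n, int range)
{
    size_t *count = calloc((size_t)range + 1, sizeof *count);
    int *out = malloc(n * sizeof *out);
    if (!count || !out) { free(count); free(out); return; }

    for (size_t i = 0; i < n; i++)      /* histogram the keys */
        count[keys[i] + 1]++;
    for (int k = 0; k < range; k++)     /* prefix sum: bucket offsets */
        count[k + 1] += count[k];
    for (size_t i = 0; i < n; i++)      /* scatter into buckets */
        out[count[keys[i]]++] = keys[i];

    memcpy(keys, out, n * sizeof *keys);
    free(count);
    free(out);
}
```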
The CM-2 is an example of a connection machine. The strengths and problems of this implementation are considered. Important issues in the architecture and programming environment of connection machines in general are discussed. These are contrasted with the same issues in MIMD multiprocessors and multicomputers.
The communication penalty in parallel computation is related to the message startup time and the speed of data transmission between the host and the processing elements (PEs). We propose two algorithms in this paper to show that the first factor can be alleviated by reducing the number of messages, and the second by overlapping host-PE communication with computation on the PE array.
The algorithms perform 2^n consecutive sums of 2^n numbers each on a hypercube of degree n. The first algorithm leaves one sum on each processor. It takes n steps to complete the sums and reduces the number of messages generated by a PE from 2^n to n. The second algorithm sends all the sums back to the host as the sums are generated one by one. It takes 2^n + n − 1 steps to complete the sums in a pipeline, so that one sum is completed every step after the initial (n − 1) steps.
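As a rough illustration of the first algorithm (not the paper's implementation), the C sketch below simulates a recursive-halving reduce-scatter on a degree-n hypercube: each simulated PE starts with one addend of every sum, exchanges half of its partial-sum vector along each dimension in turn, and after n steps holds exactly one completed sum, matching the n-step, n-message behavior described above. Array indexing stands in for message passing.

```c
#include <stdio.h>
#include <string.h>

enum { DIM = 3, P = 1 << DIM };   /* hypercube degree n, P = 2^n PEs */

double pe[P][P];                  /* pe[p][j]: PE p's partial value of sum j */

/* Simulated recursive-halving reduce-scatter: PE p starts with one
   addend of every sum; after DIM steps it holds complete sum p.
   One (halved) message per PE per step, i.e. n messages in total. */
void reduce_scatter(void)
{
    int lo[P], hi[P];
    for (int p = 0; p < P; p++) { lo[p] = 0; hi[p] = P; }

    for (int d = DIM - 1; d >= 0; d--) {
        double snap[P][P];
        memcpy(snap, pe, sizeof pe);          /* pre-step snapshot */
        for (int p = 0; p < P; p++) {
            int partner = p ^ (1 << d);       /* neighbor along dim d */
            int mid = (lo[p] + hi[p]) / 2;
            if (p & (1 << d)) lo[p] = mid; else hi[p] = mid;
            /* "Receive" the partner's copy of the half we keep. */
            for (int j = lo[p]; j < hi[p]; j++)
                pe[p][j] += snap[partner][j];
        }
    }
}

int main(void)
{
    for (int p = 0; p < P; p++)
        for (int j = 0; j < P; j++)
            pe[p][j] = 1.0;                   /* every sum totals P */
    reduce_scatter();
    for (int p = 0; p < P; p++)
        printf("PE %d holds sum %d = %g\n", p, p, pe[p][p]);
    return 0;
}
```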
We apply our second algorithm to front-to-back composition in ray casting. For a large number of rays, the efficiency and speedup of our algorithm are close to the theoretically optimal values.
This paper describes the architecture of the General Purpose with Floating Point support (GPFP) processing element, which exploits the extra circuitry made available by VLSI advances to provide on-chip memory and cost-effective additional functionality. A major goal was to accelerate floating-point arithmetic. This was combined with the architectural aims of cost-effectiveness, achieving the floating-point capability with general-purpose units, and retaining the 1-bit manipulations available in the earlier generation.
With a 50 MHz clock, each PE is capable of 2.5 MegaFlops. Normalized to the same clock rate, the GPFP PE far exceeds the first-generation PEs, outperforming the DAP by a factor of 50 and the MPP by a factor of 20, and it also outperforms the recent MasPar design by a factor of four. A 32×32 GPFP array is capable of up to 2.5 GigaFlops and, on 32-bit additions, 6500 MIPS. These speedups are obtained by architectural features rather than increased data-path width, and they are combined with a parsimonious use of circuitry compatible with massively parallel fabrication.
The GPFP also incorporates Reconfigurable Local Control (RLC), a technique that combines a considerable degree of local autonomy within the PEs with microcode flexibility, giving the machine improved general-purpose programmability in addition to its floating-point numerical performance.
We present a vector-friendly blocked computing strategy for the lattice Boltzmann method (LBM). This strategy, along with a recently developed data structure, Structure of Arrays of Structures (SoAoS), is implemented for the multi-relaxation-time lattice Boltzmann (LB) scheme. The proposed methodology enables optimal memory-bandwidth utilization in the advection step and high compute efficiency in the collision step of the LB implementation. In a dense computing environment, the resulting performance optimization framework for the LBM achieves high single-core efficiency.
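The abstract does not define the SoAoS layout precisely; the sketch below shows one plausible reading for a D2Q9 lattice, in which the grid is tiled into fixed-size blocks and, within each block, each of the 9 distributions is stored as a contiguous, vector-width-friendly array. The block size, type names, and the placeholder relaxation are illustrative assumptions.

```c
enum {
    Q = 9,       /* D2Q9: distribution directions per cell        */
    BLK = 16     /* cells per block; a multiple of the SIMD width */
};

/* Structure of Arrays of Structures (SoAoS), one plausible reading:
   an array of blocks, where each block stores each distribution q
   for BLK cells contiguously, so block-local loops vectorize. */
typedef struct {
    double f[Q][BLK];    /* f[q][c]: distribution q of local cell c */
} Block;

/* Collision step over one block: every inner loop runs over BLK
   contiguous doubles, the access pattern SIMD units want. The
   relaxation is a single-parameter BGK-style placeholder; a real
   multi-relaxation-time collision would work in moment space. */
void collide_block(Block *b, const double *feq, double omega)
{
    for (int q = 0; q < Q; q++)
        for (int c = 0; c < BLK; c++)   /* unit stride, vectorizable */
            b->f[q][c] += omega * (feq[q] - b->f[q][c]);
}
```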
Much work has been done to optimize wavelet transforms for the SIMD extensions of modern CPUs. However, these approaches are mostly restricted to the vertical part of 2-D transforms with line-wise memory layouts, because that case leads to a rather straightforward SIMD implementation. Using a common wavelet filter as an example, this work presents new approaches to applying SIMD operations to 1-D transforms that achieve reasonable speedups. As a result, the performance of algorithms that use wavelet transforms, such as JPEG2000, can be increased significantly. Various parallelization variants are presented and compared, and their advantages and disadvantages for general filters are discussed.
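The paper's filter and SIMD mapping are not reproduced here. As a hedged illustration of the underlying difficulty and one remedy, the 5/3 (LeGall) lifting pass below first splits a row into even and odd samples so that both lifting loops become unit-stride and auto-vectorizable, instead of the stride-2 access of an interleaved in-place transform. The boundary handling is simplified to sample replication.

```c
#include <stddef.h>

/* One 1-D forward 5/3 (LeGall) integer lifting pass, illustrative.
   n must be even; low[] and high[] each receive n/2 results. */
void dwt53_row(const int *x, size_t n, int *low, int *high)
{
    size_t h = n / 2;

    /* Split: the deinterleave is the only non-contiguous access. */
    for (size_t i = 0; i < h; i++) {
        low[i]  = x[2 * i];       /* even samples */
        high[i] = x[2 * i + 1];   /* odd samples  */
    }
    /* Predict step: unit stride over both arrays. */
    for (size_t i = 0; i < h; i++) {
        int right = (i + 1 < h) ? low[i + 1] : low[i];  /* simplified edge */
        high[i] -= (low[i] + right) >> 1;
    }
    /* Update step: unit stride over both arrays. */
    for (size_t i = 0; i < h; i++) {
        int left = (i > 0) ? high[i - 1] : high[i];     /* simplified edge */
        low[i] += (left + high[i] + 2) >> 2;
    }
}
```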
The latency of broadcast/reduction operations has a significant impact on the performance of SIMD processors. This is especially true for associative programs, which make extensive use of global search operations. Previously, we developed a prototype associative SIMD processor that uses hardware multithreading to overcome the broadcast/reduction latency. In this paper we show, through simulations of the processor running an associative program, that hardware multithreading is able to improve performance by increasing system utilization, even for processors with hundreds or thousands of processing elements. However, the choice of thread scheduling policy used by the hardware is critical in determining the actual utilization achieved. We consider three thread scheduling policies and show that a thread scheduler that avoids issuing threads that will stall due to pipeline dependencies or thread synchronization operations is able to maintain system utilization independent of the number of threads.
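The three scheduling policies are not specified in the abstract. The sketch below illustrates the kind of policy the results favor: a round-robin issue stage that skips any thread whose next instruction would stall on a pending broadcast/reduction, a synchronization, or a pipeline dependence. All type and field names are hypothetical.

```c
#include <stdbool.h>

enum { NTHREADS = 8 };

/* Hypothetical per-thread state visible to the issue stage. */
typedef struct {
    bool waiting_reduce;   /* blocked on an in-flight broadcast/reduction */
    bool waiting_sync;     /* blocked on a thread-synchronization op      */
    bool has_hazard;       /* next instruction has a pipeline dependence  */
} Thread;

/* Stall-avoiding round-robin issue: starting after the thread issued
   last cycle, pick the first thread that will not stall. Returns the
   thread id to issue, or -1 if every thread is blocked (a bubble). */
int pick_thread(const Thread t[NTHREADS], int last)
{
    for (int i = 1; i <= NTHREADS; i++) {
        int id = (last + i) % NTHREADS;
        if (!t[id].waiting_reduce && !t[id].waiting_sync && !t[id].has_hazard)
            return id;     /* ready: issue without stalling the pipeline */
    }
    return -1;             /* all threads blocked: utilization drops */
}
```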
The polyhedral model for loop parallelization has proved to be an effective tool for advanced optimization and automatic parallelization of programs written in higher-level languages. Yet, to integrate such optimizations seamlessly into production compilers, they must be performed on the compiler's internal, low-level intermediate representation (IR). With Polly, we present an infrastructure for polyhedral optimizations on such an IR. We describe the detection of program parts amenable to polyhedral optimization (so-called static control parts), their translation to a Z-polyhedral representation, optimizations on this representation, and the generation of optimized IR code. Furthermore, we define an interface for connecting external optimizers and present a novel way of using the parallelism they introduce to generate SIMD and OpenMP code. To evaluate Polly, we compile the PolyBench 2.0 benchmarks fully automatically with PLuTo as the external optimizer and parallelizer, and we report significant speedups.
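As an illustration, the loop nest below is the kind of static control part Polly can detect: loop bounds and array subscripts are affine in the surrounding induction variables and parameters. The invocation in the comment reflects the commonly documented way to enable Polly in Clang; exact flag spellings vary across LLVM versions, so treat them as an assumption.

```c
/* A static control part (SCoP): affine bounds and subscripts only,
   so Polly can model it as a Z-polyhedron, tile it, and parallelize it.
   Typical invocation (flag names may vary by LLVM/Polly version):
     clang -O3 -mllvm -polly -mllvm -polly-parallel -lgomp matmul.c */
void matmul(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];   /* affine accesses */
        }
}
```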
This paper deals with the problem of finding the K smallest elements of a totally ordered but unsorted set of N elements. This problem, called K-selection, arises often in statistics, image processing, and distributed computing. We propose two algorithms to solve this problem on hypercubes. Our first algorithm is asymptotically optimal when K = O((log N)^β) for any constant β. The second enlarges the range of optimality to K = N^ε, ε < 1, using a recursive strategy. These are major improvements on previous results.
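The hypercube algorithms' internals are not given in the abstract. As a serial baseline for the same problem, the quickselect sketch below rearranges an array so that its first K entries are the K smallest; the paper's contribution is to achieve this selection with optimal parallel cost on hypercubes, which this sketch does not attempt.

```c
#include <stddef.h>

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Serial quickselect baseline: rearrange v[0..n-1] so the k smallest
   elements end up in v[0..k-1] (unordered). Expected O(N) time. */
void k_select(int *v, size_t n, size_t k)
{
    size_t lo = 0, hi = n;               /* active range [lo, hi) */
    while (hi - lo > 1) {
        /* Lomuto partition around the last element as pivot. */
        int pivot = v[hi - 1];
        size_t store = lo;
        for (size_t i = lo; i + 1 < hi; i++)
            if (v[i] < pivot)
                swap(&v[i], &v[store++]);
        swap(&v[store], &v[hi - 1]);     /* pivot to its final slot */

        if (k <= store) hi = store;      /* k smallest lie left of pivot */
        else if (k == store + 1) return; /* pivot completes the k set */
        else lo = store + 1;             /* need more from the right */
    }
}
```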
This paper describes the implementation of a parallel image-processing algorithm whose aim is to provide good contrast enhancement in real time, especially on the boundaries of an object of interest defined by grey-level homogeneity (for example, an object of medical interest with functional or morphological homogeneity, such as a bone or a tumor). The neural-network algorithm that performs this contrast enhancement has been implemented on a massively parallel SIMD machine (a MasPar with 8192 processors), and the communication between its processors has been optimized.
Strategies for computing the continuous wavelet transform on massively parallel SIMD arrays are introduced and discussed. The different approaches are theoretically assessed and the results of implementations on a MasPar MP-2 are compared.
An approach for designing a hybrid parallel system that can adaptively perform different levels of parallelism is presented. An adaptive parallel computer vision system (APVIS) is proposed to attain this goal. The APVIS is constructed by tightly integrating two different types of parallel architectures, i.e. a multiprocessor-based system (MBS) and a memory-based processor array (MPA), into a single machine. One important feature of the APVIS is that the programming interface for executing data-parallel code on the MPA is the same as the usual subroutine-calling mechanism. Thus the existence of the MPA is transparent to programmers. The goal of this research is to design an underlying base architecture on which a broad range of vision tasks can be executed optimally. A performance model is provided to show the effectiveness of the APVIS. It turns out that the proposed APVIS can provide significant performance improvement and cost effectiveness for highly parallel applications having a mixed set of parallelisms. In addition, an example application composed of a series of vision algorithms, spanning low-level and medium-level processing steps, is mapped onto the MPA. Consequently, an APVIS with a few or tens of MPA modules can perform the chosen example application in real time when multiple images arrive successively with an inter-arrival time of a few seconds.
Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for novel, much more efficient algorithms remains high. The interaction sorting (IS) algorithm, introduced in 2007, therefore attracted considerable interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the modification provides an additional performance gain of 9% to 45% over the original IS method.
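The IS data layout is not spelled out in the abstract. The sketch below conveys the general principle behind making such a method SIMD-friendly: once interactions are sorted so that each particle's candidate partners occupy contiguous coordinate arrays, the inner distance loop becomes unit-stride and vectorizable. All names, the placeholder force law, and the branch-free cutoff are illustrative assumptions, not the published algorithm.

```c
#include <stddef.h>

/* After an interaction-sorting pass, particle i's candidate partner
   coordinates are assumed gathered into contiguous arrays px/py/pz.
   The inner loop is then unit-stride arithmetic with no indirection,
   which a compiler can map onto SIMD instructions. Illustrative only. */
void accumulate_forces(double xi, double yi, double zi,
                       const double *px, const double *py, const double *pz,
                       size_t npart, double cutoff2,
                       double *fx, double *fy, double *fz)
{
    double ax = 0.0, ay = 0.0, az = 0.0;
    for (size_t j = 0; j < npart; j++) {       /* vectorizable */
        double dx = xi - px[j];
        double dy = yi - py[j];
        double dz = zi - pz[j];
        double r2 = dx * dx + dy * dy + dz * dz;
        /* Select-style cutoff: pairs beyond the cutoff contribute 0,
           keeping all SIMD lanes active. Placeholder 1/r^2 force law. */
        double w = (r2 < cutoff2 && r2 > 0.0) ? 1.0 / r2 : 0.0;
        ax += w * dx;
        ay += w * dy;
        az += w * dz;
    }
    *fx += ax; *fy += ay; *fz += az;
}
```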