The near future will see the increased use of parallel computing technologies at all levels of mainstream computing. Computer hardware increasingly employs parallel techniques to improve computing power for the solution of large-scale, compute-intensive applications, while cluster and grid technologies make high-speed computing facilities possible at vastly reduced cost.
These developments can be expected to result in the extended use of all types of parallel computers in virtually all areas of human endeavour. Compute-intensive problems in emerging areas such as financial modelling, data mining and multimedia systems, in addition to traditional application areas of parallel computing such as scientific computing and simulation, will lead to further progress. Parallel computing as a field of scientific research and development has already become one of the fundamental computing technologies.
This book gives an overview of new developments in parallel computing at the start of the 21st century, as well as a perspective on future developments.
https://doi.org/10.1142/9781860949630_fmatter
COMMITTEES.
PREFACE.
CONTENTS.
https://doi.org/10.1142/9781860949630_0001
This paper discusses middleware under development which couples cluster system information with the specifics of a user problem to launch cluster-based applications on the best set of resources available. The user is responsible for stating his numerical problem and is assumed to be working in a serial environment. He calls the middleware to execute his application. The middleware assesses the possibility of solving the problem faster on some subset of available resources, based on information describing the state of the system. If so, the user’s data is redistributed over the subset of processors, the problem is executed in parallel, and the solution is returned to the user. If parallel execution would be no faster, or even slower, the user’s problem is solved with the best appropriate serial routine. The feasibility of this approach is empirically investigated on a typical target system, and results which validate the method are reported.
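A minimal sketch of this serial-versus-parallel decision, assuming a purely illustrative cost model (the function names, constants and parameters below are hypothetical, not taken from the paper):

```c
#include <stdio.h>

/* Illustrative cost model: all functions and constants are hypothetical. */
double est_serial_time(long n)                 { return 1.0e-8 * n * n; }
double est_redistribution_time(long n, int p)  { return 1.0e-7 * n + 1.0e-3 * p; }
double est_parallel_time(long n, int p)        { return est_serial_time(n) / p + 1.0e-4 * p; }

/* Return the number of processors to use: 1 means "stay serial". */
int choose_resources(long n, int p_available)
{
    double best = est_serial_time(n);
    int best_p = 1;
    for (int p = 2; p <= p_available; ++p) {
        double t = est_redistribution_time(n, p) + est_parallel_time(n, p);
        if (t < best) { best = t; best_p = p; }
    }
    return best_p;
}

int main(void)
{
    long n = 20000;                   /* problem size stated by the user      */
    int p = choose_resources(n, 16);  /* processors reported as available     */
    printf("run on %d processor(s)\n", p);
    return 0;
}
```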
https://doi.org/10.1142/9781860949630_0002
No abstract received.
https://doi.org/10.1142/9781860949630_0003
No abstract received.
https://doi.org/10.1142/9781860949630_0004
The efficient execution of scientific simulations on HPC systems requires a partitioning of the underlying mesh among the processors such that the load is balanced and the inter-processor communication is minimized. Graph partitioning algorithms have been applied with much success for this purpose. However, the parallelization of multi-phase and multi-physics computations poses new challenges that require fundamental advances in graph partitioning technology. In addition, most existing graph partitioning algorithms are not suited for the newer heterogeneous high-performance computing platforms. This talk will describe research efforts in our group that are focused on developing novel multi-constraint and multi-objective graph partitioning algorithms that can support the advancing state-of-the-art in numerical simulation technologies. In addition, we will present our preliminary work on new partitioning algorithms that are well suited for heterogeneous architectures.
https://doi.org/10.1142/9781860949630_0005
No abstract received.
https://doi.org/10.1142/9781860949630_0006
The determination of physical properties of flavor singlet objects like the η′ meson by computer simulation requires the computation of functionals of the inverse fermionic matrix M⁻¹. So far, only stochastic methods could cope with the enormous size of M. In this paper, we introduce an alternative approach which is based on the computation of a subset of low-lying eigenmodes of the fermionic matrix. The high quality of this ‘truncated eigenmode approximation’ (TEA) is demonstrated by comparison with the pion correlator, a flavor octet quantity, which is readily computable through a linear system of equations. We show that TEA can successfully approximate the flavor singlet η′ correlator. We find that the systematic error of the method is tolerable. As the determination of the chosen subset of 300 eigenmodes requires about 3.5 Tflops-hours of CPU time per canonical ensemble and at least 15 GBytes of memory, the power of high-end supercomputers like the CRAY T3E is indispensable.
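Schematically, the truncated eigenmode approximation replaces the full inverse by the contribution of the n lowest-lying eigenmodes (here n = 300); the exact Hermitian formulation of the operator whose spectrum is truncated is detailed in the paper:

```latex
M^{-1} \;\approx\; \sum_{i=1}^{n} \frac{1}{\lambda_i}\, v_i\, v_i^{\dagger},
\qquad M v_i = \lambda_i v_i, \quad n = 300.
```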
https://doi.org/10.1142/9781860949630_0007
This paper deals with a parallel approach to the verification of consistency aspects of an industrial product configuration data base. The data base we analyze is used by DaimlerChrysler to check the orders for cars and commercial vehicles of their Mercedes lines. By formalizing the ordering process and employing techniques from symbolic computation we could establish a set of tools that allow the automatic execution of huge series of consistency checks, thereby ultimately enhancing the quality of the product data. However, occasional occurrences of computation intensive checks are a limiting factor for the usability of the tools. Therefore, a prototypical parallel re-implementation using our Distributed Object-Oriented Threads System (DOTS) was carried out. Performance measurements on a heterogeneous cluster of shared-memory multiprocessor Unix workstations and standard Windows PCs revealed considerable speed-ups and substantially reduced the average waiting time for individual checks. We thus arrive at a noticeable improvement in usability of the consistency checking tools.
https://doi.org/10.1142/9781860949630_0008
This paper describes the effort taken to introduce a parallel multiblock structure into the URANUS simulation code for calculating reentry flows on structured C-meshes. The new data structure, the handling of block sides at physical and inner boundaries, and the load balancing approach are presented. The target is the efficient calculation of reentry flows using multiblock meshes on current and future supercomputer platforms.
https://doi.org/10.1142/9781860949630_0009
We describe a software environment for the coupling of different cellular automata. The system can couple CA for parallelization, can couple CA with different lattice sizes, different spatial or temporal resolutions, and even different state sets. It can also couple CA to other (possibly non-CA) simulations. The complete system, which is integrated with the CA simulation system JCASim, is written in Java and uses remote method invocation as the communication method. The parallel efficiency is found to be satisfactory for large enough simulations.
https://doi.org/10.1142/9781860949630_0010
The rigorous characterization of the behaviour of a radiobase antenna for wireless communication systems is a hot topic, both for antenna and communication system design and for radio-protection hazard reasons.
Such a characterization requires a numerical solution, and the Finite-Difference Time-Domain (FD-TD) approach is one of the most attractive candidates. It has strong memory and CPU-time requirements, and parallel computing is the suitable way to tackle this problem. In this work we discuss a parallel implementation of the FD-TD code, present the findings achieved on the APE/Quadrics SIMD massively parallel systems, and discuss results related to human exposure to the near field of real radiobase antennas.
Results clearly demonstrate that massively parallel processing is a viable approach to solve electromagnetic problems, allowing the simulation of radiating devices which could not be modeled through conventional computing systems: a detailed numerical human phantom is used to solve the human-antenna interaction problem, and the absorbed fields inside it are estimated. The achieved solution accuracy paves the way to a rigorous evaluation of the radiofrequency safety standards in a relevant class of occupational cases.
https://doi.org/10.1142/9781860949630_0011
The use of parallel processing to speed up Geographic Information System (GIS) algorithms and applications has been widely considered and documented. In this paper we present our experience and early results on the use of parallel computing to speed up tranquillity mapping, a methodology aimed at supporting the landscape assessment process and of interest to public administrations. The parallelisation of the tranquillity mapping algorithm is described. Since the final goal is to make the parallel algorithm usable in a public administration, aspects related to the use of a heterogeneous network of Personal Computers are addressed.
https://doi.org/10.1142/9781860949630_0012
Two quantum reactive scattering computational approaches have been analyzed for parallelization. The parallel structuring of the two codes has been carried out using both constructs based on calls to the MPI library and the skeletons defined by the SkIE coordination language.
https://doi.org/10.1142/9781860949630_0013
The scope of this paper is the presentation of a method for image reconstruction. Most interpolation methods, such as polynomial interpolation (with the special case of linear interpolation) or splines, are sensitive to sharp changes in the interpolated data. Essentially non-oscillatory (ENO) reconstruction adapts to the function gradient in a way that deals efficiently with sharp edges, which are quite common in typical images built from discrete data. The drawback of the method is its relatively complex algorithm, which needs a lot of logical operations. Thanks to the parallelisation of ENO, which improves the efficiency of the method, this high-accuracy interpolation shows itself to be a reasonable alternative to other types of interpolation.
https://doi.org/10.1142/9781860949630_0014
This paper describes a parallel mapping scheme for the gradient-descent learning algorithm. The problem we are dealing with is time-series forecasting by means of GRBF networks, so that (i) the network has one neuron in the output layer but, at the same time, (ii) the memory of each processor can (typically) hold the whole training set. A recently proposed mapping scheme seems to be the optimal one (the only one?) for speeding up the learning. The approach used is synchronous, so that both SIMD and MIMD parallel computers can be used for actual implementations. Results on a MIMD parallel computer are shown and commented upon.
https://doi.org/10.1142/9781860949630_0015
The simulation of landslide hazards is a key point in the prevention of natural disasters, since it enables the computation of risk maps and helps in the design of protection works. We present a parallel simulator of debris/mud flows developed with a problem-solving environment, called CAMELot, which allows interactive simulation and steering of parallel cellular computations. CAMELot is a system that uses the cellular automata formalism to model and simulate dynamic complex phenomena on parallel machines. It combines simulation, visualisation, control and parallel processing into one tool that allows users to interactively explore a simulation, visualise the state of the computation as it progresses, and change parameters, resolution or representation on the fly. In the paper, we give an overview of the CAMELot system and show its practical use for programming and steering the landslide models developed in the project. Moreover, an evaluation of the performance of the simulator on the MEIKO CS-2 parallel machine is presented.
https://doi.org/10.1142/9781860949630_0016
The study performed focuses on the development and application of a parallel multiphase fluid dynamic code for the simulation of the transient 2D dynamics of pyroclastic flows, one of the most hazardous phenomena occurring during explosive volcanic eruptions. The parallelization strategy adopted is based on a domain decomposition of the real space grid using a SPMD paradigm within a message passing scheme. The speed-up of the parallel code has been tested against the number of processors, the size of the computational grid, the computational platform, and the domain decomposition criterion. The results obtained show a good scalability and portability of the parallel code and indicate a promising 3D extension of the model.
https://doi.org/10.1142/9781860949630_0017
Parallel computation techniques may bring considerable savings in time in the computer simulation of turbomachinery flows. The implicit Navier-Stokes solver XFLOS, operating in an MPI environment, takes advantage of the peculiar features of these flows. This is attained by decomposing the fluid domain into up to 16 streamwise subdomains to be solved in parallel. The explicit evaluation of spanwise fluxes at block interfaces slows down the convergence rate for the highest numbers of processors. The convergence rate of single-processor runs may be restored by a sub-iterative solution of the implicit system in the spanwise direction. In this case, the reduction in the number of iterations needed to reach a fixed residual level was found to more than compensate for the additional effort per iteration, and with 16 processors a speed-up of 12 at a fixed residual level was achieved. The algorithm was successfully tested for the prediction of centrifugal compressor impeller performance.
https://doi.org/10.1142/9781860949630_0018
This paper describes a newly developed code, iMPaIR, which performs iterative image deconvolution in parallel. The basic algorithm used is the Richardson-Lucy maximum-likelihood iterative procedure. Additionally, iMPaIR can use a spatially-variant point spread function in the deconvolution process. The basic Richardson-Lucy algorithm is described, as well as details of the parallel implementation. Applications and results in the areas of astrophysical imaging and medical X-ray imaging are briefly discussed. In the medical field such restoration algorithms are impractical on single processors: computation time should be measured in seconds rather than hours. We show that for this type of application a small number of processors should be able to analyse a full X-ray image (~ 3730×3062 pixels) in less than a minute.
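For reference, the standard spatially-invariant form of the Richardson-Lucy update for an observed image g, point spread function h and current estimate f^(k) is shown below; the paper's spatially-variant PSF generalises the convolutions to position-dependent kernels:

```latex
f^{(k+1)} \;=\; f^{(k)} \cdot \left[ \left( \frac{g}{\,h \otimes f^{(k)}\,} \right) \otimes \tilde{h} \right],
\qquad \tilde{h}(x) = h(-x),
```

where ⊗ denotes convolution.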
https://doi.org/10.1142/9781860949630_0019
In this paper we propose new asynchronous and parallel algorithms for the problem of reconstruction from a total image. They may be realized on massively parallel computing systems consisting of independent elementary processors and a central processor. A computer simulation of these algorithms was conducted, with application to the problem of image reconstruction from interferograms. Experimental results of the computer simulation, comparing the error estimates and the rate of convergence of these algorithms, are presented.
https://doi.org/10.1142/9781860949630_0020
In this paper, we present the backbone of a parallel flood simulation, which often leads to the solution of large sparse systems of partial differential equations. Parallel simulation is essential to achieve satisfactory accuracy compared with CPU-time-consuming sequential methods. Our experimental results for the parallel numerical solution of flood simulation were obtained on Linux clusters using the cyclic reduction method. For comparison, we also provide experimental results obtained on a DSM SGI Origin2000. Our measurements show that a Linux cluster can provide satisfactory power for parallel numerical solutions, especially for large problems.
https://doi.org/10.1142/9781860949630_0021
The Xyce™ Parallel Electronic Simulator has been written to support the simulation needs of the Sandia National Laboratories electrical designers. As such, the development has focused on providing the capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). In addition, we are providing improved performance for numerical kernels using state-of-the-art algorithms, support for modeling circuit phenomena at a variety of abstraction levels, and object-oriented coding practices that ensure maintainability and extensibility far into the future.
The code is a parallel code in the most general sense of the phrase – a message passing parallel implementation – which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms [1]. Furthermore, careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved even as the number of processors grows. Thus, it is able to solve circuit problems of an unprecedented size. This paper summarizes the code architecture and some of the novel heuristics and algorithms used to address these large problems and the parallel infrastructure.
https://doi.org/10.1142/9781860949630_0022
More than a decade ago MSC offered the first parallel production system of MSC.NASTRAN. During this decade MSC has intensified its effort on parallel processing and is now ready to deliver MSC.NASTRAN V70.7 and MSC.MARC V 9, both of which contain very important new parallel features.
This paper describes these exciting features and provides preliminary performance results. We believe that these systems mark the best in parallel performance in commercial finite element analysis ever and present a breakthrough in parallel computing in our market.
https://doi.org/10.1142/9781860949630_0023
A Car-Parrinello Molecular Dynamics code has been implemented on two different Linux-based parallel platforms, based on Alpha processors. The clusters are characterized by state-of-the-art interconnection networks, enabling high throughput and low latency. The performance and the scalability of the code are reported in detail, together with the analysis of the performance of the most intensive computational tasks. Results indicate a good scalability of the code on the platforms and the adequacy of the interconnection networks to sustain a large workload.
https://doi.org/10.1142/9781860949630_0024
Computational genomics is gaining increasing interest in the scientific community. Genomic and proteomic analyses are mainly based on character processing, so standard processors are not well suited to the computations required, since the largest part of their silicon is devoted to supporting memory management and floating/fixed-point numerical computations. In this work we present the PRAN (PRotein ANalyser) architecture, purposely designed to recognize the occurrences of a given short sequence of characters (m-peptide) within a large proteomic sequence. PRAN has been designed using the automatic parallel hardware generator developed in our research centre within the framework of the HADES project (HArdware DEsign for Scientific applications). Thanks to the high degree of parallelism exploited within the architecture, PRAN is able to reach high computational efficiency. PRAN has been implemented on a prototyping board equipped with a Xilinx Virtex XV1000 FPGA. Test comparisons with standard commodity off-the-shelf processors showed that PRAN is nearly one order of magnitude faster.
https://doi.org/10.1142/9781860949630_0025
This paper describes the architecture of MOSE (My Own Search Engine), a scalable parallel and distributed engine for searching the web. MOSE was specifically designed to efficiently exploit affordable parallel architectures, such as clusters of workstations. Its modular and scalable architecture can be easily adjusted to fulfill the bandwidth requirements of the application at hand. Both task-parallel and data-parallel approaches are exploited within MOSE in order to increase throughput and to use communication, storage and computational resources efficiently. We used a collection of HTML documents as a benchmark and conducted preliminary experiments on a cluster of three SMP Linux PCs.
https://doi.org/10.1142/9781860949630_0026
Segmentation is a fundamental task in image processing, but its heavy computational requirements represent a drawback that is not easily overcome, especially for tasks in which real-time computing is a primary requirement, such as critical medical applications. In this paper a parallel algorithm for image segmentation is proposed, which allows fast convergence of the active contour algorithm. A parallel computing approach to this problem has led to the development of a multithreaded application, based upon a parallel algorithm, that reduces processing time. The algorithm has been implemented on a shared-memory multiprocessor machine, using the ANSI C++ language and POSIX Threads libraries in order to exploit code portability. Two basic schemes have been proposed for the development of parallelism, and performance issues have been analyzed for both implementations.
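A minimal sketch of the shared-memory decomposition described above, using POSIX threads to split the contour points among workers for one update step; the update body is a placeholder, not the paper's active-contour energy computation:

```c
#include <pthread.h>
#include <stdio.h>

#define NPOINTS  1024
#define NTHREADS 4

static double x[NPOINTS], y[NPOINTS];   /* contour points (shared) */

typedef struct { int first, last; } range_t;

/* Placeholder update: each thread moves its slice of contour points.
   A real implementation would evaluate internal/external energy terms here. */
static void *update_slice(void *arg)
{
    range_t *r = arg;
    for (int i = r->first; i < r->last; ++i) {
        x[i] += 0.01;                    /* hypothetical displacement */
        y[i] -= 0.01;
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    range_t   rng[NTHREADS];
    int chunk = NPOINTS / NTHREADS;

    for (int t = 0; t < NTHREADS; ++t) {
        rng[t].first = t * chunk;
        rng[t].last  = (t == NTHREADS - 1) ? NPOINTS : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, update_slice, &rng[t]);
    }
    for (int t = 0; t < NTHREADS; ++t)
        pthread_join(tid[t], NULL);

    printf("one parallel contour update completed\n");
    return 0;
}
```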
https://doi.org/10.1142/9781860949630_0027
An unstructured finite volume solver in 3D has been parallelized with OpenMP. The code is used by the Swedish industry for electromagnetic computations. Results from a grid around a small aircraft show good scaling on a SUN Enterprise 6000 server with 16 processors. A parallelization strategy for a distributed memory model is also presented and some preliminary results are given.
https://doi.org/10.1142/9781860949630_0028
This paper deals with the parallelization of one of the most popular three-dimensional oceanographic models: the Princeton Ocean Model (POM). The parallelization is achieved using standard tools like MPI and OpenMP, to ensure portability of the code. A robust and efficient domain decomposition method is used to solve, in principle, large-scale problems on clusters of shared memory machines. A comparison between the pure MPI and the hybrid MPI+OpenMP versions is shown in terms of elapsed time for a standard seamount problem, varying the number of grid points and CPUs.
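A minimal sketch of the hybrid pattern being compared, with MPI decomposing the horizontal domain across processes and OpenMP threading the loops inside each subdomain; this is a generic halo-exchange stencil skeleton under assumed array sizes, not the POM code itself:

```c
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define NX_LOCAL 256      /* local grid rows per MPI process (illustrative) */
#define NY       256

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local subdomain with one halo row on each side */
    double (*u)[NY]  = calloc(NX_LOCAL + 2, sizeof *u);
    double (*un)[NY] = calloc(NX_LOCAL + 2, sizeof *un);

    /* exchange halo rows with neighbouring subdomains (periodic for brevity) */
    int up = (rank + 1) % size, down = (rank - 1 + size) % size;
    MPI_Sendrecv(u[NX_LOCAL], NY, MPI_DOUBLE, up,   0,
                 u[0],        NY, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(u[1],            NY, MPI_DOUBLE, down, 1,
                 u[NX_LOCAL + 1], NY, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* OpenMP threads update the interior of the local subdomain */
    #pragma omp parallel for
    for (int i = 1; i <= NX_LOCAL; ++i)
        for (int j = 1; j < NY - 1; ++j)
            un[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);

    free(u); free(un);
    MPI_Finalize();
    return 0;
}
```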
https://doi.org/10.1142/9781860949630_0029
Molecular biologists frequently compare an unknown protein sequence with a set of other known sequences (a database scan) to detect functional similarities. Even though efficient dynamic programming algorithms exist for the problem, the required scanning time is still very high, and because of the exponential database growth finding fast solutions is of highest importance to research in this area. In this paper we present an approach to high-speed biosequence database scanning on the Fuzion 150, a new parallel computer with a linear SIMD array of 1536 processing elements on a single chip. This results in an implementation with significant runtime savings.
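Such database scans are typically built on a Smith-Waterman-style dynamic programming recurrence (the paper does not spell out the exact variant used); since all cells on an anti-diagonal are mutually independent, they map naturally onto a linear SIMD array of processing elements:

```latex
H_{i,j} \;=\; \max\left\{\, 0,\;
  H_{i-1,j-1} + s(a_i,b_j),\;
  H_{i-1,j} - d,\;
  H_{i,j-1} - d \,\right\},
```

where s is the substitution score and d a (linear) gap penalty.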
https://doi.org/10.1142/9781860949630_0030
A program complex is presented for the numerical simulation of 3D essentially unsteady flows of viscous, compressible, heat-conducting gas. The complex is based on explicit kinetically consistent finite difference schemes and is designed for parallel computers. It has been tested on different types of multiprocessor computer systems. The results of a test problem solution are described.
https://doi.org/10.1142/9781860949630_0031
Although lossless JPEG is inherently a sequential algorithm, it must be parallelized in order to achieve high-speed lossless compression for medical images, including motion pictures (ultrasound images) and CT scan images. We propose a scalable CODEC algorithm based on two conventional approaches to parallelizing lossless JPEG: the wavefront and the subframe approach. We also show a preliminary performance prediction of the algorithm using an experimental CODEC model and discuss a beneficial side effect of the subframe approach.
https://doi.org/10.1142/9781860949630_0032
Simulating complex systems interactively yields insights which are inaccessible to theory or experiment. With increasing complexity of a system, however, both the space and the time complexity of a simulation rise; therefore, interactive simulation calls for parallel processing. As an application we present the simulation of semiconductor devices, in which geometry, doping, and material parameters interact in a complex manner. A short discussion of the physical effects found in modern semiconductor devices motivates the three-dimensional hydrodynamic model equations. Using a transformation of the natural variables allows us to apply fast elliptic solvers. In order to speed up the solution process, we reduce the number of unknowns by adaptive local grid refinements. Finally, parallel processing leads to an additional speed-up which turns out to be scalable over a wide range of processors.
https://doi.org/10.1142/9781860949630_0033
This paper focuses on the advantages deriving from the use of P-AutoClass, a MIMD parallel version of the AutoClass system, for parallel data clustering. AutoClass is an unsupervised data mining system that finds optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges intermediate results with the other processors. In particular, we point out that a main reason for P-AutoClass's scalability is that communication costs do not depend on the size of the data to be analyzed. In fact, our analysis shows that the overhead due to data exchange among processors depends on the number of attributes and classes but not on the data set size. We present theoretical considerations that support this feature of P-AutoClass and illustrate some experiments showing this behavior.
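In schematic terms (using our own symbols, not the paper's), the claim is that the per-iteration communication volume grows with the model size only:

```latex
V_{\mathrm{comm}} \;=\; O(P \cdot K \cdot A), \qquad \text{independent of } N,
```

where P is the number of processors, K the number of classes, A the number of attributes, and N the number of records in the data set.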
https://doi.org/10.1142/9781860949630_0034
This paper describes the parallelization of a numerical solver for a class of quadratic matrix equations arising in the analysis and design of periodic linear control systems. Experimental results on a cluster of Intel PII processors, connected via a Gigabit switch and a Fast Ethernet switch, report the high performance and scalability of the approach.
https://doi.org/10.1142/9781860949630_0035
The Jacobi-Davidson (JD) algorithm was recently proposed for evaluating a number of eigenvalues of a matrix close to a given value. JD goes beyond pure Krylov-space techniques by wisely expanding the search space via the solution of a so-called correction system, thus in principle providing a more powerful method. Preconditioning the Jacobi-Davidson correction equation is mandatory when large, sparse matrices are analyzed. We considered several preconditioners: classical block-Jacobi and IC(0), together with approximate inverse (AINV) preconditioners. We found that JD is highly sensitive to preconditioning. The rationale for using AINV preconditioners is their high parallelization potential, combined with their efficiency in accelerating JD convergence. We parallelized JD by a data-splitting approach, combined with techniques for reducing the amount of communicated data. We show that our code is scalable and achieves an appreciable degree of parallelism.
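For reference, the correction equation referred to above is the standard Jacobi-Davidson one, solved approximately (with a preconditioner) at each outer iteration for the current Ritz pair (θ, u) with residual r = Au − θu:

```latex
(I - u u^{*})\,(A - \theta I)\,(I - u u^{*})\, t \;=\; -r,
\qquad t \perp u.
```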
https://doi.org/10.1142/9781860949630_0036
This article describes our experience in incorporating uncoordinated checkpointing and recovery facilities into a Java-based metacomputing system. Our case study is SUMA, a metasystem for the execution of Java bytecode, both sequential and parallel. We have incorporated into SUMA an algorithm that implements a communication-induced checkpointing protocol. This algorithm induces processes to take additional local (forced) checkpoints and to log all in-transit messages to ensure consistent global checkpoints. Preliminary results on the performance overhead produced by the implementation of this algorithm are presented.
https://doi.org/10.1142/9781860949630_0037
An efficient parallel algorithm for the 0-1 knapsack problem is presented. The algorithm is based on dynamic programming in conjunction with dominance-of-states techniques. An original load balancing strategy, designed to achieve good performance of the parallel algorithm, is proposed. To the best of our knowledge, this is the first time a load balancing strategy has been considered for this type of algorithm. Finally, computational results obtained on an Origin 2000 supercomputer are displayed and analyzed.
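A serial sketch of the dominance-of-states recurrence underlying the algorithm, maintaining a list of non-dominated (weight, profit) states; the paper's contribution, distributing this step with a dedicated load balancing strategy, is not shown here:

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { int w, p; } state_t;   /* (weight, profit) pair */

/* Merge the current state list with the list shifted by item (wi, pi),
   keeping only non-dominated states (weights and profits strictly increasing). */
static int add_item(const state_t *cur, int n, int wi, int pi, int cap, state_t *out)
{
    state_t *shift = malloc(n * sizeof *shift);
    int ns = 0;
    for (int i = 0; i < n; ++i)
        if (cur[i].w + wi <= cap)
            shift[ns++] = (state_t){ cur[i].w + wi, cur[i].p + pi };

    int i = 0, j = 0, m = 0, best_p = -1;
    while (i < n || j < ns) {
        int take_cur;
        if (i >= n)                           take_cur = 0;
        else if (j >= ns)                     take_cur = 1;
        else if (cur[i].w != shift[j].w)      take_cur = (cur[i].w < shift[j].w);
        else                                  take_cur = (cur[i].p >= shift[j].p);
        state_t s = take_cur ? cur[i++] : shift[j++];
        if (s.p > best_p) {                   /* dominance test */
            out[m++] = s;
            best_p = s.p;
        }
    }
    free(shift);
    return m;
}

int main(void)
{
    /* small illustrative instance */
    int w[] = { 3, 4, 5, 2 }, p[] = { 4, 5, 6, 3 }, cap = 9, nitems = 4;

    int maxstates = 1 << nitems;              /* crude bound, enough for the sketch */
    state_t *a = malloc(maxstates * sizeof *a);
    state_t *b = malloc(maxstates * sizeof *b);
    a[0] = (state_t){ 0, 0 };
    int n = 1;

    for (int k = 0; k < nitems; ++k) {
        n = add_item(a, n, w[k], p[k], cap, b);
        state_t *tmp = a; a = b; b = tmp;     /* swap lists */
    }
    printf("optimal profit = %d\n", a[n - 1].p);

    free(a); free(b);
    return 0;
}
```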
https://doi.org/10.1142/9781860949630_0038
The dynamics equations of robot manipulators are nonlinear and coupled. One way to solve the motion control problem is to use an inverse dynamics control algorithm, which requires full knowledge of the dynamics of the system. Such knowledge is unusual in industrial robots, so adaptive control is used to identify the unknown parameters. In general, the linear regressor is based on the linear relationship between the unknown inertial parameters and their coefficients. A formulation that generalizes this relationship is applied to the Slotine-Li adaptive algorithm. The objective of this paper is to reduce computing time by using parallel algorithms.
https://doi.org/10.1142/9781860949630_0039
This paper describes the direct solver for general sparse matrices in Watson Sparse Matrix Package (WSMP). In addition to incorporating the best ideas from a number of other solvers into one package, the implementation of Gaussian Elimination with Partial Pivoting (GEPP) in WSMP introduces several new techniques. Our experiments on a suite of 25 large industrial problems show that the GEPP implementation in WSMP is about two and a half times faster than the best currently available similar software.
https://doi.org/10.1142/9781860949630_0040
Retrieval with a priori extracted features limits the applicability of image databases, as a detailed search for important objects cannot be performed. Dynamic retrieval solves this problem, but it also requires immense computational resources and can only be handled by parallel architectures. A simulation and an analysis of the main parallel database architectures have shown that clusters and symmetric multiprocessors can be suitable for different aspects of image databases with dynamic retrieval, depending on the scope of the intended application and the number of images managed. In this paper a prototype for the realisation of an image database on a symmetric multiprocessor system is introduced.
https://doi.org/10.1142/9781860949630_0041
Matching Pursuit Projection (MPP) is an approach to compression based on non-orthogonal transformation. As with fractal compression, this type of encoding comes at the price of enormous computational complexity. Despite its computational demands, MPP image and video coding has proven to be competitive in terms of rate-distortion performance. In this work we discuss approaches to parallelizing a Matching Pursuit image coder. Different granularity levels and MIMD programming paradigms are compared with respect to their efficiency.
https://doi.org/10.1142/9781860949630_0042
A parallel implementation of a quasi-Newton algorithm to solve nonlinear optimization problems with linear constraints on a distributed memory multiprocessor is presented in this work. We propose parallel algorithms for the stages which take most of the computational time in the process. Results that show the high efficiency of our proposals are presented.
https://doi.org/10.1142/9781860949630_0043
In the dynamic analysis of a structure, quite often a small number of the smallest eigenvalues and eigenmodes of a large and sparse general eigenvalue problem have to be determined, assuming that good approximations to the demanded eigenmodes are available from previous computations. In this situation, powerful approaches like the Lanczos method or Jacobi-Davidson may be inferior to methods which in general are known to be slower. In this note we demonstrate this effect by comparing P_ARPACK, the parallel version of ARPACK, with a parallel condensation method with general masters for a finite element model of a container ship.
https://doi.org/10.1142/9781860949630_0044
Stochastic Linear Programming (SLP) is an effective tool for dealing with problems in which some of the input data are uncertain. These problems are typically characterized by a very large number of variables and constraints, and conventional computational resources are usually inadequate for their solution. Parallel systems are required in order to achieve a high level of efficiency in reasonable time. In this paper we propose several MPI and OpenMP implementations, on the basis of which we develop a two-level parallel solver that takes advantage of the features offered by both standards. Experimental results show that the two-level solver can be faster than any of the basic algorithms.
https://doi.org/10.1142/9781860949630_0045
FDEM (Finite Difference Element Method) is a black-box solver for nonlinear systems of elliptic and parabolic PDEs. This very complex code is run on many different parallel computers. The number of processors is selected so that nearly equal theoretical peak performance results. The discussion reveals interesting properties of the code and of the computers.
https://doi.org/10.1142/9781860949630_0046
The principal problem of parallel computing, the degradation of performance by inter-node communication, is aggravated on modular cluster architectures with off-board communication cards. In this contribution we consider so-called N-squared problems, the prototype being the N-body computation with long-range interaction. We demonstrate that hyper-systolic algorithms are able to enhance the performance of such computations on parallel machines by alleviating the communication bottleneck, at the cost of moderately increased memory. We determine the efficiency of implementations on the Alpha-Linux-Cluster-Engine ALiCE at Wuppertal University compared to the SIMD system APE100.
https://doi.org/10.1142/9781860949630_0047
No abstract received.
https://doi.org/10.1142/9781860949630_0048
Modern technology provides the infrastructure required to develop distributed applications able to use the power of multiple supercomputing resources and to exploit their diversity. The performance potential offered by distributed supercomputing is enormous, but it is hard to realize due to the complexity of programming in such environments. In this paper we present an approach designed to overcome this challenge, based on the Common Object Request Broker Architecture (CORBA), a successful industry standard. This approach consists in the development of a dynamic graph of distributed objects. Each of these objects represents a small encapsulated application and can be used as a building block in the construction of powerful distributed meta-applications.
https://doi.org/10.1142/9781860949630_0049
Image databases are characterised by the huge amount of data they contain, which can only be handled by parallel architectures. The requirements on the systems depend on their scope and necessitate a detailed analysis of the application and of the employed architecture, respectively. This paper investigates and compares the usability and efficiency of three different parallel systems, with emphasis on applications in the field of image databases. A simulation model was developed which allows the flexible design and assessment of arbitrary parallel architectures.
https://doi.org/10.1142/9781860949630_0050
We propose the addition of a shared object abstraction to a skeleton-based parallel programming language, to deal with large shared data in dynamically behaving data-driven computations, which are commonly found in the field of Data Mining. We report parallel speed-up results for a task-parallel C4.5 classifier using a Shared Tree object. With respect to this implementation, we analyse the positive impact of the methodology on the fundamental characteristics of the language: expressiveness, programmability, and efficiency of applications.
https://doi.org/10.1142/9781860949630_0051
In this article, we compare two approaches for detecting data races in parallel shared memory programs: access histories and segment histories. We show that access histories are preferable in theory since they provide more context about the data races. In practice, however, testing access histories using the Splash II benchmark, we found that access histories consume too much memory to be applicable to general programs. We also implemented access histories in TRaDe, our race detection system for Java. Using a novel approach that takes into account the type information still present while executing Java programs, we are able to improve the behavior of access histories so as to make them preferable to segment histories.
https://doi.org/10.1142/9781860949630_0052
Algorithmic skeletons and parallel design patterns have been developed in completely disjoint research frameworks, both aimed at providing the programmer of parallel applications with an effective programming environment. Different authors now recognize the concrete similarities between the two approaches. In this work we try to summarize and assess the common points and differences between them. From the analysis we conclude that by merging the two concepts we can develop very effective parallel programming environments. We outline the features of a new parallel programming environment inheriting from both the skeleton and the design pattern related research.
https://doi.org/10.1142/9781860949630_0053
As a consequence of the growth of the Internet, new software infrastructures for distributed computing have become an effective support to solve large-scale problems by exploiting the available networked computing power. This paper presents such an infrastructure, a first definition of a Java middleware for metacomputing, called HiMM, whose main goal is to support parallel and distributed programming based on objects in heterogeneous environments. HiMM enables dynamic applications to run on collections of nodes virtually arranged according to a hierarchical network topology, mapped on wide-area physical networks of computers or clusters of computers.
https://doi.org/10.1142/9781860949630_0054
Many computationally intensive problems in engineering and science, such as those driven by Partial Differential Equations (PDEs), give rise to the solution of large and sparse linear systems of equations. Fast and efficient methods for their solution are very important because these systems usually occur in the innermost loop of the computational scheme. Parallelization is often necessary to achieve an acceptable level of performance. We report on our experience in applying a library of sparse linear algebra numerical software for parallel distributed memory machines; we experimented with it from inside a well-known fluid dynamics application code for engine simulation on Linux clusters.
https://doi.org/10.1142/9781860949630_0055
High performance computer systems based on commercial off-the-shelf components, namely Beowulf class systems, have been widely adopted to develop and test scientific and engineering applications. Each computing node of such a distributed memory multicomputer is a PC or a workstation connected to the others within a system area network. In this paper we show how to harness the computing power of the accelerated graphics cards available in each node to realize an efficient, portable and scalable software library for parallel hardware rendering. A detailed performance analysis of a cluster of graphics accelerators, based on the evaluation of a distributed memory VRML (Virtual Reality Modeling Language) browser, is given in terms of execution times and well-known performance parameters.
https://doi.org/10.1142/9781860949630_0056
In this work we apply a quantitative data locality model, introduced in a previous paper, to a basic algebra kernel: the transposition of a sparse matrix (SpMT). The locality model is validated on the cache memory, as this is a level where the impact of locality is high and easily measurable. Matrices with a broad range of patterns have been used for the evaluation. Based on this model, and in order to increase locality, we have modified the layout of data in memory by applying heuristic techniques based on graphs. These techniques have been compared with some standard ordering algorithms.
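For concreteness, a standard sequential CSR-to-CSC transposition kernel (essentially a counting sort over column indices) is sketched below; the scattered writes in the final loop are the main source of the poor locality that the data-layout reordering targets. This is a generic kernel, not the paper's code:

```c
#include <stdlib.h>

/* Transpose an n_rows x n_cols sparse matrix from CSR (row_ptr, col_idx, val)
   into CSC (ct_ptr, ct_row, ct_val), which is also the CSR form of A^T. */
void sptranspose(int n_rows, int n_cols, int nnz,
                 const int *row_ptr, const int *col_idx, const double *val,
                 int *ct_ptr, int *ct_row, double *ct_val)
{
    /* count non-zeros per column */
    for (int j = 0; j <= n_cols; ++j) ct_ptr[j] = 0;
    for (int k = 0; k < nnz; ++k) ct_ptr[col_idx[k] + 1]++;

    /* prefix sum gives the column start offsets */
    for (int j = 0; j < n_cols; ++j) ct_ptr[j + 1] += ct_ptr[j];

    /* scatter entries; 'next' tracks the insertion point of each column */
    int *next = malloc(n_cols * sizeof *next);
    for (int j = 0; j < n_cols; ++j) next[j] = ct_ptr[j];

    for (int i = 0; i < n_rows; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            int dst = next[col_idx[k]]++;
            ct_row[dst] = i;          /* irregular, pattern-dependent write */
            ct_val[dst] = val[k];
        }
    free(next);
}
```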
https://doi.org/10.1142/9781860949630_0057
In this paper we describe a code generator prototype that uses parameterized task graphs (PTGs) as an intermediate model and generates multithreaded code. A PTG is a compact, problem-size-independent representation of some task graphs found in scientific programs. We show how, with this model, we can generate a parallel program that is scalable (it is able to execute millions of tasks) and generic (it works for all parameter values).
https://doi.org/10.1142/9781860949630_0058
This article describes an experimental parallel program starter for large clusters of personal computers (PCs). The starting of programs is parallelized by broadcasting a remote execution request. Broadcasts are performed using a spanning tree. Several experiments were conducted with N-ary and binomial trees, the latter performing better as the size of the cluster increases.
https://doi.org/10.1142/9781860949630_0059
This paper shows the use of the MetaPL notation system for the development of parallel programs. MetaPL is an XML-based tag language that exploits XML extension capabilities to describe programs written in different programming paradigms, interaction models and programming languages. The possibility of including timing information in the program description promotes the use of performance analysis during software development. After a description of the architecture of MetaPL, its use to describe a simple example program and to obtain two particular program descriptions (views) is proposed as a case study.
https://doi.org/10.1142/9781860949630_0060
The efficient solution of large problems is an ongoing thread of research in scientific computing. An increasingly popular method of solving these types of problems is to harness disparate computational resources and use their aggregate power as if it were contained in a single machine. We describe a prototype system allowing the exchange of data and commands between different mathematical software kernels running on multiple processors of a cluster.
https://doi.org/10.1142/9781860949630_0061
This paper describes the initial results of applying structured parallel paradigms to optimise algorithms for semi-automatic digital motion picture restoration. We adopt a skeleton-based compile-time approach to optimise the frame delivery time and the sequence completion time. We use two meaningful image-processing algorithms as case studies for selecting appropriate elementary and composed stream-parallel and data-parallel constructs. As a trade-off between parallel efficiency and template complexity we propose to use a small set of basic paradigms (farm, pipe, map/reduce) and to allow just one level of composition (pipe and comp). Finally, we describe the implementation templates of the paradigms on a shared-memory architecture and present the results of some benchmarks on the parallel code obtained using the developed templates. The obtained results are consistent with the expected performance.
https://doi.org/10.1142/9781860949630_0062
Fast SIMD-parallel execution units are available in most modern microprocessors. They provide an internal parallelism degree in the range from 2 to 16 and can accelerate many data-parallel algorithms. In this paper the suitability of five different SIMD units (Intel's MMX and SSE, AMD's 3DNow!, Motorola's AltiVec and Sun's VIS) for the simulation of neural networks is compared. The appropriateness of the instruction sets for the fast implementation of typical neural network operations is analyzed and the results of an experimental performance study are presented.
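As an illustration of the kind of operation such units accelerate, the weighted sum at the core of a neuron's activation can be vectorised four single-precision values at a time with SSE intrinsics; this is a generic sketch (assuming the vector length is a multiple of 4), not code from the paper:

```c
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

/* Weighted sum (dot product) of a neuron: n must be a multiple of 4 here. */
float weighted_sum_sse(const float *w, const float *x, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(w + i),
                                         _mm_loadu_ps(x + i)));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

int main(void)
{
    float w[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    float x[8] = { 1, 1, 1, 1, 1, 1, 1, 1 };
    printf("%f\n", weighted_sum_sse(w, x, 8));   /* prints 36.000000 */
    return 0;
}
```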
https://doi.org/10.1142/9781860949630_0063
This work seeks to extend the applicability of distributed shared memory (DSM) systems by suggesting and testing an extended functionality, “User Redefinable Memory Semantics” (URMS), that may be added to most DSM systems. We show how URMS has been included in one DSM system, PastSet, and sketch the applicability and ease of use of URMS for a set of current applications. Finally, a micro-benchmark of URMS on PastSet shows performance gains of 79 times over a naive DSM model, 48 times over a tuple-based DSM model and 83% over MPI.
https://doi.org/10.1142/9781860949630_0064
Several scientific applications that were parallelized for the CRAY T3E family of massively parallel computers make best use of the hardware and software architecture and achieve a high degree of parallel efficiency [1]. This results from a well balanced design among components (21164 processor, chip-set including stream buffers, network router, GigaRing channel) that enables a process-to-process latency of 1 microsecond over shmem [2]. To provide a growth path to T3E users, COMPAQ has introduced the AlphaServer SC Series, and has based its supercomputing offering to the Department of Energy [3] and other scientific accounts on this technology. The SC building blocks are AlphaServer SMP nodes, and the interconnect, provided by Quadrics Supercomputer World, Ltd., is a multistage, fat-tree, cross-bar switch with headroom for growth in terms of node counts, Non-Uniform Memory Access SMP nodes, microprocessor design, adapter performance and number of rails per SMP node. The current implementation yields a process-to-process Direct Memory Access latency of 3 microseconds and a bidirectional bandwidth of 200 MB/sec, which result from integrating the fabric adapter with the kernel, thereby bypassing the operating system for message traffic. These numbers will scale up to 340 MB/sec using the 66 MHz PCI version, and to 500 MB/sec using the 133 MHz PCI-X version.
CRAY applications leverage the following: (i) all variables are 64-bit, (ii) efficient math routines from SCILIB, (iii) stream buffers that detect and prefetch small-stride reference streams, (iv) E-registers that provide latency hiding, non-unit-stride access capabilities, and barrier and fetch_and_op synchronization support, and (v) well balanced performance across data communication routines.
Compaq has created a software development environment including a Porting Assistant tool, symbolic (parallel) debuggers, and profiling tools that complement the base functionality of the GEM optimizing compilers [4] and of the libraries for performing math tasks, message passing, etc.
Finally, a T3E application was modified by increasing the amount of (redundant) calculations to reduce communications cost for best use of the current SC architecture implementation.
https://doi.org/10.1142/9781860949630_bmatter
Author Index.