In this paper, we present a simple lossless compression heuristic for gray-scale images. The main advantage of this approach is that it provides a highly parallelizable compressor and decompressor: it can be applied independently to each block of 8×8 pixels, achieving 80 percent of the compression obtained with LOCO-I (JPEG-LS), the current lossless standard in low-complexity applications. The compressed form of each block employs a header and a fixed-length code, and the sequential implementations of the encoder and decoder are 50 to 60 percent faster than LOCO-I.
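The abstract does not spell out the heuristic itself, but the stated block format (a per-block header plus a fixed-length code) can be illustrated with a minimal sketch: store each 8×8 block's minimum and the bit-width needed for the offsets in the header, then emit every pixel offset with exactly that width. All names here are illustrative; this is not the authors' actual coder.

```python
import numpy as np

def encode_block(block):
    """Encode one 8x8 block as (header, fixed-length payload).

    Header: the block minimum and the bit-width w needed for offsets.
    Payload: each pixel stored as (pixel - min) in exactly w bits.
    """
    lo = int(block.min())
    span = int(block.max()) - lo
    w = max(1, span.bit_length())          # fixed code length for this block
    bits = []
    for p in block.flatten():
        bits.extend((int(p) - lo) >> b & 1 for b in reversed(range(w)))
    return (lo, w), bits

def decode_block(header, bits):
    lo, w = header
    vals = []
    for i in range(0, 64 * w, w):          # read 64 fixed-width codes
        v = 0
        for b in bits[i:i + w]:
            v = (v << 1) | b
        vals.append(lo + v)
    return np.array(vals, dtype=np.uint8).reshape(8, 8)

block = np.random.randint(100, 140, size=(8, 8), dtype=np.uint8)
hdr, payload = encode_block(block)
assert (decode_block(hdr, payload) == block).all()
```

Because each block is encoded and decoded without reference to any other block, compressor and decompressor parallelize trivially over blocks, which is the property the abstract emphasizes.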
This paper presents the parallelization of the NOABL program and its implementation on the CRAY T3E parallel machine. The NOABL program is used to simulate wind fields over complex terrain. The program runs on processor grids of 2 to 50 processors. The results obtained demonstrate the value of parallelizing this program and of the SLOR method.
We implemented a parallel Swendsen–Wang algorithm for a 2D Ising system without magnetization in a host-node programming model. The simulations were performed on the Intel Hypercube iPSC/860. Our maximum number of updates per second on 32 nodes is three times as high as in the implementation by Stauffer and Kertész on the same machine. With 32 processors we reach half the speed of the simulations by Tamayo and Flanigan on 256 nodes of a CM-5. We discuss the non-equilibrium relaxation of the energy and the magnetization.
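For readers unfamiliar with the algorithm being parallelized, here is a minimal sequential sketch of one Swendsen–Wang update for the zero-field 2D Ising model (bonds frozen with probability 1 − e^(−2β) between aligned neighbors, union-find cluster identification, random cluster flips); the host-node decomposition of the paper is not shown.

```python
import numpy as np

def swendsen_wang_step(spins, beta, rng):
    """One Swendsen-Wang cluster update for the 2D Ising model
    without field, with periodic boundaries."""
    L = spins.shape[0]
    s = spins.ravel()
    p_bond = 1.0 - np.exp(-2.0 * beta)        # freeze aligned neighbors
    parent = list(range(L * L))

    def find(i):                               # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for x in range(L):
        for y in range(L):
            i = x * L + y
            for j in (((x + 1) % L) * L + y, x * L + (y + 1) % L):
                if s[i] == s[j] and rng.random() < p_bond:
                    parent[find(i)] = find(j)

    flip = rng.random(L * L) < 0.5            # flip each cluster with prob 1/2
    roots = np.array([find(i) for i in range(L * L)])
    s[flip[roots]] *= -1
    return spins

rng = np.random.default_rng(0)
spins = rng.choice(np.array([-1, 1], dtype=np.int8), size=(16, 16))
spins = swendsen_wang_step(spins, beta=0.44, rng=rng)
```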
The parallelization of Ising Monte Carlo simulations is discussed. It is shown that most theoretically interesting simulations are now amenable to trivial parallelization; that is, the Ising simulation is ideally parallelizable. Furthermore, the presently most efficient simulation algorithm for a single processor is itself a kind of trivial parallelization. Results of a non-equilibrium critical relaxation study are included as an example.
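The "trivial" parallelization referred to here runs independent replicas with different random seeds on separate processors and averages observables over replicas, which is exactly what a non-equilibrium relaxation study needs. A hedged sketch using Python multiprocessing (a stand-in for the paper's setting, with an ordinary Metropolis kernel):

```python
import numpy as np
from multiprocessing import Pool

def relaxation_run(seed, L=32, beta=0.4407, steps=100):
    """Single-spin-flip Metropolis relaxation from an ordered start;
    returns the magnetization trace m(t) for one independent replica."""
    rng = np.random.default_rng(seed)
    s = np.ones((L, L), dtype=np.int8)
    m = np.empty(steps)
    for t in range(steps):
        for _ in range(L * L):                # one sweep = L*L attempts
            x, y = rng.integers(0, L, size=2)
            nb = (s[(x+1) % L, y] + s[(x-1) % L, y]
                  + s[x, (y+1) % L] + s[x, (y-1) % L])
            dE = 2 * s[x, y] * nb
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                s[x, y] = -s[x, y]
        m[t] = s.mean()
    return m

if __name__ == "__main__":
    with Pool(8) as pool:                     # 8 independent replicas
        traces = pool.map(relaxation_run, range(8))
    m_avg = np.mean(traces, axis=0)           # replica-averaged m(t)
```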
For the solution of discretized ordinary or partial differential equations, it is necessary to solve systems of equations whose coefficient matrices have different sparsity patterns, depending on the discretization method; the finite element (FE) method, in particular, results in largely unstructured systems of equations. A frequently used iterative solver for such systems is the method of conjugate gradients (CG) with various preconditioners. On a multiprocessor system with distributed memory, the data distribution and the communication scheme, which depend on the data structures used, are of greatest importance for the efficient execution of this method. Here, a data distribution and a communication scheme are presented which are based on the analysis of the column indices of the non-zero matrix elements. The performance of the developed parallel CG method was measured on the distributed-memory system Intel iPSC/860 of the Research Centre Jülich with systems of equations from FE models. The parallel CG algorithm has been shown to be well suited for both regular and irregular discretization meshes, i.e. for coefficient matrices with very different sparsity patterns.
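A small sketch of how such a communication scheme can be derived from the column indices alone, assuming a contiguous row-block distribution (a simplification of the paper's setting; the helper names are ours):

```python
import numpy as np
from scipy.sparse import random as sprandom

def comm_pattern(A_csr, row_ranges, my_rank):
    """Determine, from the column indices of the locally owned rows,
    which vector entries must be received from each other process."""
    lo, hi = row_ranges[my_rank]
    local = A_csr[lo:hi]                      # owned row block (CSR slice)
    needed = np.unique(local.indices)         # all columns touched locally
    recv = {}
    for rank, (a, b) in enumerate(row_ranges):
        if rank == my_rank:
            continue
        ext = needed[(needed >= a) & (needed < b)]
        if ext.size:
            recv[rank] = ext                  # entries owned by `rank`
    return recv

A = sprandom(12, 12, density=0.3, format="csr", random_state=0)
ranges = [(0, 4), (4, 8), (8, 12)]            # three processes, 4 rows each
print(comm_pattern(A, ranges, my_rank=0))
```

In the matrix-vector product inside CG, each process then exchanges exactly these entries before multiplying its row block, so the communication volume adapts to the sparsity pattern, regular or irregular.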
We describe an efficient particle-mesh algorithm for the Connection Machine CM-5. Our method parallelizes well, and the computation time per time step decreases as the particles become more clustered. We achieve floating-point computation rates of 4–5 MFlops per processing node and total operation rates (floating-point and integer arithmetic plus communication) of 5–10 MOps per second per processing node. The rates scale almost linearly from 32 to 256 processors. Although some of what we discuss is specific to the CM-5, many aspects (e.g., the computation of the force on a mesh) are generic to all implementations, and other aspects (e.g., the algorithm for assignment of the density to the mesh) are useful on any parallel computer.
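As an illustration of the density-assignment step the authors single out as generic, here is a minimal cloud-in-cell (CIC) assignment on a 2D periodic mesh; this is a standard particle-mesh ingredient, not necessarily the authors' exact scheme:

```python
import numpy as np

def cic_density(positions, n_mesh):
    """Cloud-in-cell assignment of unit-mass particles to a 2D periodic
    mesh: each particle's mass is shared among its 4 nearest cells with
    bilinear weights."""
    rho = np.zeros((n_mesh, n_mesh))
    cell = positions * n_mesh                  # positions assumed in [0, 1)
    i0 = np.floor(cell).astype(int)
    f = cell - i0                              # fractional offsets in the cell
    for dx in (0, 1):
        for dy in (0, 1):
            w = ((f[:, 0] if dx else 1 - f[:, 0])
                 * (f[:, 1] if dy else 1 - f[:, 1]))
            np.add.at(rho,
                      ((i0[:, 0] + dx) % n_mesh, (i0[:, 1] + dy) % n_mesh),
                      w)
    return rho

pts = np.random.default_rng(1).random((1000, 2))
rho = cic_density(pts, 32)
assert np.isclose(rho.sum(), 1000.0)          # total mass is conserved
```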
A two-dimensional lattice-gas automata program was developed to analyze single- and two-phase flow behaviors, supporting the development of integrated software modules for mechanistic nuclear power plant simulations. The program provides single-color (FHP I, II, and III), two-color (immiscible lattice gas), and two-velocity methods, including a gravity-effect model. Parameter surveys were performed for the Karman vortex street, two-phase separation (for understanding flow regimes), and natural-circulation flow (for demonstrating passive reactor safety due to the chimney-structure vessel). In addition, two-dimensional lattice-Boltzmann equation programs were developed. For analyzing single-phase flow behavior, a lattice-Boltzmann-BGK program with multi-block treatment was developed. A parallelized finite-difference lattice-Boltzmann equation program was introduced to analyze boiling two-phase flow behaviors. Parameter surveys were performed for backward-facing step flow, the Karman vortex street, bent piping flow with and without obstacles (for piping-system applications), flow in porous media (for demonstrating porous debris coolability), Couette flow, and spinodal decomposition (to understand basic phase-separation mechanisms). All programs were parallelized using a domain decomposition method. A speedup of at least 25 on 32 processors demonstrated high parallelization efficiency. Application of these microscopic model simulations to hypothetical severe conditions in large plants is also discussed.
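For orientation, a minimal sequential D2Q9 lattice-Boltzmann-BGK collision-and-streaming step looks as follows; the multi-block, two-phase, and domain-decomposed parallel features described above are omitted:

```python
import numpy as np

# D2Q9 lattice: discrete velocities and their weights
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def lbgk_step(f, tau):
    """One BGK collision + streaming step on a periodic D2Q9 lattice.
    f has shape (9, nx, ny)."""
    rho = f.sum(axis=0)
    u = np.einsum('qi,qxy->ixy', c, f) / rho   # macroscopic velocity
    cu = np.einsum('qi,ixy->qxy', c, u)
    usq = (u ** 2).sum(axis=0)
    feq = w[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
    f += (feq - f) / tau                        # relax toward equilibrium
    for q in range(9):                          # periodic streaming
        f[q] = np.roll(f[q], shift=tuple(c[q]), axis=(0, 1))
    return f

f = np.ones((9, 64, 64)) * w[:, None, None]     # uniform fluid at rest
f = lbgk_step(f, tau=0.6)
```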
We present computational aspects of molecular dynamics calculations of thermal properties of diamond using the Brenner potential. Parallelization was essential in order to carry out these calculations on samples of suitable sizes. Our implementation uses MPI on a multi-processor machine such as the IBM SP2. Three aspects of parallelizing the Brenner potential are discussed in depth: its long-range nature, the need for different parallelization algorithms for forces and neighbors, and the relative expense of force calculations compared to that of data communication. The efficiency of parallelization is presented as a function of different approaches to these issues, as well as of cell size and the number of processors employed in the calculation. In the calculations presented here, information from almost half of the atoms was needed by each processor even when 16 processors were used. This made it worthwhile to avoid unnecessary complications by making data from all atoms available to all processors. Superlinear speedup was achieved for four processors (by avoiding paging) with 512-atom samples, and 5 ps trajectories were calculated (for 5120-atom samples) in 53 hours using 16 processors; 514 hours would have been needed to complete this calculation with a serial program. Finally, we discuss and make available a set of routines that enable MPI-based codes such as ours to be debugged on scalar machines.
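The replicated-data strategy the authors settle on (all atom data on every processor, only the force work divided) can be sketched with mpi4py; the toy pair potential below stands in for the Brenner potential, which is far more involved:

```python
import numpy as np
from mpi4py import MPI

def forces_replicated(pos, comm):
    """Replicated-data force evaluation: every rank holds all atom
    positions, computes forces for its slice of atoms only, then an
    Allreduce sums the partial force arrays. (Toy 1/r^2 repulsion,
    not the Brenner potential.)"""
    n = len(pos)
    rank, size = comm.Get_rank(), comm.Get_size()
    f_local = np.zeros_like(pos)
    for i in range(rank, n, size):             # round-robin work division
        d = pos[i] - pos                        # vectors to all other atoms
        r2 = (d ** 2).sum(axis=1)
        r2[i] = np.inf                          # no self-interaction
        f_local[i] = (d / r2[:, None] ** 1.5).sum(axis=0)
    f = np.zeros_like(pos)
    comm.Allreduce(f_local, f, op=MPI.SUM)      # every rank gets total forces
    return f

comm = MPI.COMM_WORLD
pos = np.random.default_rng(42).random((512, 3))  # same seed on every rank
f = forces_replicated(pos, comm)
```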
Programming with parallel skeletons is an attractive framework because it encourages programmers to develop efficient and portable parallel programs. However, extracting parallelism from sequential specifications and constructing efficient parallel programs using the skeletons are still difficult tasks.
In this paper, we propose an analytical approach to transforming recursive functions on general recursive data structures into compositions of parallel skeletons. Using static slicing, we have defined a classification of subexpressions based on their data-parallelism. Then, skeleton-based parallel programs are generated from the classification. To extend the scope of parallelization, we have adopted more general parallel skeletons which do not require the associativity of argument functions. In this way, our analytical method can parallelize recursive functions with complex data flows.
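The flavor of the transformation can be conveyed with a toy example: a sequential recursive function over a list is rewritten as a composition of map and reduce skeletons, whose map part is data-parallel. This Python sketch is only an analogy for the paper's analytical method:

```python
from functools import reduce
from multiprocessing import Pool

# Sequential specification: a recursive sum of squares.
def sum_sq(xs):
    if not xs:
        return 0
    return xs[0] * xs[0] + sum_sq(xs[1:])

# The same function as a composition of skeletons: reduce(+) . map(square).
# `map` is data-parallel; `reduce` needs an associative operator (+).
def square(x):
    return x * x

def parallel_sum_sq(xs):
    with Pool() as pool:
        return reduce(lambda a, b: a + b, pool.map(square, xs), 0)

if __name__ == "__main__":
    xs = list(range(500))
    assert sum_sq(xs) == parallel_sum_sq(xs)
```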
We propose the higher-order functional style for the parallel programming of algorithms. The functional language HDC, a subset of the language Haskell, facilitates the clean integration of skeletons into a functional program. Skeletons are predefined programming schemata with an efficient parallel implementation. We report on our compiler, which translates HDC programs into C+MPI, focusing on the design decisions we made. Two small examples, the n-queens problem and Karatsuba's polynomial multiplication, demonstrate the programming comfort and the speedup one can obtain.
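A hedged sketch of the skeleton idea behind such a compiler, in Python rather than HDC: a fixed-degree divide-and-conquer skeleton, instantiated with Karatsuba's three-way polynomial multiplication (all helper names are ours):

```python
def dc(is_trivial, solve, divide, combine):
    """Fixed-degree divide-and-conquer skeleton: a compiler's task is to
    map the recursion tree onto processors; here it runs sequentially."""
    def go(x):
        if is_trivial(x):
            return solve(x)
        return combine(x, [go(sub) for sub in divide(x)])
    return go

def poly_add(a, b):
    return [x + y for x, y in zip(a, b)]

def divide(ab):
    a, b = ab
    h = len(a) // 2
    return [(a[:h], b[:h]),                       # low * low
            (a[h:], b[h:]),                       # high * high
            (poly_add(a[:h], a[h:]), poly_add(b[:h], b[h:]))]

def combine(ab, subs):
    p0, p2, p1 = subs                             # Karatsuba recombination
    n = len(ab[0])
    h = n // 2
    mid = [x - y - z for x, y, z in zip(p1, p0, p2)]
    out = [0] * (2 * n - 1)
    for i, v in enumerate(p0): out[i] += v
    for i, v in enumerate(mid): out[i + h] += v
    for i, v in enumerate(p2): out[i + n] += v
    return out

karatsuba = dc(lambda ab: len(ab[0]) == 1,
               lambda ab: [ab[0][0] * ab[1][0]],
               divide, combine)

# (1+2x+3x^2+4x^3)(5+6x+7x^2+8x^3) = 5+16x+34x^2+60x^3+61x^4+52x^5+32x^6
print(karatsuba(([1, 2, 3, 4], [5, 6, 7, 8])))
```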
Metaprogramming is a paradigm for enhancing a general-purpose programming language with features catering for a special-purpose application domain, without a need for a reimplementation of the language. In a staged compilation, the special-purpose features are translated and optimised by a domain-specific preprocessor, which hands over to the general-purpose compiler for translation of the domain-independent part of the program. The domain we work in is high-performance parallel computing. We use metaprogramming to enhance the functional language Haskell with features for the efficient, parallel implementation of certain computational patterns, called skeletons.
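A toy illustration of the staged idea, with all names ours: a domain-specific preprocessor recognizes a skeleton call in the source text and expands it into host-language code before the general-purpose compiler sees the program:

```python
import re

# A toy domain-specific preprocessor: it recognizes the skeleton call
# `parmap(f, xs)` in the source text and expands it into host-language
# code using a process pool, handing the rest over unchanged.
SKELETON = re.compile(r"parmap\((\w+),\s*(\w+)\)")

def preprocess(src):
    header = "from multiprocessing import Pool\n"
    body = SKELETON.sub(r"Pool().map(\1, \2)", src)
    return header + body

user_program = """
def double(x):
    return 2 * x

data = list(range(10))
result = parmap(double, data)
print(result)
"""

print(preprocess(user_program))   # the expanded, domain-independent program
```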
Monte Carlo simulations are increasingly used in medical physics. In scintigraphic imaging, these simulations are used to model imaging systems and to develop and assess tomographic reconstruction algorithms and correction methods for improved image quantitation. In radiotherapy and brachytherapy, the goal is to accurately evaluate the dosimetry in complex phantoms and at tissue interfaces, where analytic calculations have shown their limits. The main drawback of Monte Carlo simulations is their long computing time. The aim of our research is to reduce the computing time by parallelizing a simulation across geographically distributed processors. The method is based on partitioning the random number generator (RNG) used in Monte Carlo simulations: the long series of numbers used by the sequential simulation is split into independent sub-sequences. Once the partitioning is done, a software application allows the user to generate automatically the files describing each part of the simulation. Another application then executes them on the DataGrid testbed through an API. All these steps are transparent to the user, who is presented with a web page that collects all the parameters necessary to launch the simulation and retrieve the results. Tests were performed to show, first, the reliability of the physical results obtained by concatenating the parallelized output data and, second, the time gained in job execution.
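One concrete way to realize the described RNG splitting (not necessarily the authors' partitioning method) is NumPy's SeedSequence spawning, which derives statistically independent child streams from one master seed, one per distributed job:

```python
import numpy as np

def split_streams(master_seed, n_jobs):
    """Split one master seed into statistically independent child
    streams, one per distributed job (NumPy SeedSequence spawning)."""
    children = np.random.SeedSequence(master_seed).spawn(n_jobs)
    return [np.random.default_rng(c) for c in children]

def mc_job(rng, n_samples):
    """Toy Monte Carlo kernel: estimate pi by rejection sampling."""
    xy = rng.random((n_samples, 2))
    return 4.0 * ((xy ** 2).sum(axis=1) < 1.0).mean()

streams = split_streams(master_seed=12345, n_jobs=8)
estimates = [mc_job(rng, 100_000) for rng in streams]  # one per grid node
print(np.mean(estimates))        # combined result, approximately pi
```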
While tree contraction algorithms play an important role in efficient parallel tree computation, it is difficult to develop such algorithms due to the strict conditions imposed on contracting operators. In this paper, we propose a systematic method of deriving efficient tree contraction algorithms from recursive functions on trees. We identify a general recursive form that can be parallelized into efficient tree contraction algorithms, and present a derivation strategy for transforming general recursive functions to the parallelizable form. We illustrate our approach by deriving a novel parallel algorithm for the maximum connected-set sum problem on arbitrary trees, the tree version of the well-known maximum segment sum problem.
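For reference, here is a sequential recursive specification of the maximum connected-set sum problem, the kind of starting point from which such a tree contraction algorithm would be derived (the parallel contraction itself is not shown):

```python
def max_connected_set_sum(tree):
    """`tree` is (weight, children). For each node, the best connected
    set containing that node extends into a child only if the child's
    best contribution is positive; the answer is the max over all nodes."""
    def go(node):
        w, children = node
        best_here = w          # best set containing this node
        best_anywhere = None   # best set in this subtree overall
        for c in children:
            c_here, c_any = go(c)
            best_here += max(0, c_here)
            best_anywhere = (c_any if best_anywhere is None
                             else max(best_anywhere, c_any))
        if best_anywhere is None or best_here > best_anywhere:
            best_anywhere = best_here
        return best_here, best_anywhere

    return go(tree)[1]

#        5
#       / \
#     -2   3
#     /
#    4
t = (5, [(-2, [(4, [])]), (3, [])])
print(max_connected_set_sum(t))   # 10: the connected set {5, -2, 4, 3}
```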
The design of fast arithmetic logic circuits is an important research topic for reversible and quantum computing. A special challenge in this setting is the computation of standard arithmetical functions without the generation of garbage.
Here, we present a novel parallelization scheme wherein m parallel k-bit reversible ripple-carry adders are combined to form a reversible mk-bit ripple-block carry adder with logic depth O(m + k), for a minimal logic depth of O(√(mk)) when m ≈ k, thus improving on the O(mk) logic depth of the mk-bit ripple-carry adder. The underlying mechanisms of the parallelization scheme are formally proven correct. We also show designs for garbage-less reversible comparison circuits.
We compare the circuit costs of the resulting ripple-block carry adder with those of known optimized reversible ripple-carry adders in measures of circuit delay, width, gate count, transistor count, and relative power efficiency, and find that the parallelized adder offers significant speedups at realistic word sizes with modest parallelization overhead.
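To see where the depth reduction comes from, consider this irreversible, carry-select-flavored sketch (Python integers standing in for bit vectors). Each k-bit block precomputes its result for both possible carry-ins, so only the block-level carry has to ripple; the reversible, garbage-less realization in the paper is considerably more subtle:

```python
def block_add(a, b, k, m):
    """Carry-select flavor of a ripple-block adder on mk-bit words:
    each k-bit block precomputes its sum for carry-in 0 and 1 (in
    parallel across blocks, depth O(k)); the block-level carries then
    ripple across the m blocks (depth O(m)), selecting the result."""
    mask = (1 << k) - 1
    sums = []
    for i in range(m):                        # independent per-block work
        ai = (a >> (k * i)) & mask
        bi = (b >> (k * i)) & mask
        s0 = ai + bi                          # sum and carry-out for cin = 0
        s1 = ai + bi + 1                      # sum and carry-out for cin = 1
        sums.append(((s0 & mask, s0 >> k), (s1 & mask, s1 >> k)))
    result, carry = 0, 0
    for i in range(m):                        # block carries ripple: O(m)
        s, carry = sums[i][carry]
        result |= s << (k * i)
    return result, carry

a, b = 0xDEADBEEF, 0x12345678
s, cout = block_add(a, b, k=8, m=4)
assert s == (a + b) & 0xFFFFFFFF and cout == (a + b) >> 32
```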
This paper presents a high performance parallel implementation of a hierarchical data clustering algorithm. The OpenMP programming model, either enhanced with our lightweight runtime support or through its tasking model, deals with the high irregularity of the algorithm and allows for efficient exploitation of the inherent loop-level nested parallelism. Thorough experimental evaluation demonstrates the performance scalability of our parallelization and the effective utilization of computational resources, which results in a clustering approach able to provide high quality clustering of very large datasets.
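A hedged sketch of task-style nested parallelism for divisive hierarchical clustering, with Python's concurrent.futures standing in for OpenMP tasks and a toy splitter in place of the actual algorithm:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def bisect(data, seed):
    """Toy splitter: one-step 2-means assignment around two random points."""
    rng = np.random.default_rng(seed)
    c = data[rng.choice(len(data), 2, replace=False)]
    label = np.linalg.norm(data[:, None] - c[None], axis=2).argmin(axis=1)
    return data[label == 0], data[label == 1]

def divisive(data, pool, seed, min_size=50, par_depth=3):
    """Task-style nested parallelism: for the top `par_depth` split
    levels, one half is submitted as a task while the current task
    recurses on the other half. With par_depth=3 at most 7 tasks are
    pending, below the worker count, so no worker deadlocks waiting."""
    if len(data) <= min_size:
        return [data]
    left, right = bisect(data, seed)
    if len(left) == 0 or len(right) == 0:
        return [data]
    if par_depth > 0:
        fut = pool.submit(divisive, left, pool,
                          2 * seed + 1, min_size, par_depth - 1)
        rs = divisive(right, pool, 2 * seed + 2, min_size, par_depth - 1)
        return fut.result() + rs
    return (divisive(left, pool, 2 * seed + 1, min_size, 0)
            + divisive(right, pool, 2 * seed + 2, min_size, 0))

data = np.random.default_rng(7).random((2000, 2))
with ThreadPoolExecutor(max_workers=8) as pool:
    print(len(divisive(data, pool, seed=0)), "leaf clusters")
```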
We propose a functional program skeleton for balanced fixed-degree divide-and-conquer and a method for its parallel implementation on message-passing multiprocessors. In the method, the operations of the skeleton are first mapped to a geometric computational model, which is then mapped to space-time in order to expose the inherent parallelism. This approach is inspired by the method of parallelizing nested loops in the polytope model.
It is widely recognized that a key problem of parallel computation is the development of both efficient and correct parallel software. Although many advanced language features and compilation techniques have been proposed to alleviate the complexity of parallel programming, much effort is still required to develop parallelism in a formal and systematic way. In this paper, we intend to clarify this point by demonstrating a formal derivation of a correct and efficient homomorphic parallel algorithm for a simple language recognition problem known as bracket matching. To the best of our knowledge, our formal derivation leads to a novel divide-and-conquer parallel algorithm for bracket matching.
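The standard homomorphic solution to bracket matching reduces each segment to a pair (unmatched closers, unmatched openers) combined by an associative operator, so segments can be processed independently and merged. The paper derives such an algorithm formally; the sketch below only shows the well-known target shape:

```python
from functools import reduce

def tag(ch):
    """Map each bracket to (unmatched ')', unmatched '(') for a
    one-character segment."""
    return (1, 0) if ch == ')' else (0, 1)

def combine(l, r):
    """Associative combine: openers from the left segment cancel
    closers from the right; what survives describes the concatenation."""
    (lc, lo), (rc, ro) = l, r
    matched = min(lo, rc)
    return (lc + rc - matched, lo + ro - matched)

def balanced(s):
    # reduce(combine) . map(tag) is a list homomorphism, so the string
    # can be chunked, each chunk reduced in parallel, then combined.
    if not s:
        return True
    return reduce(combine, map(tag, s)) == (0, 0)

assert balanced("(()(()))") and not balanced("())(")
```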
Parallel programming continues to be difficult and error-prone, whether starting from specifications or from an existing sequential program. This paper presents (1) a methodology for parallelizing sequential applications and (2) experiments in applying the methodology. The methodology is based on the use of stepwise refinement together with what we call parallel programming archetypes (briefly, abstractions that capture common features of classes of programs), in which most of the work of parallelization is done using familiar sequential tools and techniques, and those parts of the process that cannot be addressed with sequential tools and techniques are addressed with formally-justified transformations. The experiments consist of applying the methodology to sequential application programs, and they provide evidence that the methodology produces correct and reasonably efficient programs at reasonable human-effort cost. Of particular interest is the fact that the aspect of the methodology that is most completely formally justified is the aspect that in practice was the most trouble-free.
Because existing intrusion detection systems (IDS) based on clustering algorithms cannot adapt to the large-scale growth of system logs, a K-medoids clustering intrusion detection algorithm based on differential evolution, suited to cloud computing environments, is proposed. First, the differential evolution algorithm is combined with the K-medoids clustering algorithm so that the powerful global search capability of differential evolution improves the convergence efficiency of large-scale data sample clustering. Second, to further improve the optimization ability of the clustering, a dynamic Gemini population scheme is adopted to improve the differential evolution algorithm, maintaining population diversity while mitigating the tendency to become trapped in local optima. Finally, for intrusion detection on big data, the optimized clustering algorithm is parallelized under the Hadoop MapReduce framework. Simulation experiments were performed in a cluster environment based on the open-source cloud computing framework Hadoop. Experimental results show that the overall detection performance of the proposed algorithm is significantly better than that of existing intrusion detection algorithms.
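A minimal sketch of the combination being described: differential-evolution search over vectors of medoid indices, with the clustering cost as fitness. This is a toy, single-machine version; the dynamic Gemini population and the MapReduce parallelization are not modeled:

```python
import numpy as np

def kmedoids_cost(X, medoid_idx):
    """Clustering cost: each point's distance to its nearest medoid."""
    d = np.linalg.norm(X[:, None] - X[medoid_idx][None], axis=2)
    return d.min(axis=1).sum()

def de_kmedoids(X, k=3, pop=20, gens=100, F=0.8, seed=0):
    """Differential-evolution search over medoid index vectors: mutate
    three random population members (a + F*(b - c)), round and clip to
    valid indices, and keep the trial if it lowers the clustering cost."""
    rng = np.random.default_rng(seed)
    n = len(X)
    P = np.array([rng.choice(n, k, replace=False) for _ in range(pop)])
    cost = np.array([kmedoids_cost(X, ind) for ind in P])
    for _ in range(gens):
        for i in range(pop):
            a, b, c = P[rng.choice(pop, 3, replace=False)]
            trial = np.clip(np.rint(a + F * (b - c)), 0, n - 1).astype(int)
            t_cost = kmedoids_cost(X, trial)
            if t_cost < cost[i]:               # greedy selection
                P[i], cost[i] = trial, t_cost
        # (a MapReduce version would evaluate these costs in parallel)
    return P[cost.argmin()]

X = np.random.default_rng(1).random((300, 2))
print(de_kmedoids(X))                           # indices of chosen medoids
```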