This paper discusses performance studies of a coarse-grained parallel neural-network training code for the control of nonlinear dynamical systems, implemented in the shared-memory and message-passing parallel programming environments OpenMP and MPI, respectively. In addition, these codes are compared to an implementation using SHMEM, the native SGI/Cray data-passing environment for parallel programming. The multiprocessor platform used in the study is an SGI/Cray Origin 2000 with up to 32 processors, which supports all of these programming models efficiently. The dynamical system used in this study is a nonlinear 0D model of a thermonuclear fusion reactor with the EDA-ITER design parameters. The results show that OpenMP outperforms the other two environments when a large number of processors is involved, while yielding similar or slightly poorer behavior for small numbers of processors. As expected, the native SGI/Cray environment outperforms MPI over the entire range of processors used. Reasons for the observed performance are given. The parallel efficiency of the code is always greater than 60%, regardless of the parallel environment, for the range of processors used in this study.
We report our experiences with the optimization and parallelization of a discrete element code for convex polyhedra on multi-core machines and introduce a novel variant of the sort-and-sweep neighborhood algorithm. While in theory the whole code parallelizes ideally, in practice the results on different architectures with different compilers and performance-measurement tools depend very much on the particle number and on the optimization of the code. After difficulties with the interpretation of the speedup and efficiency data were overcome, respectable parallel speedups could be obtained.
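As a rough illustration of the sort-and-sweep neighborhood idea mentioned above, the following OpenMP sketch sorts particle extents along one axis and sweeps them in parallel; the data layout and function names are assumptions for illustration, not the paper's code.

```c
#include <stdlib.h>

typedef struct { double lo, hi; int id; } Interval;   /* particle extent on one axis */

static int cmp_lo(const void *a, const void *b) {
    double d = ((const Interval *)a)->lo - ((const Interval *)b)->lo;
    return (d > 0) - (d < 0);
}

/* Count candidate pairs whose extents overlap (broad phase only). */
long count_overlaps(Interval *iv, int n) {
    qsort(iv, n, sizeof *iv, cmp_lo);                  /* sort phase (serial here) */
    long pairs = 0;
    /* sweep phase: each interval scans forward until the extents separate;
       iterations are independent, so the loop parallelizes directly */
    #pragma omp parallel for reduction(+:pairs) schedule(dynamic, 64)
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n && iv[j].lo <= iv[i].hi; ++j)
            ++pairs;                                    /* narrow-phase test would go here */
    return pairs;
}
```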
Many problems in computer science, especially in artificial intelligence, can be formulated as Constraint Satisfaction Problems (CSPs). This paper presents a parallel implementation of the Forward-Checking algorithm for solving binary CSPs over finite domains. Its main contribution is a simple decomposition strategy for dynamically distributing the search tree among machines. The feasibility and benefit of this approach are studied for a shared-memory model. An implementation is drafted using the emerging OpenMP standard for shared memory, which also controls load balancing. We highlight satisfactory efficiencies obtained without any intricate load-balancing policy. All experiments were carried out on a Silicon Graphics Origin 2000 parallel machine.
In a wide variety of scientific parallel applications, both task and data parallelism must be exploited to achieve the best possible performance on a multiprocessor machine. These applications exhibit coarse-grained task-graph parallelism. Exploiting this task-graph parallelism and combining it with data parallelism can increase the performance of parallel applications considerably, since an additional degree of parallelism is exploited.
The OpenMP standard supports data parallelism but does not support task-graph parallelism. In this paper we present an integration of task-graph parallelism in OpenMP by extending the parallel sections construct with task-index and precedence-relations matrix clauses. There are many ways in which task-graph parallelism can be supported in a programming environment. A fundamental design decision is whether the programmer has to write programs with explicit precedence relations, or whether the generation of precedence relations is delegated to the compiler. One of the benefits provided by parallel programming models like OpenMP is that they liberate the programmer from dealing with the underlying details of communication and synchronization, which are cumbersome and error-prone tasks. If task-graph parallelism is to find acceptance, writing task-graph parallel programs must be no harder than writing data parallel programs; therefore, in our design, precedence relations are described through simple programmer annotations, with implementation details handled by the system.
This paper concludes with a description of several parallel application kernels that were developed to study the practical aspects of task-graph parallelism in OpenMP. The examples demonstrate that exploiting data and task parallelism in a single framework is the key to achieving good performance in a variety of applications.
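The proposed extension attaches task indices and precedence relations to parallel sections. As a rough analogue in standard OpenMP (4.0 and later), the same diamond-shaped task graph can be expressed with task depend clauses; the stages and the dependence pattern below are illustrative only, not the paper's syntax.

```c
void stage_a(double *x) { x[0] = 1.0; }
void stage_b(double *x) { x[1] = x[0] + 1.0; }
void stage_c(double *x) { x[2] = x[0] + 2.0; }
void stage_d(double *x) { x[3] = x[1] + x[2]; }

void run_graph(double *x) {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x[0])                   /* A                  */
        stage_a(x);
        #pragma omp task depend(in: x[0]) depend(out: x[1])  /* B waits for A      */
        stage_b(x);
        #pragma omp task depend(in: x[0]) depend(out: x[2])  /* C waits for A      */
        stage_c(x);
        #pragma omp task depend(in: x[1], x[2])              /* D waits for B and C */
        stage_d(x);
    }   /* the barrier of the parallel region waits for all tasks */
}
```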
The skeletal approach to the development of parallel applications has proved to be one of the most successful and has been widely explored in recent years. The goal of this approach is to develop a methodology for parallel programming based on a restricted set of parallel constructs.
This paper presents llc, a skeletal parallel language, the theoretical model that supports it, and a prototype implementation of its compiler. The language is directive-based, uses a C-like syntax, and supports the most widely used skeletal constructs. llCoMP is a source-to-source compiler for the language built on top of MPI. We evaluate the performance of our prototype compiler on four different parallel architectures and three algorithms, and present results obtained on both shared- and distributed-memory architectures. Our model guarantees the portability of the language to any platform, and its simplicity greatly eases its implementation.
We experimentally compare different parallelization models using MPI and OpenMP for structured adaptive mesh refinement on a shared-memory parallel computer, a SunFire 15K. Because of the dynamic properties of the mesh, no static parallelization model with a fixed number of processes and threads performs best in all stages; different combinations of MPI and OpenMP are preferable in different settings of the application and grid hierarchy. We suggest a new dynamic approach using a mixed MPI-OpenMP model that adapts the number of threads at run time and gives good performance in all stages throughout the whole run as the solution state, i.e. the resolution of the computational grid, changes.
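A minimal sketch of the dynamic idea described above is given here: each MPI process re-chooses its OpenMP thread count before every stage according to its current local workload. The grain size of 50000 points per thread is a hypothetical policy, not the paper's actual heuristic, and the kernel is a placeholder.

```c
#include <mpi.h>
#include <omp.h>

/* Relax the locally owned points; the real stencil update is omitted. */
static void relax_local(double *u, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        u[i] = 0.5 * u[i];                      /* placeholder for the solver kernel */
}

void advance_stage(double *u, long local_points) {
    int max_threads = omp_get_num_procs();
    int want = (int)(local_points / 50000);     /* assumed points-per-thread grain */
    if (want < 1) want = 1;
    if (want > max_threads) want = max_threads;
    omp_set_num_threads(want);                  /* adapted at run time, per stage */

    relax_local(u, local_points);
    MPI_Barrier(MPI_COMM_WORLD);                /* stages are synchronized across ranks */
}
```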
ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message-passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism. These details can be tuned for different platforms to allow user codes to achieve the highest possible performance without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user-transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP and MPI implementations, and in combinations thereof. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP V2200, an Origin 3800, an IBM Regatta and an IBM RS/6000 SP cluster.
This paper describes our research on using Genetic Programming to obtain transition rules for Cellular Automata, which are one type of massively parallel computing system. Our purpose is to determine the existence of a limit of chaos for three-dimensional Cellular Automata, which has been empirically demonstrated for the two-dimensional case. To do so, we must study statistical properties of 3D Cellular Automata over long simulation periods. When dealing with large three-dimensional meshes, applying the transition rule to the whole structure can become an extremely slow task. In this work we decompose the automaton into pieces and use OpenMP to parallelize the process. Results show that, by using a decomposition procedure and distributing the mesh among a set of processors, 3D Cellular Automata can be studied without prohibitively long execution times.
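The following is a minimal sketch of applying a transition rule to a decomposed 3D mesh with OpenMP. The majority-style rule is a placeholder; the rules in the paper are evolved by Genetic Programming, and the mesh size is an arbitrary assumption.

```c
#define N 128

static unsigned char cur[N][N][N], nxt[N][N][N];

void step(void) {
    /* the outer two loops are collapsed so slabs of the mesh are
       distributed among the threads */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int x = 1; x < N - 1; ++x)
        for (int y = 1; y < N - 1; ++y)
            for (int z = 1; z < N - 1; ++z) {
                int alive = cur[x-1][y][z] + cur[x+1][y][z]
                          + cur[x][y-1][z] + cur[x][y+1][z]
                          + cur[x][y][z-1] + cur[x][y][z+1];
                nxt[x][y][z] = (unsigned char)(alive >= 3);  /* placeholder rule */
            }
    /* the cur/nxt buffers are swapped between steps (omitted) */
}
```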
This paper presents a high performance parallel implementation of a hierarchical data clustering algorithm. The OpenMP programming model, either enhanced with our lightweight runtime support or through its tasking model, deals with the high irregularity of the algorithm and allows for efficient exploitation of the inherent loop-level nested parallelism. Thorough experimental evaluation demonstrates the performance scalability of our parallelization and the effective utilization of computational resources, which results in a clustering approach able to provide high quality clustering of very large datasets.
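As a sketch of how OpenMP tasking can expose the nested parallelism of a hierarchical clustering, the example below refines the two halves of a split cluster as independent tasks. The Cluster type, the split() helper and the cut-off of 1024 points are illustrative assumptions, not the paper's algorithm.

```c
typedef struct { int *idx; int n; } Cluster;

int split(const Cluster *c, Cluster *left, Cluster *right);   /* assumed 2-way splitter */

static void cluster_recursive(Cluster c, int min_size) {
    Cluster l, r;
    if (c.n <= min_size || !split(&c, &l, &r))
        return;
    #pragma omp task firstprivate(l) if(l.n > 1024)   /* small halves run inline */
    cluster_recursive(l, min_size);
    #pragma omp task firstprivate(r) if(r.n > 1024)
    cluster_recursive(r, min_size);
    #pragma omp taskwait
}

void cluster(Cluster root, int min_size) {
    #pragma omp parallel          /* create the team once ...           */
    #pragma omp single nowait     /* ... one thread seeds the task tree */
    cluster_recursive(root, min_size);
}
```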
In this paper, we present OmpSs, a programming model based on OpenMP and StarSs that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on different architectures (SMP, GPU, and hybrid SMP/GPU environments), showing the wide applicability of the approach. The evaluation is done with six different benchmarks: Matrix Multiply, BlackScholes, Perlin Noise, Julia Set, PBPI and FixedGrid. We compare the results with executions of the same benchmarks written in OpenCL or OpenMP on the same architectures. The results show that OmpSs greatly outperforms both environments. With OmpSs, the programming environment is more flexible than traditional approaches for exploiting multiple accelerators, and the simplicity of its annotations increases programmer productivity.
The polyhedral model for loop parallelization has proved to be an effective tool for advanced optimization and automatic parallelization of programs written in higher-level languages. Yet, to integrate such optimizations seamlessly into production compilers, they must be performed on the compiler's internal, low-level intermediate representation (IR). With Polly, we present an infrastructure for polyhedral optimizations on such an IR. We describe the detection of program parts amenable to polyhedral optimization (so-called static control parts), their translation to a Z-polyhedral representation, optimizations on this representation, and the generation of optimized IR code. Furthermore, we define an interface for connecting external optimizers and present a novel way of using the parallelism they introduce to generate SIMD and OpenMP code. To evaluate Polly, we compile the PolyBench 2.0 benchmarks fully automatically with PLuTo as the external optimizer and parallelizer, and we can report significant speedups.
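For orientation, the kind of region Polly detects is a static control part: loop bounds and array subscripts are affine, as in the generic kernel below (our own example, not taken from PolyBench).

```c
/* A typical static control part (SCoP): affine bounds and subscripts,
 * detectable and transformable by polyhedral tools. */
void scale_add(int n, double a, double x[n][n], double y[n][n]) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            y[i][j] = a * x[i][j] + y[i][j];   /* fully parallel: a candidate for
                                                  OpenMP and SIMD code generation */
}
```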
Given a mathematical expression in LaTeX or MathML format, a retrieval algorithm extracts similar expressions from a database. In our previous work, we used the Longest Common Subsequence (LCS) algorithm to match two expressions of lengths m and n, which takes O(mn) time. If there are T database expressions, the total complexity is O(Tmn), and an increase in T increases this complexity further. In the present work, we propose to use a parallel LCS algorithm in our retrieval process. Parallel LCS has O(max(m,n)) time complexity with max(m,n) processors, so the total complexity can be reduced to O(T·max(m,n)). For our experiments, an OpenMP-based implementation has been used on an Intel i3 processor with 4 cores. However, for smaller expressions the parallel version takes more time, as the implementation overhead dominates the algorithmic improvement. We therefore propose to use the parallel version selectively, only on larger expressions, in our retrieval algorithm to achieve better performance. We compare the sequential and parallel versions of our mathematical expression (ME) retrieval algorithm and report performance results on a database of 829 mathematical expressions.
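A minimal sketch of a wavefront (anti-diagonal) parallel LCS, one common way to parallelize LCS with OpenMP, is shown below: all cells on one anti-diagonal of the dynamic-programming table are independent, so each diagonal becomes one parallel loop. Variable names are illustrative; the per-diagonal parallel region also illustrates why small expressions gain little, since its overhead dominates.

```c
#include <stdlib.h>

int lcs_length(const char *a, int m, const char *b, int n) {
    int *dp = calloc((size_t)(m + 1) * (n + 1), sizeof *dp);
#define DP(i, j) dp[(size_t)(i) * (n + 1) + (j)]
    for (int d = 2; d <= m + n; ++d) {            /* anti-diagonal index d = i + j */
        int lo = d - n > 1 ? d - n : 1;
        int hi = d - 1 < m ? d - 1 : m;
        #pragma omp parallel for schedule(static)  /* cells on one diagonal are independent */
        for (int i = lo; i <= hi; ++i) {
            int j = d - i;
            if (a[i - 1] == b[j - 1])
                DP(i, j) = DP(i - 1, j - 1) + 1;
            else
                DP(i, j) = DP(i - 1, j) > DP(i, j - 1) ? DP(i - 1, j) : DP(i, j - 1);
        }
    }
    int len = DP(m, n);
#undef DP
    free(dp);
    return len;
}
```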
Auto-scoping in OpenMP has been proposed as a means of relieving the programmer from the non-trivial effort of identifying the data-sharing attributes of variables used within code regions that produce concurrency, such as parallel and task constructs. In this work we reconsider auto-scoping on parallel constructs, including combined parallel-worksharing constructs. We first show that current implementations do not always scope variables correctly in the presence of nested parallel constructs. We then extend the set of rules that guide the auto-scoping decisions so as to handle nested constructs successfully. We also discuss how this functionality is implemented in the OMPi source-to-source OpenMP compiler.
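An illustrative case of the scoping question on nested parallel constructs (our own example, not one from the paper): t is private to each outer thread, but from the inner region's perspective it must be treated as shared by the inner team, so an auto-scoping pass has to reason across both levels.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_max_active_levels(2);              /* enable nested parallelism */
    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();          /* private to each outer thread */
        double sum = 0.0;
        #pragma omp parallel for num_threads(2) reduction(+:sum)
        for (int i = 0; i < 100; ++i)          /* 't' is shared inside the inner team */
            sum += t + i;
        printf("outer thread %d: sum = %g\n", t, sum);
    }
    return 0;
}
```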
A parallel three-dimensional lattice Boltzmann scheme for multicomponent immiscible fluids is proposed to simulate bubble rising and coalescence in viscous flows. The lattice Boltzmann scheme is based on the free-energy model and is parallelized in the shared-memory model using OpenMP. The bubble interface is described by a diffuse-interface method solving the Cahn–Hilliard equation, and both the surface tension force and the buoyancy are introduced in the form of a discrete body force. To avoid the numerical instability caused by the interface deformation, an 18-point finite difference scheme is used to calculate the first- and second-order spatial derivatives. The correctness of the parallel scheme in handling three-dimensional interfaces is verified by the Laplace law and by the dynamic characteristics of an isolated bubble in stationary flows. Subsequently, the effects of the initial relative position, together with the size ratio, on bubble–bubble interaction are studied. The results show that the present scheme can effectively describe the bubble interface dynamics, even when rupture and restructuring occur. In addition to the repulsion and coalescence phenomena caused by the relative position, the size ratio also plays a significant role in bubble deformation and trajectory.
The H.264 video decoder is a good choice for embedded video-processing applications because of its higher compression ratio than MPEG-2, although it has higher run-time computational requirements. Multi-core systems are the future of embedded processor design because of their power efficiency and multi-threaded parallelization capability, and they fit the requirements of such video-processing algorithms well. To simulate and evaluate the performance of these multi-core systems effectively, a system-level design flow is developed. At the higher level, the combination of a TLM language (SystemC) and a shared-memory parallel programming model (OpenMP) is used for transaction-level simulation; at the lower level, a multi-core simulator based on an extension of the SimpleScalar 3.0 tool set is developed for cycle-accurate simulation. Compared with other high-level simulation methods, ours is able to realize truly parallel simulation. Moreover, experiments show that this simulation methodology can effectively simulate complex multi-core applications in a short time (much less than RTL-level simulation) to determine the appropriate core number and task-allocation strategy, and the results deviate by less than 15% from the ideal ones calculated from Amdahl's Law, so the parallelization strategy obtained from such simulation can be carried forward to the RTL-level design of the final multi-core system.
Simulations of environmental flooding problems usually face the scalability problem of large-scale parallel computing. A plain parallelization based on pure MPI has difficulty achieving good scalability because of the large number of domain partitions, so hybrid programming with MPI and OpenMP is introduced to address the scalability issue. This kind of parallel technique can exploit the strengths of both MPI and OpenMP: OpenMP is employed for its efficient fine-grained parallelism, while MPI performs the coarse-grained parallel domain partitioning and handles data communication. In our tests, hybrid MPI/OpenMP parallel programming was used to renovate the finite element solvers in the BIEF library of Telemac. It was found that the hybrid programming helps Telemac deal with the scalability issue.
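A minimal sketch of the hybrid pattern described above: MPI owns one subdomain per rank and exchanges interface values, while OpenMP parallelizes the element loop inside each subdomain. The names, the placeholder kernel and the one-value halo layout are illustrative assumptions, not taken from the BIEF library.

```c
#include <mpi.h>

/* Fine-grained part: OpenMP over the elements of the local subdomain. */
void update_subdomain(const double *x, double *y, int nelem) {
    #pragma omp parallel for schedule(static)
    for (int e = 0; e < nelem; ++e)
        y[e] = 2.0 * x[e];                      /* placeholder element kernel */
}

/* Coarse-grained part: MPI exchange of one interface value with each neighbor. */
void exchange_interfaces(double *y, int n, int left, int right,
                         double *ghost_left, double *ghost_right) {
    MPI_Sendrecv(&y[n - 1], 1, MPI_DOUBLE, right, 0,
                 ghost_left, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&y[0], 1, MPI_DOUBLE, left, 1,
                 ghost_right, 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```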
Smoothed Particle Hydrodynamics (SPH) is fast emerging as a practically useful computational simulation tool for a wide variety of engineering problems. SPH is also gaining popularity as the backbone of fast and realistic animations in graphics and video games. The Lagrangian and mesh-free nature of the method facilitates fast and accurate simulation of material deformation, interface capture, etc. Particle-based methods typically require efficient particle search-and-locate algorithms, as the continuous creation of neighbor-particle lists is a computationally expensive step. Hence, it is advantageous to implement SPH on modern multi-core platforms with the help of High-Performance Computing (HPC) tools. In this work, the computational performance of an SPH algorithm is assessed on a multi-core Central Processing Unit (CPU) as well as on massively parallel General Purpose Graphics Processing Units (GPGPUs). Parallelizing SPH faces several challenges, such as the scalability of the neighbor search process, force calculations, minimizing thread divergence, achieving coalesced memory access patterns, balancing workload, and ensuring optimal use of computational resources. While addressing some of these challenges, performance metrics such as speedup, global load efficiency, global store efficiency, warp execution efficiency, and occupancy are analyzed in detail. The OpenMP and Compute Unified Device Architecture (CUDA) parallel programming models have been used for parallel computing on an Intel Xeon E5-2630 multi-core CPU and on NVIDIA Quadro M4000 and NVIDIA Tesla P100 massively parallel GPU architectures. Standard benchmark problems from the Computational Fluid Dynamics (CFD) literature are chosen for validation. The key concern of identifying a suitable architecture for mesh-less methods, which essentially require a heavy workload of neighbor search and evaluation of local force fields from neighbor interactions, is addressed.
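As a sketch of the OpenMP (CPU) side of an SPH force loop: with a prebuilt neighbor list stored in compressed form, each particle's accumulation is independent and the outer loop parallelizes directly. The structure and the placeholder pair interaction are assumptions, not the paper's kernel.

```c
typedef struct { double x, y, z, fx, fy, fz; } Particle;

void compute_forces(Particle *p, int n,
                    const int *nbr, const int *nbr_start) {
    /* dynamic scheduling helps when neighbor counts are uneven */
    #pragma omp parallel for schedule(dynamic, 128)
    for (int i = 0; i < n; ++i) {
        double fx = 0.0, fy = 0.0, fz = 0.0;
        for (int k = nbr_start[i]; k < nbr_start[i + 1]; ++k) {
            int j = nbr[k];
            fx += p[i].x - p[j].x;              /* placeholder for the SPH kernel */
            fy += p[i].y - p[j].y;
            fz += p[i].z - p[j].z;
        }
        p[i].fx = fx;                           /* each thread writes only particle i */
        p[i].fy = fy;
        p[i].fz = fz;
    }
}
```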
Reverse time migration (RTM), especially for elastic waves, consumes massive computational resources, which limits its wide application in industry. We suggest using the pseudospectral time-domain (PSTD) method in elastic wave RTM. RTM using PSTD can significantly reduce the computational requirements compared with RTM using the traditional finite-difference time-domain (FDTD) method. In addition to the advantage of a low sampling rate with high accuracy, the PSTD method also eliminates the periodicity (or wraparound) limitation caused by the fast Fourier transform in the conventional pseudospectral method. To achieve accurate results, the PSTD method needs only about half the spatial sampling rate of the twelfth-order FDTD method; thus, it can save up to 87.5% of the storage memory and 90% of the computation time compared with twelfth-order FDTD. We implement RTM using PSTD for the elastic wave equations and accelerate it with Open Multi-Processing (OpenMP). To keep the computational load balanced in parallel computation, we design a new PML layout which merges the PMLs at both ends of an axis. The efficiency and imaging quality of the proposed RTM are verified by imaging on 2D and 3D models.
Traditional FPGA placement algorithms based on simulated annealing are time-consuming, so we propose a parallel FPGA timing-driven placement algorithm using an OpenMP + STM programming method. In this paper, we distribute swaps to multiple threads with OpenMP and protect the shared memory using software transactional memory; an improved timing optimization algorithm is also added inside the transaction. Experimental results on the MCNC benchmarks demonstrate that our algorithm achieves a speedup of 1.6x and scales well with the number of threads. It also reduces the critical path delay by an average of 4.2%.
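A minimal sketch of distributing annealing swaps among OpenMP threads follows. The paper protects conflicting placement updates with software transactional memory; here a critical section stands in for the transaction, and the cost model and placement state are trivial placeholders.

```c
#include <math.h>
#include <stdlib.h>

static double placement_cost[1024];                  /* illustrative placement state */

static double swap_delta(int a, int b) {             /* placeholder cost change */
    return placement_cost[b] - placement_cost[a];
}

static void apply_swap(int a, int b) {               /* placeholder state update */
    double t = placement_cost[a];
    placement_cost[a] = placement_cost[b];
    placement_cost[b] = t;
}

void anneal_pass(int nblocks, int nswaps, double temp) {
    #pragma omp parallel for schedule(dynamic)        /* swaps distributed to threads */
    for (int s = 0; s < nswaps; ++s) {
        unsigned seed = (unsigned)s * 2654435761u;    /* per-iteration RNG state */
        int a = (int)(rand_r(&seed) % nblocks);
        int b = (int)(rand_r(&seed) % nblocks);
        #pragma omp critical    /* stand-in for the STM transaction; STM would let
                                   non-conflicting swaps commit concurrently */
        {
            double d = swap_delta(a, b);
            if (d < 0.0 || exp(-d / temp) > (double)rand_r(&seed) / RAND_MAX)
                apply_swap(a, b);
        }
    }
}
```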