The ability to produce malleable parallel applications that can be stopped and reconfigured during execution offers attractive benefits for both the system and the applications. The reconfiguration can take the form of varying the parallelism of the application, changing the data distributions during execution, or dynamically changing the software components involved in the application execution. In distributed and Grid computing systems, migration and reconfiguration of such malleable applications across distributed heterogeneous sites that do not share a common file system provide flexibility for scheduling and resource management in these environments. Present reconfiguration systems do not support migration of parallel applications to distributed locations. In this paper, we discuss a framework for developing malleable and migratable MPI message-passing parallel applications for distributed systems. The framework includes a user-level checkpointing library called SRS and a runtime support system that manages the checkpointed data for distribution to distributed locations. Our experiments and results indicate that parallel applications instrumented with the SRS library were able to achieve reconfigurability while incurring about 15-35% overhead.
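The sketch below shows the kind of user-level checkpoint instrumentation such a framework builds on: key application state is registered and periodically saved so that execution can resume from the last checkpoint after a stop or migration. The plain file-based checkpointing, variable names, and sizes are illustrative assumptions only; they are not the SRS API, which additionally redistributes the checkpointed data when the processor configuration or location changes.

```c
/* Minimal sketch of user-level checkpoint/restart instrumentation in an
 * MPI code.  Illustrative only: the file-based scheme below is NOT the
 * SRS API described in the paper. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000            /* local array size (assumption) */
#define CKPT_INTERVAL 100    /* iterations between checkpoints (assumption) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *u = malloc(N * sizeof(double));
    int start_iter = 0;
    char fname[64];
    snprintf(fname, sizeof fname, "ckpt_rank%d.bin", rank);

    /* On restart (e.g. after a stop or migration), reload registered data. */
    FILE *fp = fopen(fname, "rb");
    if (fp) {
        fread(&start_iter, sizeof start_iter, 1, fp);
        fread(u, sizeof(double), N, fp);
        fclose(fp);
    } else {
        for (long i = 0; i < N; i++) u[i] = 0.0;   /* fresh start */
    }

    for (int iter = start_iter; iter < 10000; iter++) {
        /* ... one step of the parallel computation ... */

        if (iter % CKPT_INTERVAL == 0) {           /* periodic checkpoint */
            fp = fopen(fname, "wb");
            int next = iter + 1;
            fwrite(&next, sizeof next, 1, fp);
            fwrite(u, sizeof(double), N, fp);
            fclose(fp);
        }
    }

    free(u);
    MPI_Finalize();
    return 0;
}
```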
This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization across multiple computing nodes, using OpenACC to manage parallelism within each node and OpenMPI to manage parallelism among nodes. We first discuss the available strategies for maximizing performance, then describe selected relevant details of the code, and finally measure the performance and scaling behavior that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly higher level of performance for this application, but also compares with results measured on other processors.
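As a rough illustration of this programming pattern, the toy sketch below updates a one-dimensional lattice slice on the accelerator with OpenACC while exchanging halo sites between neighbouring ranks with MPI. It is not taken from the staggered-fermion code described in the paper; all sizes, names, and the update rule are assumptions.

```c
/* Toy sketch of the OpenACC-within-node / MPI-between-nodes pattern. */
#include <mpi.h>
#include <stdlib.h>

#define LOCAL_N 4096   /* local lattice sites per rank (assumption) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    /* local slice plus two halo sites */
    double *phi     = malloc((LOCAL_N + 2) * sizeof(double));
    double *phi_new = malloc((LOCAL_N + 2) * sizeof(double));
    for (int i = 0; i < LOCAL_N + 2; i++) phi[i] = (double)rank;

    #pragma acc data copy(phi[0:LOCAL_N+2]) create(phi_new[0:LOCAL_N+2])
    for (int step = 0; step < 100; step++) {
        /* bring boundary sites back to the host for the MPI halo exchange */
        #pragma acc update host(phi[1:1])
        #pragma acc update host(phi[LOCAL_N:1])
        MPI_Sendrecv(&phi[LOCAL_N], 1, MPI_DOUBLE, right, 0,
                     &phi[0],       1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&phi[1],           1, MPI_DOUBLE, left,  1,
                     &phi[LOCAL_N + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        #pragma acc update device(phi[0:1])
        #pragma acc update device(phi[LOCAL_N+1:1])

        /* fine-grained parallel update on the accelerator */
        #pragma acc parallel loop present(phi, phi_new)
        for (int i = 1; i <= LOCAL_N; i++)
            phi_new[i] = 0.5 * (phi[i - 1] + phi[i + 1]);

        #pragma acc parallel loop present(phi, phi_new)
        for (int i = 1; i <= LOCAL_N; i++)
            phi[i] = phi_new[i];
    }

    free(phi); free(phi_new);
    MPI_Finalize();
    return 0;
}
```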
Open MPI's point-to-point communications abstractions, described in this paper, handle several different communications scenarios with a portable, high-performance design and implementation. These abstractions support two types of low-level communication protocols: general-purpose point-to-point interfaces, such as the OpenIB interface, and MPI-like interfaces, such as Myricom's MX library. Support for the first type of protocol makes use of all communications resources available to a given application run, with optional support for communications error recovery. The latter provides an interface layer, relying on the communications library to guarantee correct MPI message ordering and matching. This paper describes the three point-to-point communications protocols currently supported in the Open MPI implementation, together with performance data, including comparisons with other MPI implementations using the OpenIB, MX, and GM communications libraries.
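The small example below, assuming only a standard MPI installation, illustrates from the application's point of view the matching semantics that any such point-to-point layer must preserve: messages are matched by (source, tag, communicator), so messages with different tags may be received in a different order than they were sent. It does not show Open MPI internals.

```c
/* Illustration of MPI point-to-point matching semantics. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int a = 1, b = 2;
        MPI_Request reqs[2];
        /* nonblocking sends: tag 10 first, then tag 20 */
        MPI_Isend(&a, 1, MPI_INT, 1, 10, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&b, 1, MPI_INT, 1, 20, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        int x, y;
        /* the tag-20 message can be received first even though it was
         * sent second: matching is by (source, tag, communicator) */
        MPI_Recv(&y, 1, MPI_INT, 0, 20, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("tag 10 -> %d, tag 20 -> %d\n", x, y);
    }

    MPI_Finalize();
    return 0;
}
```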
Petascale parallel computers with more than a million processing cores are expected to be available in a couple of years. Although MPI is the dominant programming interface today for large-scale systems that, at the highest end, already have close to 300,000 processors, a challenging question for both researchers and users is whether MPI will scale to processor and core counts in the millions. In this paper, we examine the issue of scalability of MPI to very large systems. We first examine the MPI specification itself and discuss areas with scalability concerns and how they can be overcome. We then investigate issues that an MPI implementation must address in order to be scalable. To illustrate the issues, we ran a number of simple experiments to measure MPI memory consumption at scale, up to 131,072 processes, or 80% of the IBM Blue Gene/P system at Argonne National Laboratory. Based on the results, we identified nonscalable aspects of the MPI implementation and found ways to tune it to reduce its memory footprint. We also briefly discuss issues in application scalability to large process counts and features of MPI that enable the use of other techniques to alleviate scalability limitations in applications.
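The sketch below shows one simple way to measure per-process memory growth of an MPI operation at scale; it assumes a Linux-like /proc interface and a communicator-duplication example, and is not the benchmark code used in the paper.

```c
/* Sketch: measure the maximum per-process resident-memory growth caused
 * by an MPI operation, across all ranks.  Assumes Linux /proc/self/statm. */
#include <mpi.h>
#include <stdio.h>

static long rss_pages(void)
{
    long vsize = 0, resident = 0;
    FILE *fp = fopen("/proc/self/statm", "r");
    if (fp) {
        fscanf(fp, "%ld %ld", &vsize, &resident);
        fclose(fp);
    }
    return resident;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long before = rss_pages();

    /* an operation whose memory cost may grow with the process count */
    MPI_Comm dup;
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);

    long after = rss_pages(), delta = after - before, max_delta;
    MPI_Reduce(&delta, &max_delta, 1, MPI_LONG, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("np=%d  max RSS growth: %ld pages\n", size, max_delta);

    MPI_Comm_free(&dup);
    MPI_Finalize();
    return 0;
}
```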
Simulations of environmental flood problems typically face scalability challenges in large-scale parallel computing. A plain parallelization based on pure MPI has difficulty achieving good scalability because of the large number of domain partitions. Therefore, hybrid programming using MPI and OpenMP is introduced to deal with the scalability issue. This hybrid technique exploits the strengths of both MPI and OpenMP: OpenMP provides efficient fine-grained parallelism within each subdomain, while MPI performs the coarse-grained domain partitioning and handles the data communication between subdomains. The hybrid MPI/OpenMP approach was used to renovate the finite element solvers in the BIEF library of Telemac. The tests show that the hybrid programming helps Telemac address the scalability issue.
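A minimal sketch of this hybrid pattern is shown below: MPI handles the coarse-grained decomposition and communication between subdomains, while OpenMP threads perform the fine-grained work inside each subdomain. The loop body is a placeholder, not the BIEF finite element solver, and the sizes are assumptions.

```c
/* Minimal hybrid MPI + OpenMP sketch: one MPI rank per subdomain,
 * OpenMP threads for the fine-grained work inside the subdomain. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 100000   /* mesh nodes per subdomain (assumption) */

int main(int argc, char **argv)
{
    int provided;
    /* FUNNELED: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *u = malloc(LOCAL_N * sizeof(double));
    double local_sum = 0.0, global_sum;

    /* fine-grained, shared-memory parallelism within the subdomain */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < LOCAL_N; i++) {
        u[i] = (double)(rank * LOCAL_N + i);   /* placeholder element work */
        local_sum += u[i];
    }

    /* coarse-grained communication between subdomains */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, global sum = %e\n",
               size, omp_get_max_threads(), global_sum);

    free(u);
    MPI_Finalize();
    return 0;
}
```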
In this paper, we propose the design and development of a fault-tolerance and recovery scheme for the Message Passing Interface (MPI). The proposed scheme consists of a detection mechanism for detecting process failures and a recovery mechanism. Two different cases are considered, both assuming the existence of a monitoring process, the Observer, which triggers the recovery procedure in case of failure. In the first case, each process keeps a buffer with its own message traffic to be used in case of failure, and processes periodically test for failure notifications from the Observer. The recovery function simulates all the communication of the processes with the dead one by re-sending to the replacement process all the messages destined for the dead process. In the second case, the Observer receives and stores all message traffic and sends to the replacement process all the buffered messages destined for the dead process. Solutions are also provided for the dead-communicator problem caused by the death of a process. A description of the prototype developed is provided, along with the results of the experiments performed to evaluate its efficiency and performance.
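The sketch below illustrates only the detection side of such a scheme, using a heartbeat timeout observed by a dedicated Observer rank; the tags, timeouts, and structure are assumptions and do not describe the prototype's actual implementation or its recovery and message-replay logic.

```c
/* Sketch of heartbeat-based failure detection by an Observer process.
 * Workers periodically send "alive" messages; the Observer presumes a
 * worker dead if no heartbeat arrives within a timeout.  Illustrative
 * only; tags, timeouts, and structure are assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define HEARTBEAT_TAG 99
#define TIMEOUT_SEC   5.0
#define RUN_SEC       30.0

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    double start = MPI_Wtime();

    if (rank == 0) {                          /* the Observer */
        double *last_seen = malloc(size * sizeof(double));
        for (int i = 1; i < size; i++) last_seen[i] = start;

        while (MPI_Wtime() - start < RUN_SEC) {
            int flag;
            MPI_Status st;
            MPI_Iprobe(MPI_ANY_SOURCE, HEARTBEAT_TAG, MPI_COMM_WORLD,
                       &flag, &st);
            if (flag) {                       /* heartbeat received */
                int dummy;
                MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, HEARTBEAT_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                last_seen[st.MPI_SOURCE] = MPI_Wtime();
            }
            for (int i = 1; i < size; i++)
                if (MPI_Wtime() - last_seen[i] > TIMEOUT_SEC) {
                    printf("Observer: process %d presumed dead, "
                           "recovery would be triggered here\n", i);
                    last_seen[i] = MPI_Wtime();   /* avoid repeated alarms */
                }
        }
        free(last_seen);
    } else {                                  /* a worker process */
        int alive = 1;
        while (MPI_Wtime() - start < RUN_SEC) {
            /* ... application work would go here ... */
            MPI_Send(&alive, 1, MPI_INT, 0, HEARTBEAT_TAG, MPI_COMM_WORLD);
            sleep(1);                         /* pause between heartbeats */
        }
    }

    /* NOTE: a clean shutdown would also drain pending heartbeats; omitted. */
    MPI_Finalize();
    return 0;
}
```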
The program SWAN (Simulating WAves Nearshore) is a third-generation wave model used to compute the spectra of random, short-crested, wind-generated waves on Eulerian grids. Presently there is no release of the SWAN code that executes on parallel computers, and computations are limited to a single processor. This paper shows how a version of the SWAN code that executes on parallel platforms can easily be created using a simple approximation: the transport equation is solved as a stationary problem at every time level. This is referred to as a quasi time-accurate approximation, to differentiate it from a truly time-accurate computation in which the transport equation is solved as a nonstationary problem. With this approximation, the individual integrations at different times are independent, and coarse-grain parallelism can easily be exploited using the Message Passing Interface (MPI) parallel programming system.
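Because the stationary solves are independent, the coarse-grain parallelism amounts to dealing time levels out to MPI ranks. The sketch below shows this pattern with a placeholder solver; it is not taken from the SWAN code, and the counts and names are assumptions.

```c
/* Sketch of coarse-grain parallelism over independent time levels:
 * each rank solves every size-th stationary problem, with no
 * communication between the independent integrations. */
#include <mpi.h>
#include <stdio.h>

#define NUM_TIME_LEVELS 96   /* number of time levels (assumption) */

static void solve_stationary(int t)
{
    /* placeholder for the stationary transport-equation solve at time t */
    printf("solving time level %d\n", t);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* round-robin distribution of the independent time levels */
    for (int t = rank; t < NUM_TIME_LEVELS; t += size)
        solve_stationary(t);

    MPI_Barrier(MPI_COMM_WORLD);   /* wait until all time levels are done */
    MPI_Finalize();
    return 0;
}
```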
The purpose of the current work was twofold: to compare the efficiencies of several different MD algorithms when implemented on a CUDA-capable GPU, and to study the effects accompanying the coating of a contaminated copper substrate using a CUDA-based program. In this paper, we discuss various aspects of applying CUDA technology, using the real problem of molecular dynamics simulations as an example. The CUDA-based program allowed us to perform detailed studies of the physical processes accompanying the collision of a copper cluster with a copper substrate having a one-atom layer of carbon on its top surface. We found that coating does not occur if the initial velocity of the falling cluster is lower than a critical value. Furthermore, a correlation between the critical initial velocity and the critical angle of incidence of the copper cluster was also observed. Finally, the execution time of the CUDA MD program was compared with that of an MPI program based on one-dimensional parallelization with dynamic load balancing.
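As an illustration of the MPI baseline mentioned in the comparison (not of the CUDA program itself), the sketch below shows a master/worker scheme for one-dimensional parallelization with dynamic load balancing, where the master hands out slabs of the simulation box to whichever worker becomes idle. The slab count, tags, and compute_slab() are assumptions.

```c
/* Sketch of MPI master/worker dynamic load balancing over 1D slabs. */
#include <mpi.h>
#include <stdio.h>

#define NUM_SLABS 64   /* slabs along the decomposition axis (assumption) */
#define TAG_WORK  1
#define TAG_STOP  2

/* placeholder for the MD force/velocity computation on one slab */
static void compute_slab(int slab) { (void)slab; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                 /* master: deals out slabs on demand */
        int next = 0, result;
        MPI_Status st;
        /* seed every worker with one slab (assumes NUM_SLABS >= size-1) */
        for (int w = 1; w < size; w++) {
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            next++;
        }
        int stopped = 0;
        while (stopped < size - 1) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            if (next < NUM_SLABS) {  /* more work: send the next slab */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {                 /* no work left: tell worker to stop */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {                         /* worker: process slabs until stopped */
        int slab;
        MPI_Status st;
        while (1) {
            MPI_Recv(&slab, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            compute_slab(slab);
            MPI_Send(&slab, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```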