
  • articleNo Access

    PERFORMANCE EVALUATION OF PRACTICAL PARALLEL COMPUTER MODEL LogPQ

    Present-day supercomputers will be replaced by massively parallel computers consisting of a large number of processing elements, in order to satisfy the continuously increasing demand for computing power. A practical parallel computing model is expected to help develop efficient parallel algorithms for massively parallel computers. We have therefore presented a practical parallel computation model, LogPQ, which extends the LogP model by taking communication queues into account. This paper addresses the performance of a parallel matrix multiplication algorithm using the LogPQ and LogP models. The parallel algorithm is implemented on the Cray T3E, and its parallel performance is compared with that on the older CM-5. The comparison shows that the communication network of the T3E has better buffering behavior than that of the CM-5, so no extra buffering needs to be prepared on the T3E, although a small effect remains for both send and receive buffering. On the other hand, the effect of message size remains, which shows the need for an overhead and a gap proportional to the message size.
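    The closing remark about overhead and gap growing with message size can be made concrete with a small cost function. This is only a sketch of the textbook LogP timing for a stream of messages; the linear forms for overhead and gap, the class and field names, and all numbers are assumptions for illustration and are not taken from the paper. The queueing part of LogPQ is not modeled.

```python
from dataclasses import dataclass


@dataclass
class LogPParams:
    """Hypothetical LogP parameters with size-dependent overhead and gap."""
    latency: float            # L: network latency of one message
    overhead0: float          # fixed part of the send/receive overhead o
    overhead_per_byte: float  # assumed size-proportional part of o
    gap0: float               # fixed part of the gap g
    gap_per_byte: float       # assumed size-proportional part of g

    def overhead(self, nbytes: int) -> float:
        return self.overhead0 + self.overhead_per_byte * nbytes

    def gap(self, nbytes: int) -> float:
        return self.gap0 + self.gap_per_byte * nbytes


def stream_time(p: LogPParams, n_messages: int, nbytes: int) -> float:
    """Time until the last of n equal-sized messages has been received.

    Successive sends from one processor are spaced by max(g, o); the last
    message then costs send overhead + latency + receive overhead.
    """
    if n_messages <= 0:
        return 0.0
    o = p.overhead(nbytes)
    g = p.gap(nbytes)
    return (n_messages - 1) * max(g, o) + o + p.latency + o
```

    With all per-byte terms set to zero this reduces to the size-independent LogP estimate, so the two variants can be compared against measured timings.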

  • articleNo Access

    PARALLELIZATION AND IMPLEMENTATION OF THE NOABL PROGRAM ON CRAY T3E PARALLEL MACHINE

    This paper presents the parallelization of the NOABL program and its implementation on the CRAY T3E parallel machine. The NOABL program is used to simulate the wind field over complex terrain. The program runs on a processor grid (from 2 to 50 processors). The results obtained show the benefit of parallelizing this program and the SLOR method.

  • articleNo Access

    COMPARATIVE PERFORMANCE STUDY OF PARALLEL PROGRAMMING MODELS IN A NEURAL NETWORK TRAINING CODE

    This paper discusses the performance studies of a coarse-grained parallel neural network training code for control of nonlinear dynamical systems, implemented in the shared memory and message passing parallel programming environments OpenMP and MPI, respectively. In addition, these codes are compared to an implementation utilizing SHMEM, the native SGI/Cray data passing environment for parallel programming. The multiprocessor platform used in the study is an SGI/Cray Origin 2000 with up to 32 processors, which supports all these programming models efficiently. The dynamical system used in this study is a nonlinear 0D model of a thermonuclear fusion reactor with the EDA-ITER design parameters. The results show that OpenMP outperforms the other two environments when a large number of processors is involved, while yielding similar or slightly poorer behavior for a small number of processors. As expected, the native SGI/Cray environment outperforms MPI for the entire range of processors used. Reasons for the observed performance are given. The parallel efficiency of the code is always greater than 60% regardless of the parallel environment for the range of processors used in this study.

  • articleNo Access

    DETECTING SECONDARY BOTTLENECKS IN PARALLEL QUANTUM CHEMISTRY APPLICATIONS USING MPI

    Profiling tools such as gprof and ssrun are used to analyze the run-time performance of a scientific application. The profiling is done in serial and in parallel mode using MPI as the communication interface. The application is a quantum chemistry program using Hartree-Fock theory and Pulay's DIIS method. An extensive set of test cases is taken into account in order to reach uniform conclusions. A known problem with decreased parallel scalability can thus be narrowed down to a single subroutine responsible for the reduction in speedup. The critical module is analyzed and a typical pitfall with triple matrix multiplications is identified. After overhauling the critical subroutine, re-examination of the run-time behavior shows significantly improved performance and markedly improved parallel scalability. The lessons learned here might be of interest to other people working in similar fields with similar problems.

  • articleNo Access

    Portable multi-node LQCD Monte Carlo simulations using OpenACC

    This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization on multiple computing nodes using OpenACC to manage parallelism within the node, and OpenMPI to manage parallelism among the nodes. We first discuss the available strategies for maximizing performance, then describe selected relevant details of the code, and finally measure the level of performance and scaling that we are able to achieve. The work focuses mainly on GPUs, which offer a high level of performance for this application, but also compares with results measured on other processors.

  • articleNo Access

    COMPILING DATA-PARALLEL PROGRAMS TO A DISTRIBUTED RUNTIME ENVIRONMENT WITH THREAD ISOMIGRATION

    The compilation of data-parallel languages is traditionally targeted to low-level runtime environments: abstract processors are mapped onto static system processes, which directly address the low-level communication library. Alternatively, we propose to map each HPF abstract processor onto a "lightweight process" (thread) which can be dynamically migrated between nodes together with the data it manages, under the supervision of some external scheduler. We discuss the pros and cons of such an approach and the facilities which must be provided by the multithreaded runtime. We describe a prototype HPF compiling system built along these lines, based on the Adaptor HPF compiler and using the PM2 multithreaded runtime environment.

  • articleNo Access

    MPI-FT: PORTABLE FAULT TOLERANCE SCHEME FOR MPI

    In this paper, we propose the design and development of a fault tolerance and recovery scheme for the Message Passing Interface (MPI). The proposed scheme consists of a detection mechanism for detecting process failures, and a recovery mechanism. Two different cases are considered, both assuming the existence of a monitoring process, the Observer, which triggers the recovery procedure in case of failure. In the first case, each process keeps a buffer with its own message traffic to be used in case of failure, while the implementor uses periodic tests for notification of failure by the Observer. The recovery function simulates all the communication of the processes with the dead one by re-sending to the replacement process all the messages destined for the dead one. In the second case, the Observer receives and stores all message traffic, and sends to the replacement all the buffered messages destined for the dead process. Solutions are provided to the dead communicator problem caused by the death of a process. A description of the prototype developed is provided along with the results of the experiments performed for efficiency and performance.
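    The first case described above (each process buffers its own outgoing traffic and replays it to a replacement) can be sketched as a small in-memory simulation. This is not MPI code and not the paper's implementation; `Process`, `recover`, and the dictionary standing in for the set of ranks are hypothetical names chosen for illustration. The sketch replays messages per sender and ignores anything the dead process had itself sent.

```python
from collections import defaultdict


class Process:
    """A simulated rank that logs every message it sends, per destination."""

    def __init__(self, rank):
        self.rank = rank
        self.inbox = []                     # (source rank, payload) pairs
        self.sent_log = defaultdict(list)   # dest rank -> payloads, in send order

    def send(self, dest, payload, world):
        self.sent_log[dest].append(payload)             # keep a copy for recovery
        world[dest].inbox.append((self.rank, payload))  # normal delivery


def recover(world, dead_rank):
    """Create a replacement for dead_rank and replay all traffic destined for it."""
    replacement = Process(dead_rank)
    for rank, proc in sorted(world.items()):
        if rank == dead_rank:
            continue
        for payload in proc.sent_log[dead_rank]:
            replacement.inbox.append((rank, payload))
    world[dead_rank] = replacement
    return replacement


# Usage: three ranks, rank 2 "dies" and is rebuilt from the senders' logs.
world = {r: Process(r) for r in range(3)}
world[0].send(2, "a", world)
world[1].send(2, "b", world)
world[0].send(1, "x", world)
world[0].send(2, "c", world)
replacement = recover(world, dead_rank=2)
```

    The second case in the abstract would move `sent_log` out of the individual processes and into a single observer object that sees all traffic.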

  • articleNo Access

    NET-CONSOLE: WEB-BASED DEVELOPMENT ENVIRONMENT FOR PARALLEL PROGRAMS

    Net-Console is an integrated program development environment that can be used as a front-end for High Performance Computing (HPC) sites. It consists of an MPI-aware editor, an execution console, a debugger, monitoring tools and an account and file manager. Using Net-Console, the user is able to edit, execute, debug and evaluate the performance of parallel programs from anywhere on the Internet. The user interface is provided through a normal Java-enabled browser. Net-Console can also support processing through wireless and lightweight devices with the use of mobile agent technologies. The tools included in Net-Console and their functionality, the languages used and the overall structure of the project are presented in this paper.

  • articleNo Access

    SRS: A FRAMEWORK FOR DEVELOPING MALLEABLE AND MIGRATABLE PARALLEL APPLICATIONS FOR DISTRIBUTED SYSTEMS

    The ability to produce malleable parallel applications that can be stopped and reconfigured during execution can offer attractive benefits for both the system and the applications. The reconfiguration can consist of varying the parallelism of the applications, changing the data distributions during execution, or dynamically changing the software components involved in the application execution. In distributed and Grid computing systems, migration and reconfiguration of such malleable applications across distributed heterogeneous sites which do not share common file systems provides flexibility for scheduling and resource management. Existing reconfiguration systems do not support migration of parallel applications to distributed locations. In this paper, we discuss a framework for developing malleable and migratable MPI message-passing parallel applications for distributed systems. The framework includes a user-level checkpointing library called SRS and a runtime support system that manages the checkpointed data for distribution to distributed locations. Our experiments and results indicate that the parallel applications, instrumented with the SRS library, were able to achieve reconfigurability while incurring about 15-35% overhead.

  • articleNo Access

    LLC: A PARALLEL SKELETAL LANGUAGE

    The skeletal approach to the development of parallel applications has proven to be one of the most successful and has been widely explored in recent years. The goal of this approach is to develop a methodology of parallel programming based on a restricted set of parallel constructs.

    This paper presents llc, a parallel skeletal language, the theoretical model that supports the language, and a prototype implementation of its compiler. The language is based on directives, uses a C-like syntax and supports the most widely used skeletal constructs. llCoMP is a source-to-source compiler for the language built on top of MPI. We evaluate the performance of our prototype compiler using four different parallel architectures and three algorithms. We present the results obtained on both shared and distributed memory architectures. Our model guarantees the portability of the language to any platform, and its simplicity greatly eases its implementation.

  • articleNo Access

    DYNAMIC STREAMS FOR EFFICIENT COMMUNICATIONS BETWEEN MIGRATING PROCESSES IN A CLUSTER

    This paper presents a communication system designed to allow efficient process migration in a cluster. The proposed system is generic enough to allow the migration of any kind of stream: socket, pipe, char devices. Communicating processes using IP or Unix sockets are transparently migrated with our mechanisms and they can still efficiently communicate after migration. The designed communication system is implemented as part of Kerrighed, a single system image operating system for a cluster based on Linux. Preliminary performance results are presented.

  • articleNo Access

    EXPERIMENTAL RESULTS ABOUT MPI COLLECTIVE COMMUNICATION OPERATIONS

    Collective communication performance is critical in a number of MPI applications. In this paper we focus on two widely used primitives, broadcast and reduce, and present experimental results obtained on a cluster of PCs connected by InfiniBand. We integrated our algorithms into the MPICH library and used the MPICH implementation of the broadcast and reduce primitives as a baseline for the performance of our algorithms based on α-trees. Our tests show that the MPICH implementation can be improved.

  • articleNo Access

    ARMI: A High Level Communication Library for STAPL

    ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP, MPI implementations and/or a combination thereof. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP-V2200, Origin 3800, IBM Regatta and IBM RS/6000 SP cluster.

  • articleNo Access

    Open MPI: A High Performance, Flexible Implementation of MPI Point-to-Point Communications

    Open MPI's point-to-point communications abstractions, described in this paper, handle several different communications scenarios, with a portable, high-performance design and implementation. These abstractions support two types of low-level communication protocols – general purpose point-to-point communications, like the OpenIB interface, and MPI-like interfaces, such as Myricom's MX library. Support for the first type of protocols makes use of all communications resources available to a given application run, with optional support for communications error recovery. The latter provides an interface layer, relying on the communications library to guarantee correct MPI message ordering and matching. This paper describes the three point-to-point communications protocols currently supported in the Open MPI implementation, supported with performance data. This includes comparisons with other MPI implementations using the OpenIB, MX, and GM communications libraries.

  • articleNo Access

    EXPERIMENTAL EVALUATION OF BSP PROGRAMMING LIBRARIES

    The model of bulk-synchronous parallel computation (BSP) helps to implement portable general purpose algorithms while maintaining predictable performance on different parallel computers. Nevertheless, when programming in ‘BSP style’, the running time of the implementation of an algorithm can be very dependent on the underlying communication library. In this study, an overview of existing approaches to practical BSP programming in C/C++ or Fortran is given and benchmarks were run for the two main BSP-like communication libraries, the Oxford BSP Toolset and PUB. Furthermore, a memory efficient matrix multiplication algorithm was implemented and used to compare their performance on different parallel computers and to evaluate the compliance with predictions by theoretical results.
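    The "predictions by theoretical results" mentioned above usually refer to the standard BSP cost formula, in which a superstep costs w + h·g + l. A minimal sketch of that formula follows; the function names and the example figures are hypothetical, and h is taken here as the largest number of words any processor sends or receives, which is one common convention rather than something stated in the abstract.

```python
def superstep_cost(work, sent, received, g, l):
    """BSP cost of one superstep: w + h*g + l.

    work, sent, received: per-processor lists (local operations, words sent,
    words received). g is the cost per word of communication and l the
    barrier synchronization cost, both in units of local operations.
    """
    w = max(work)
    h = max(max(s, r) for s, r in zip(sent, received))
    return w + h * g + l


def program_cost(supersteps, g, l):
    """Total predicted cost: the sum over all supersteps."""
    return sum(superstep_cost(w, s, r, g, l) for (w, s, r) in supersteps)
```

    Comparing `program_cost` for an algorithm against its measured running time on a given library is one way to check how closely an implementation follows the model, given benchmarked values of g and l for that machine.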

  • articleNo Access

    ON IMPLEMENTING THE FARM SKELETON

    Algorithmic skeletons intend to simplify parallel programming by providing a higher level of abstraction compared to the usual message passing. Task and data parallel skeletons can be distinguished. In the present paper, we will consider several approaches to implement one of the most classical task parallel skeletons, namely the farm, and compare them w.r.t. scalability, overhead, potential bottlenecks, and load balancing. We will also investigate several communication modes for the implementation of skeletons. Based on experimental results, the advantages and disadvantages of the different approaches are shown. Moreover, we will show how to terminate the system of processes properly.
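    One simple way to realize a farm, including the termination step mentioned at the end, is a shared task queue with one stop marker per worker. The sketch below uses Python threads and queues as a stand-in for the message passing that skeleton libraries normally use; it is not one of the implementations compared in the paper, and `farm`, `_STOP`, and `n_workers` are hypothetical names. Exceptions raised inside `worker_fn` are not handled.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to terminate


def farm(tasks, worker_fn, n_workers=4):
    """Apply worker_fn to every task using a pool of workers; keep input order."""
    task_q = queue.Queue()
    result_q = queue.Queue()

    def worker():
        while True:
            item = task_q.get()
            if item is _STOP:          # proper termination: no more work
                break
            index, task = item
            result_q.put((index, worker_fn(task)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    count = 0
    for index, task in enumerate(tasks):   # farmer: hand out tasks on demand
        task_q.put((index, task))
        count += 1
    for _ in threads:                      # one sentinel per worker
        task_q.put(_STOP)
    for t in threads:
        t.join()

    results = [None] * count
    while not result_q.empty():            # safe: all workers have finished
        index, value = result_q.get()
        results[index] = value
    return results
```

    Because idle workers pull the next task themselves, load balancing is demand-driven; the single shared queue is also the potential bottleneck that a distributed variant would try to avoid.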

  • articleNo Access

    RELATIONSHIPS BETWEEN REGULAR AND IRREGULAR COLLECTIVE COMMUNICATION OPERATIONS ON CLUSTERED MULTIPROCESSORS

    We characterize collective communication operations on (clustered) multiprocessor systems in terms of their communication volume, and arrive at useful relationships between regular and irregular operations over sets of processors and sets of cluster-nodes, respectively. We show that regular problems over sets of processors induce corresponding irregular problems over sets of nodes. We hereby identify a symmetric variant of the personalized all-to-all communication problem that might be worth studying in its own right, and discuss an algorithm for solving this problem. From a simple algorithm for the regular all-gather problem over sets of processors, we derive an algorithm for the irregular all-gather problem over both sets of processors and sets of nodes. For communication libraries like MPI, the relationships emphasize the need for efficient algorithms for the irregular collective communication operations.

  • articleNo Access

    QUANTIFYING NETWORK CONTENTION ON LARGE PARALLEL MACHINES

    In the early years of parallel computing research, significant theoretical studies were done on interconnect topologies and topology aware mapping for parallel computers. With the deployment of virtual cut-through, wormhole routing and faster interconnects, message latencies reduced and research in the area died down. This article shows that network topology has become important again with the emergence of very large supercomputers, typically connected as a 3D torus or mesh. It presents a quantitative study on the effect of contention on message latencies on torus and mesh networks.

    Several MPI benchmarks are used to evaluate the effect of the number of hops (links) traversed by messages on their latencies. The benchmarks demonstrate that when multiple messages compete for network resources, link occupancy or contention can increase message latencies by up to a factor of 8 on some architectures. Results are shown for three parallel machines – ANL's IBM Blue Gene/P (Surveyor), ORNL's Cray XT4 (Jaguar) and PSC's Cray XT3 (BigBen). Findings in this article suggest that application developers should now consider interconnect topologies when mapping tasks to processors in order to obtain the best performance on large parallel machines.
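    The hop count that these benchmarks vary can be computed directly from node coordinates. The helper functions below are an illustrative sketch, assuming shortest-path routing and coordinates that lie within the given dimensions; the function names are hypothetical and do not come from the article.

```python
def mesh_hops(a, b):
    """Links on a shortest path between nodes a and b on a mesh (no wraparound)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def torus_hops(a, b, dims):
    """Links on a shortest path on a torus, where each dimension wraps around."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))
```

    For example, on an 8 x 8 x 8 network the nodes (0, 0, 0) and (7, 0, 0) are 7 hops apart on a mesh but only 1 hop apart on a torus, which is why the same task mapping can behave differently on the two topologies.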

  • articleNo Access

    MPI ON MILLIONS OF CORES

    Petascale parallel computers with more than a million processing cores are expected to be available in a couple of years. Although MPI is the dominant programming interface today for large-scale systems that at the highest end already have close to 300,000 processors, a challenging question to both researchers and users is whether MPI will scale to processor and core counts in the millions. In this paper, we examine the issue of scalability of MPI to very large systems. We first examine the MPI specification itself and discuss areas with scalability concerns and how they can be overcome. We then investigate issues that an MPI implementation must address in order to be scalable. To illustrate the issues, we ran a number of simple experiments to measure MPI memory consumption at scale up to 131,072 processes, or 80% of the IBM Blue Gene/P system at Argonne National Laboratory. Based on the results, we identified nonscalable aspects of the MPI implementation and found ways to tune it to reduce its memory footprint. We also briefly discuss issues in application scalability to large process counts and features of MPI that enable the use of other techniques to alleviate scalability limitations in applications.

  • articleNo Access

    REDUCING THE BULK IN THE BULK SYNCHRONOUS PARALLEL MODEL

    For over two decades the dominant means for enabling portable performance of computational science and engineering applications on parallel processing architectures has been the bulk-synchronous parallel programming (BSP) model. Code developers, motivated by performance considerations to minimize the number of messages transmitted, have typically pursued a strategy of aggregating message data into fewer, larger messages. Emerging and future high-performance architectures, especially those seen as targeting Exascale capabilities, provide motivation and capabilities for revisiting this approach. In this paper we explore alternative configurations within the context of a large-scale complex multi-physics application and a proxy that represents its behavior, presenting results that demonstrate some important advantages as the number of processors increases in scale.