In this paper we argue that the next generation of supercomputers will be based on tight-knit clusters of symmetric multiprocessor systems in order to: (i) provide higher capacity at lower cost; (ii) enable easy future expansion; and (iii) ease the development of computational science applications. This strategy recognizes that the current vector supercomputer user community divides roughly into two groups, each of which will benefit from this approach: the "capacity" users (who tend to run production codes aimed at solving the science problems of today) will get better throughput than they do today by moving to large symmetric multiprocessor systems (SMPs), while the "capability" users (who tend to be developing new computational science techniques) will invest the time needed to get high performance from cluster-based parallel systems.
In addition to the technology-based arguments for this strategy, we believe that it also supports a vision for a revitalization of scientific computing. This vision is that an architecture based on commodity components and computer science innovation will: (i) enable very scalable high-performance computing to address high-end computational science requirements; (ii) provide better throughput and a more productive code development environment for production supercomputing; (iii) provide a path to integration with the laboratory and experimental sciences; and (iv) be the basis of an ongoing collaboration between the scientific community, the computing industry, and the research computer science community to provide a computing environment that is compatible with production codes and that grows dynamically in both hardware and software capability and capacity.
We put forward the thesis that the current level of hardware performance and the sophistication of the software environment found in commercial symmetric multiprocessor (SMP) systems, together with advances in distributed systems architectures, make clusters of SMPs one of the highest-performance, most cost-effective approaches to computing available today. The current capacity users of C90-like systems will be served in such an environment by having more of several critical resources than the current environment provides: much more CPU time per unit of real time, larger memory per node, and much larger memory per cluster. The capability users are served by MPP-like performance and an architecture that enables continuous growth into the future. In addition to these primary arguments, secondary advantages of SMP clusters include: the ability to replicate this sort of system in smaller units to provide identical computing environments at the home sites and laboratories of scientific users; the future potential for using the global Internet to interconnect large clusters at a central facility with smaller clusters at other sites to form a very high capability system; and a rapidly growing base of supporting commercial software.
The arguments made to support this thesis are as follows:
(1) Workstation vendors are increasingly turning their attention to parallelism in order to run ever more complex software in their commercial product lines. The pace of development by the "workstation" manufacturers, driven by their very large investment in research and development for hardware and software, is so rapid that special-purpose research aimed only at the high-performance market is no longer able to produce significant advantages over mass-market products. We illustrate this trend and analyze its impact on the current performance of SMPs relative to vector supercomputers.
(2) Several factors also suggest that "clusters" of SMPs will shortly out-perform traditional MPPs, for reasons similar to those mentioned above. The mass-produced network architectures and components being used to interconnect SMP clusters are experiencing technology and capability growth trends similar to those of commodity computing systems. This is due to the economic drivers of the merging of computing and telecommunications technology and the greatly increased demand for high-bandwidth data communication. Very-high-speed general-purpose networks are now being produced for a large market, and the technology is experiencing the same kind of rapid advances as workstation processor technology. The engineering required to build MPPs from special-purpose networks that are integrated in special ways with commercial microprocessors is costly and requires long engineering lead times. This results in delivered MPPs with less capable processors than are being delivered in workstations at the same time.
(3) Commercial software now exists that provides integrated, MPP-style code development and system management for clusters of SMPs, and software architectures and components that will provide even more homogeneous views of clusters of SMPs are now emerging from several academic research groups.
We propose that the next-generation scientific supercomputer center be built from clusters of SMPs, and suggest a strategy for an initial 50 Gflop configuration with incremental increases thereafter to reach a teraflop shortly after the turn of the century. While this cluster uses what is called "network of workstations" technology, the individual nodes are themselves powerful systems that typically have several gigaflops of CPU performance and several gigabytes of memory.
The risks of this approach are analyzed and found to be similar to those of MPPs. That is, the risks lie primarily in software issues that are similar for SMPs and MPPs: namely, in providing a homogeneous view of a distributed-memory system. The argument is made that the capacity of today's large SMPs, taken together with already existing distributed systems software, will provide a versatile and powerful computational science environment. We also address the issues of application availability and code conversion to this new environment even if the homogeneous cluster software environment does not mature as quickly as expected.
The throughput of the proposed SMP cluster architecture is substantial. The job mix is more easily load balanced because of the substantially greater memory size of the proposed cluster implementation as compared to a typical C90. The larger memory allows more jobs to be in the active schedule queue (in memory waiting to execute), and the larger "local" disk capacity of the cluster allows more data and results storage area for executing jobs.
We derive efficient guidelines for scheduling data-parallel computations within a draconian mode of cycle-stealing in networks of workstations. In this computing regimen, (the owner of) workstation A contracts with (the owner of) workstation B to take control of B's processor for a guaranteed total of U time units, punctuated by up to some prespecified number p of interrupts which kill any work A has in progress on B. On the one hand, the high overhead — of c time units — for setting up the communications that supply workstation B with work and receive its results recommends that A communicate with B infrequently, supplying B with large amounts of work each time. On the other hand, the risk of losing work in progress when workstation B is interrupted recommends that A supply B with a long sequence of small bundles of work. In this paper, we derive two sets of scheduling guidelines that balance these conflicting pressures in a way that optimizes, up to low-order additive terms, the amount of work that A is guaranteed to accomplish during the cycle-stealing opportunity. Our non-adaptive guidelines, which employ a single fixed strategy until all p interrupts have occurred, produce schedules that achieve at least units of work. Our adaptive guidelines, which change strategy after each interrupt, produce schedules that achieve at least
(low-order terms) units of work. By deriving the theoretical underpinnings of our guidelines, we show that our non-adaptive schedules are optimal in guaranteed work-output and that our adaptive schedules are within low-order additive terms of being optimal.
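As a rough illustration of the tension these guidelines resolve, the following sketch simulates a cycle-stealing opportunity with a fixed schedule of work bundles; the bundle sizes, the overhead c, and the interrupt times are hypothetical, and the schedules shown are not the optimal ones derived in the paper.

```python
# Illustrative simulation of the cycle-stealing trade-off (not the paper's
# optimal guidelines): each bundle of work costs c time units of setup
# before it runs, and an interrupt kills whatever bundle is in progress.

def completed_work(bundles, interrupts, c, U):
    """Return work completed from `bundles` (work sizes) within U time units,
    given sorted `interrupts` (times at which in-progress work is killed)."""
    t, done, k = 0.0, 0.0, 0          # current time, finished work, next interrupt
    for w in bundles:
        finish = t + c + w            # setup overhead c, then w units of work
        if finish > U:                # contracted time expires mid-bundle
            break
        if k < len(interrupts) and interrupts[k] < finish:
            t = interrupts[k]         # bundle killed; its work is lost
            k += 1
            continue
        t, done = finish, done + w    # bundle delivered successfully
    return done

# Large bundles amortize the overhead c but risk losing more work per interrupt;
# small bundles limit losses but pay the overhead more often.
print(completed_work([20] * 5, [35.0], c=2.0, U=100.0))   # few large bundles
print(completed_work([5] * 20, [35.0], c=2.0, U=100.0))   # many small bundles
```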
Prediction is a critical component in achieving good application execution performance. Developing adequate and accurate prediction models is especially difficult in local-area clustered environments, where resources are distributed and performance varies due to the presence of other users in the system. This paper discusses the use of stochastic values to parameterize cluster application performance models. Stochastic values represent a range of likely behavior and can be used effectively as model parameters. We describe two representations for stochastic model parameters and demonstrate their effectiveness in predicting the behavior of several applications under different workloads on a contended network of workstations.
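A minimal sketch of the idea, assuming a toy compute-plus-communicate performance model and interval-style (mean plus or minus one standard deviation) stochastic parameters; the model, names, and numbers are illustrative, not the paper's.

```python
# Minimal sketch of using stochastic values as model parameters: each
# contended resource is described by an interval of likely behaviour
# (here mean +/- one standard deviation) instead of a single point value.

from dataclasses import dataclass

@dataclass
class Stochastic:
    mean: float
    std: float
    def interval(self):
        return (self.mean - self.std, self.mean + self.std)

def predict_runtime(flops, mbytes, cpu_rate: Stochastic, bandwidth: Stochastic):
    """Predict a range of execution times for a simple compute + communicate model."""
    cpu_lo, cpu_hi = cpu_rate.interval()       # available Mflop/s under load
    bw_lo, bw_hi = bandwidth.interval()        # available MB/s on the shared network
    best = flops / cpu_hi + mbytes / bw_hi
    worst = flops / cpu_lo + mbytes / bw_lo
    return best, worst

# Available CPU and bandwidth vary with competing users on the workstations.
print(predict_runtime(flops=5000.0, mbytes=200.0,
                      cpu_rate=Stochastic(80.0, 20.0),
                      bandwidth=Stochastic(8.0, 3.0)))
```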
Heterogeneity complicates the use of multicomputer platforms. Can it also enhance their performance? How can one measure the power of a heterogeneous assemblage of computers (a "cluster"), both in absolute terms (how powerful is this cluster?) and in relative terms (which cluster is more powerful?)? Is a cluster that has one super-fast computer and the rest of "average" speed more or less powerful than one all of whose computers are "moderately" fast? If you can replace just one computer in a cluster with a faster one, should you replace the fastest or the slowest? A result concerning "worksharing" in heterogeneous clusters provides a highly idealized, yet algorithmically meaningful, framework for studying such questions in a way that admits rigorous analysis and formal proof. We encounter some surprises as we answer the preceding questions (perforce, within the idealized framework). Highlights: (1) If one can replace only one computer in a cluster by a faster one, it is (almost) always most advantageous to replace the fastest one. (2) If the computers in two clusters have the same mean speed, then the cluster with the larger variance in speed is (almost) always more productive (verified analytically for small clusters and empirically for large ones). (3) Heterogeneity can actually enhance a cluster's computing power.
Beowulf-class clusters are gaining more and more interest as low-cost parallel architectures. They deliver reasonable performance at a very reasonable cost compared to classical MPP machines. Parallel applications are usually developed on clusters using MPI/PVM message passing or HPF programming environments. Here we discuss new implementation strategies to support structured parallel programming environments for clusters based on skeletons. The adoption of structured parallel programming models greatly reduces the time spent developing new parallel applications on clusters, and the adoption of our implementation techniques based on macro data flow allows very efficient parallel applications to be developed on clusters. We discuss experiments that demonstrate the full feasibility of the approach.
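To make the skeleton idea concrete, here is a minimal sketch of a task-farm skeleton in Python; it illustrates the structured-programming style only and is not the authors' macro data flow implementation.

```python
# Minimal sketch of a "farm" skeleton: the programmer supplies only the
# worker function; the skeleton handles task distribution across workers.

from multiprocessing import Pool

def farm(worker, tasks, nworkers=4):
    """Apply `worker` to every task in parallel and return results in order."""
    with Pool(processes=nworkers) as pool:
        return pool.map(worker, tasks)

def expensive_kernel(x):          # hypothetical per-task computation
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    print(farm(expensive_kernel, [10_000, 20_000, 30_000, 40_000]))
```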
Coscheduling of communication and computation is considered one of the crucial points in obtaining good performance out of fast communication systems. Various techniques have been examined in the literature, ranging from strict "gang scheduling" of all processes possibly involved in a message exchange to "implicit coscheduling", in which the communication support system may act on the scheduling of sending and receiving processes to try to improve performance without explicit coordination through special-purpose message exchanges. Based on our experience in implementing the GAMMA communication system, we are convinced that some form of coscheduling is needed in order to obtain the best communication performance. However, we believe that most of the approaches described in the literature so far are too simplistic to be really effective. In this paper we point out and classify some of the major problems that a system attempting to coschedule communication and computation should address. We hope to clarify the goals of a coscheduler by taking some of the crucial characteristics of communication into account, and to devise more integrated and coherent strategies for coordinating process scheduling with the choice of communication modes.
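One of the implicit-coscheduling techniques the abstract alludes to can be sketched as a spin-then-block receive: the receiver spins briefly, betting that the sender is currently scheduled on its node as well, and only then yields the CPU. The polling interface and the thresholds below are assumptions, not GAMMA's actual mechanism.

```python
# Sketch of the spin-then-block receive behind "implicit coscheduling":
# spin for a bounded time, then back off and release the CPU to other jobs.

import time

def spin_block_receive(poll, spin_limit=0.0005, sleep_step=0.001):
    """Poll for a message; spin briefly, then back off by sleeping between polls."""
    deadline = time.perf_counter() + spin_limit
    while time.perf_counter() < deadline:     # spin phase: stay scheduled
        msg = poll()
        if msg is not None:
            return msg
    while True:                               # block phase: release the CPU
        msg = poll()
        if msg is not None:
            return msg
        time.sleep(sleep_step)

if __name__ == "__main__":
    import itertools
    arrivals = itertools.chain([None] * 1000, ["hello"])   # simulated late message
    print(spin_block_receive(lambda: next(arrivals)))
```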
In this paper, we present fundamental mechanisms for global process and memory management in an efficient single system image cluster operating system designed to execute workloads composed of high-performance sequential and parallel applications. Their implementation in Kerrighed, our proposed distributed operating system, consists of a set of Linux modules and a patch of less than 200 lines of code to the Linux kernel. Kerrighed is a unique single system image cluster operating system providing the standard Unix interface as well as distributed OS mechanisms such as load balancing across all cluster nodes. Our support for the standard Unix interface includes support for multi-threaded applications and a checkpointing facility for both sequential and shared-memory parallel applications. We present an experimental evaluation of the Kerrighed system and demonstrate the feasibility of the single system image approach at the kernel level.
This paper presents a communication system designed to allow efficient process migration in a cluster. The proposed system is generic enough to allow the migration of any kind of stream: sockets, pipes, and character devices. Communicating processes using IP or Unix sockets are transparently migrated with our mechanisms and can still communicate efficiently after migration. The communication system is implemented as part of Kerrighed, a single system image operating system for clusters based on Linux. Preliminary performance results are presented.
Systems for the archival and retrieval of images are used in many areas, for example in medical applications or news agencies. The technique of dynamic image analysis and comparison enables a detailed search for important image elements such as persons and objects, but it also requires large computational resources. We therefore developed a cluster-based architecture for the efficient storage and comparison of archived images. The initially even, size-based distribution of images over the cluster nodes becomes distorted when the user excludes images from further consideration by applying combined a-priori and dynamically extracted features. This makes it necessary to use workload balancing strategies that take the specific requirements of the image retrieval problem into account. To solve this problem efficiently, we developed a formal problem specification and three workload balancing strategies: LPTR, RBS, and Simulated Annealing. The reduction in system response time due to successful load balancing is evaluated by a series of experimental measurements.
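As an illustration of one class of balancing strategy, the sketch below implements a generic longest-processing-time-first greedy assignment; the name LPTR suggests such a rule, but the authors' exact strategies (including RBS) are not detailed here, so the cost estimates and interface are assumptions.

```python
# Hedged sketch of an LPT-style greedy rebalancing step: remaining
# image-comparison jobs are assigned, largest first, to the node that
# currently carries the least estimated load.

import heapq

def lpt_assign(job_costs, n_nodes):
    """Return per-node job lists that roughly balance total estimated cost."""
    heap = [(0.0, node, []) for node in range(n_nodes)]   # (load, node id, jobs)
    heapq.heapify(heap)
    for job, cost in sorted(job_costs.items(), key=lambda kv: -kv[1]):
        load, node, jobs = heapq.heappop(heap)            # least-loaded node
        jobs.append(job)
        heapq.heappush(heap, (load + cost, node, jobs))
    return {node: jobs for _, node, jobs in heap}

# Estimated comparison cost per surviving image (after a-priori filtering).
print(lpt_assign({"img01": 9.0, "img02": 4.0, "img03": 7.5, "img04": 3.0}, n_nodes=2))
```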
Over recent years, non-rigid registration has become a major issue in medical imaging. It consists of recovering a dense point-to-point correspondence field between two images, and usually takes a long time. This is in contrast to the needs of a clinical environment, where usability and speed are major constraints, leading to the necessity of reducing the computation time from slightly less than an hour to just a few minutes. As financial pressure makes it hard for healthcare organizations to invest in expensive high-performance computing (HPC) solutions, cluster computing proves to be a convenient answer to our computation needs, offering large processing power at low cost. Among the fast and efficient non-rigid registration methods, we chose the demons algorithm for its simplicity and good performance. The parallel implementation decomposes the correspondence field into spatial blocks, each block being assigned to a node of the cluster. We obtained a speedup of 11 using 15 2 GHz PCs connected through a 1 Gbit/s Ethernet network, reducing the computation time from 40 min to 3 min 30 s. In order to further optimize costs and the maintenance load, we investigate in the second part the transparent use of shared computing resources, either through a graphical client or a Web-based one.
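A minimal sketch of the block decomposition, assuming slabs along one axis with a small overlap region for the smoothing step; the block shape, overlap width, field dimensions, and node count are illustrative.

```python
# Sketch of the block decomposition described above: the dense correspondence
# field is cut into slabs along the z axis and each slab (plus a small halo
# of overlap) is handed to one cluster node for its local demons iterations.

import numpy as np

def decompose(field_shape, n_nodes, overlap=2):
    """Yield (node, z_start, z_stop) slabs covering the field's z axis."""
    z = field_shape[0]
    step = (z + n_nodes - 1) // n_nodes
    for node in range(n_nodes):
        lo = max(node * step - overlap, 0)
        hi = min((node + 1) * step + overlap, z)
        yield node, lo, hi

field = np.zeros((128, 256, 256, 3), dtype=np.float32)   # dense 3D displacement field
for node, lo, hi in decompose(field.shape, n_nodes=15):
    block = field[lo:hi]            # slab processed by this node
    print(node, block.shape)
```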
Cluster computing has come to prominence as a cost-effective parallel processing tool for solving many complex computational problems. In this paper, we propose a new timesharing opportunistic scheduling policy to support remote batch job execution over networked clusters, to be used in conjunction with the Condor Up-Down scheduling algorithm. We show that timesharing approaches can be used in an opportunistic setting to improve both mean job slowdown and mean response time with little or no throughput reduction. We also show that the proposed algorithm achieves significant improvements in job response time and slowdown compared to existing approaches and some recently proposed new approaches.
The study presented in this paper highlights an important issue that was the subject of discussion and research about a decade ago and has now gained new interest with the current advances in grid computing and desktop grids. New techniques are being invented for utilizing desktop computers for computational tasks, but, to our knowledge, no other study has explored the availability of these resources. The general assumption has been that the resources exist and that they are available. The study is based on a survey of the availability of resources in an ordinary office environment. The aim of the study was to determine whether there truly are usable, under-utilized networked desktop computers available for non-desktop tasks during off-hours. We found that in more than 96% of the cases the computers in the current investigation were available for the formation of part-time (night and weekend) computer clusters. Finally, we compare the performance of a full-time and a metamorphosic cluster, based on one hypothetical linearly scalable application and a real-world welding simulation.
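A back-of-the-envelope comparison under stated assumptions (10-hour weekday office hours, linear scaling, and the reported 96% availability) gives a feel for how much of a full-time cluster's capacity such a part-time cluster could offer; the office-hours figure is an assumption, not taken from the study.

```python
# Rough capacity estimate for a part-time (night and weekend) cluster,
# under hypothetical assumptions about office hours and linear scaling.

HOURS_PER_WEEK = 7 * 24
office_hours = 5 * 10                      # assumed weekday office hours
off_hours = HOURS_PER_WEEK - office_hours  # nights + weekends
availability = 0.96                        # fraction of machines found usable

fraction = off_hours * availability / HOURS_PER_WEEK
print(f"part-time cluster delivers ~{fraction:.0%} of a full-time cluster's cycles")
# ~67% with these assumptions; the real comparison also includes the overhead
# of reforming ("metamorphosing") the cluster every evening.
```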
STRIKE is an algorithm that predicts protein–protein interactions (PPIs) by determining that proteins interact if they contain similar substrings of amino acids. Unlike other methods for PPI prediction, STRIKE is able to achieve a reasonable improvement over existing PPI prediction methods. Despite its high accuracy as a PPI prediction method, STRIKE has a long execution time and is therefore considered a compute-intensive application. In this paper, we develop and implement a parallel STRIKE algorithm for high-performance computing (HPC) systems. Using a large-scale cluster, the execution time of the parallel implementation of this bioinformatics algorithm was reduced from about a week on a serial uniprocessor machine to about 16.5 h on 16 computing nodes, and down to about 2 h on 128 parallel nodes. Communication overheads between nodes are studied thoroughly.
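The parallelisation pattern can be sketched as an embarrassingly parallel partition of protein pairs across workers; `share_substring` below is a toy stand-in for STRIKE's real scoring function, and multiprocessing stands in for the cluster nodes.

```python
# Hedged sketch of the parallelisation pattern: all-against-all protein
# pair comparisons are independent, so the pairs can simply be partitioned
# across computing nodes.

from itertools import combinations
from multiprocessing import Pool

def share_substring(pair, k=3):
    """Toy predicate: do the two sequences share any k-mer of amino acids?"""
    a, b = pair
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return pair, any(b[i:i + k] in kmers for i in range(len(b) - k + 1))

if __name__ == "__main__":
    proteins = ["MKTAYIAKQR", "GAVLKTAYIM", "QQPLNSTRAE", "TRAEGAVLKQ"]
    with Pool(processes=4) as pool:                 # one worker per node in spirit
        for pair, interacts in pool.map(share_substring, combinations(proteins, 2)):
            print(pair, interacts)
```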
Adaptive techniques can be applied to improve the performance of a beamformer in a cluttered environment. The sequential implementation of an adaptive beamformer, for many sensors and over a wide band of frequencies, presents a serious computational challenge. By coupling each transducer node with a microprocessor, in-situ parallel processing applied to an adaptive beamformer on a distributed system can yield advantages in execution speed, fault tolerance, scalability, and cost. In this paper, parallel algorithms for Subspace Projection Beamforming (SPB), using QR decomposition on distributed systems, are introduced for in-situ signal processing. Performance results from the parallel and sequential algorithms are presented using a distributed system testbed comprised of a cluster of computers connected by a network. The execution times, parallel efficiencies, and memory requirements of each parallel algorithm are presented and analyzed. The results of these analyses demonstrate that parallel in-situ processing holds the potential to meet the needs of future advanced beamforming algorithms in a scalable fashion.
Continuous innovations in adaptive matched-field processing (MFP) algorithms have brought significant increases in computational complexity and resource requirements, making the development and use of advanced parallel processing techniques imperative. In real-time sonar systems operating in severe underwater environments, there is a high likelihood of some part of the system exhibiting defective behavior, resulting in the loss of critical network, processor, and sensor elements and degradation of the beam power pattern. Such real-time sonar systems require high reliability to overcome these challenging problems. In this paper, efficient fault-tolerant parallel algorithms based on coarse-grained domain decomposition methods are developed to meet real-time and reliability requirements on distributed array systems in the presence of processor and sensor element failures. The performance of the fault-tolerant parallel algorithms is experimentally analyzed in terms of beamforming performance, computation time, speedup, and parallel efficiency on a distributed testbed. The performance results demonstrate that these fault-tolerant parallel algorithms can provide real-time, scalable, lightweight, and fault-tolerant implementations of adaptive MFP algorithms on distributed array systems.
Parallel execution is a very efficient means of processing vast amounts of data in a small amount of time. Creating parallel applications has never been easy, and requires much knowledge of the task and of the execution environment used to execute the parallel processes. The process of creating parallel applications can be made easier by using a compiler that automatically parallelises a supplied application. Executing the parallel application is also simplified when a well-designed execution environment is used; such an execution environment transparently provides very powerful operations to the programmer. The aim of this research is to combine a parallelising compiler and an execution environment into a fully automated parallelisation and execution tool. The advantage of using such a fully automated tool is that the user does not need to provide any additional input to gain the benefits of parallel execution. This report presents the tool and shows how it transparently supports the programmer in creating parallel applications and how it supports their execution.
In this paper, we introduce SPM (Software-built Parallel Machines), a model for creating software-based virtual parallel machines. With SPM, an application developer simply selects the required virtual parallel machines from the repository and implements the intended parallel algorithms directly, without any need for complex mappings, as if the required processor interconnections were readily available. In addition, we present an implementation of the SPM model, which provides a systematic way to design new virtual machines. Our experiments show that applications developed using the SPM model and tools give excellent performance compared to applications developed using a generic communication library such as MPI.
Online analytical processing (OLAP) queries normally incur enormous processing overheads due to the huge size of data warehouses, which results in unacceptable response times. Parallel processing using a cluster of workstations has of late emerged as a practical solution to many compute- and data-intensive problems. In this article, we present parallel algorithms for some of the OLAP operators. We have implemented these parallel solutions for a data warehouse built on Oracle and hosted on a cluster of workstations. Our performance studies show that encouraging speedups are achieved.
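One such operator can be sketched as a partition-then-merge aggregation: each node rolls up its horizontal partition of the fact table locally and the partial results are merged. The schema and data below are illustrative, not the paper's Oracle warehouse.

```python
# Hedged sketch of a parallel roll-up: local aggregation per partition,
# followed by a merge of the partial results.

from collections import Counter
from multiprocessing import Pool

def local_rollup(partition):
    """Sum sales per (region, year) over one node's rows."""
    acc = Counter()
    for region, year, sales in partition:
        acc[(region, year)] += sales
    return acc

if __name__ == "__main__":
    partitions = [                                   # rows already spread over nodes
        [("EU", 1999, 120.0), ("US", 1999, 80.0)],
        [("EU", 2000, 150.0), ("EU", 1999, 30.0)],
    ]
    with Pool(processes=len(partitions)) as pool:
        merged = sum(pool.map(local_rollup, partitions), Counter())
    print(dict(merged))
```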
A grid is a geographically distributed resource-sharing environment spanning multiple organizations. The most typical grid resources are clusters with a high performance/cost ratio. In general, these clusters are shared as non-dedicated grid resources, since local users may run their jobs simultaneously. Local jobs are usually queued and processed in a batch mode with uncertain waiting time, while grid jobs always require advance reservations with guaranteed resource allocation.
In this paper, we provide a quantitative analysis of the impact of advance reservations on queued jobs, in terms of job waiting time and resource utilization, respectively. It is observed that advance reservations lead to longer job waiting times and lower resource utilization; that is to say, advance reservations should cost more than queued jobs. In this work, based on quantitative experimental results, an empirical formula for estimating the cost of advance reservations relative to queued jobs is presented. It suggests that, compared with queued jobs, advance reservations should be charged roughly double to compensate for the resource utilization loss, and that if the notice time of an advance reservation falls below a threshold, an additional cost should be applied, since queue waiting time increases further.
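The suggested charging rule can be written down as a small cost function; the factor of two follows the text above, while the surcharge shape, threshold, and rate below are assumptions made for illustration.

```python
# Sketch of the suggested charging rule: an advance reservation is charged
# roughly double the price of an equivalent queued job, with a further
# surcharge when its notice time falls below a threshold.

def reservation_cost(base_cost, notice_time, threshold=4.0, surcharge_rate=0.25):
    """base_cost: price of the same job if queued; notice_time/threshold in hours."""
    cost = 2.0 * base_cost                      # compensate resource utilization loss
    if notice_time < threshold:                 # short notice disrupts the queue more
        cost += surcharge_rate * base_cost * (threshold - notice_time) / threshold
    return cost

print(reservation_cost(base_cost=100.0, notice_time=8.0))   # ample notice: 200.0
print(reservation_cost(base_cost=100.0, notice_time=1.0))   # short notice: 218.75
```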