In this paper, we propose several approaches to the parallel loop scheduling problem on both distributed and shared memory systems. Specifically, we propose adaptive loop scheduling models designed to achieve load balancing with low runtime scheduling, synchronization, and communication overhead. Our models are based on adaptive determination of the chunk size and on exploitation of the processor affinity property, and they consider different configurations (central or local queues, and dynamic or static loop partitioning).
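As a rough illustration of the adaptive chunk-size idea, the following Python sketch implements a guided-self-scheduling-style central queue, where each chunk is a fraction of the remaining iterations; the specific shrinking rule and all names here are illustrative assumptions, not the paper's model:

```python
import threading

def guided_chunks(total_iters, num_procs, min_chunk=1):
    """Yield (start, size) chunks whose size shrinks adaptively:
    each chunk is a fraction of the iterations still remaining."""
    start, remaining = 0, total_iters
    while remaining > 0:
        size = max(min_chunk, remaining // (2 * num_procs))  # assumed rule
        yield (start, size)
        start += size
        remaining -= size

# Central-queue variant: workers pull chunks under a single lock.
lock = threading.Lock()
chunks = guided_chunks(total_iters=10_000, num_procs=4)

def worker(body):
    while True:
        with lock:                      # one synchronization point per chunk
            chunk = next(chunks, None)  # adaptive chunk from the shared queue
        if chunk is None:
            return
        start, size = chunk
        for i in range(start, start + size):
            body(i)

threads = [threading.Thread(target=worker, args=(lambda i: None,))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```

A local-queue variant would replace the single lock with per-processor queues (preserving affinity) plus work stealing, trading some load balance for lower synchronization overhead.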
Most space-sharing parallel computers presently operated by production high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users of these batch-queued resources may choose among several queues (charging different amounts), potentially spread across a number of machines to which they have access. In such a situation, the time a user's job waits in any one batch queue can be a significant portion of the overall time from job submission to job completion. It is thus desirable to predict how long a given job can expect to wait in the queue. Further, it is natural to expect that attributes of an incoming job, specifically the number of processors requested and the amount of time requested, might affect that job's wait time.
In this work, we explore the possibility of generating accurate predictions by automatically grouping jobs having similar attributes using model-based clustering. Moreover, we implement this clustering technique for a time series of jobs so that predictions of future wait times can be generated in real time.
Using trace-based simulation on data from 7 machines over a 9-year period from across the country, comprising over one million job records, we show that clustering either by requested time, requested number of processors, or the product of the two generally produces more accurate predictions than earlier, more naive, approaches and that automatic clustering outperforms administrator-determined clustering.
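As a loose sketch of the approach described above (the data, feature choices, and per-cluster quantile predictor below are assumptions for illustration, not the paper's actual estimator), one could group jobs with a Gaussian mixture model and predict wait times per cluster:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical job trace: (requested_seconds, requested_procs, observed_wait).
rng = np.random.default_rng(0)
req_time = rng.lognormal(8, 1, 5000)
req_procs = rng.integers(1, 512, 5000)
waits = rng.lognormal(6, 1.5, 5000)

# Model-based clustering on job attributes (log scale keeps the mixture sane).
X = np.column_stack([np.log(req_time), np.log(req_procs)])
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
labels = gmm.predict(X)

# Per-cluster 95th-percentile wait as a conservative point prediction.
quantiles = {k: np.quantile(waits[labels == k], 0.95)
             for k in range(gmm.n_components)}

def predict_wait(seconds, procs):
    k = gmm.predict([[np.log(seconds), np.log(procs)]])[0]
    return quantiles[k]

print(predict_wait(3600, 64))
```

In a real-time setting, the mixture would be refit (or incrementally updated) as new jobs complete, so incoming jobs are scored against the current clustering.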
Open MPI's point-to-point communications abstractions, described in this paper, handle several different communications scenarios with a portable, high-performance design and implementation. These abstractions support two types of low-level communication protocols: general-purpose point-to-point interfaces, such as OpenIB, and MPI-like interfaces, such as Myricom's MX library. Support for the first type makes use of all communications resources available to a given application run, with optional support for communications error recovery. The second provides a thin interface layer that relies on the underlying communications library to guarantee correct MPI message ordering and matching. This paper describes the three point-to-point communications protocols currently available in the Open MPI implementation, together with performance data, including comparisons with other MPI implementations using the OpenIB, MX, and GM communications libraries.
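The user-visible semantics these layers must implement can be illustrated with a small point-to-point program; this sketch uses mpi4py on top of an MPI implementation such as Open MPI, and the payloads and tags are arbitrary:

```python
# Run with, e.g.: mpiexec -n 2 python ptp_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Non-blocking sends with distinct tags.
    req1 = comm.isend("first", dest=1, tag=10)
    req2 = comm.isend("second", dest=1, tag=20)
    req1.wait()
    req2.wait()
elif rank == 1:
    # Receives are matched by (source, tag), not by arrival order;
    # the underlying transport layer must implement this matching
    # and preserve MPI's non-overtaking ordering guarantee.
    second = comm.recv(source=0, tag=20)
    first = comm.recv(source=0, tag=10)
    print(first, second)  # -> first second
```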
The Microgrid is a many-core architecture comprising multiple clusters of fine-grained multi-threaded cores. The SVP API supported by the cores allows work to be delegated asynchronously to different clusters of cores, which can be acquired dynamically. We want to explore the execution of complex applications and their interaction with dynamically allocated resources. To date, any evaluation of the Microgrid has used a detailed emulation with a cycle-accurate simulation of the execution time. Although the emulator can be used to evaluate small program kernels, it executes at a rate of only 100K instructions per second, divided over the number of emulated cores. This makes it impractical to evaluate a complex application executing on many cores with dynamic allocation of clusters. To obtain a more efficient evaluation, we have developed a co-simulation environment that executes high-level SVP control code while abstracting the scheduling of the low-level threads using two different techniques. The co-simulation is evaluated for both performance and simulation accuracy.
In parallel systems, a number of joins from one or more queries can be executed either serially or in parallel. While serial execution assigns all processors to each join in turn, parallel execution distributes the joins among clusters formed from subsets of the processors and executes them concurrently. However, data skew may cause load imbalance among processors executing the same join, and some clusters may be overloaded with the more expensive joins. As a result, the completion time can be much longer than expected. In this paper, we propose an algorithm to further reduce the completion time of concurrently executed multiple joins. In this algorithm, all the joins to be executed concurrently are decomposed into a set of tasks, which are ordered by decreasing task size and dynamically acquired by available processors during execution. Our performance study shows that the proposed algorithm outperforms previously proposed approaches, especially when the number of processors is large, the relations are highly skewed, and the relation sizes are large.
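A minimal sketch of the scheduling idea only (not the join processing itself): tasks sorted by decreasing size are acquired by whichever processor becomes free first, which can be simulated with a heap. The task sizes below are illustrative:

```python
import heapq

def schedule(task_sizes, num_procs):
    """Simulate dynamic acquisition: largest tasks first, each grabbed
    by the earliest-free processor. Returns the makespan."""
    tasks = sorted(task_sizes, reverse=True)   # decreasing task size
    free_at = [0.0] * num_procs                # min-heap of processor free times
    heapq.heapify(free_at)
    for size in tasks:
        t = heapq.heappop(free_at)             # earliest-free processor
        heapq.heappush(free_at, t + size)      # it acquires the next task
    return max(free_at)

# Two skewed joins decomposed into unequal tasks (sizes are made up).
tasks = [9.0, 7.5, 4.0, 3.0, 2.5, 1.0, 0.5, 0.5]
print(schedule(tasks, num_procs=3))
```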
In many settings, systems are composed of a group of independent sub-units, each of which produces the same set of outputs by consuming the same set of inputs. Conventional data envelopment analysis (DEA) views such a system as a "black box" and uses the sums of the respective inputs and outputs of all component units to calculate the system efficiency. Various DEA-based models have been developed for decomposing the overall efficiency. This paper further investigates this kind of structure using concepts from cooperative (centralized) and non-cooperative (Stackelberg, or leader-follower) game theory. We show that the existing DEA approaches can be viewed as a centralized model that optimizes the efficiency scores of all sub-units jointly. The proposed leader-follower model is useful when a priority sequence is available for the sub-units. Consider, for example, the evaluation of the relative efficiencies of a set of manufacturing facilities operating multiple work shifts: management may wish to determine not only the overall plant efficiency but also the performance of each shift in some priority sequence. The relationship between the system efficiency and the component efficiencies is also explored. Our approaches are demonstrated with an example whose data set involves the national forests of Taiwan.
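For reference, the conventional black-box baseline that such work generalizes can be computed with the standard input-oriented CCR multiplier model; the following sketch (with made-up data) solves it as a linear program:

```python
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(X, Y, j0):
    """Input-oriented CCR (multiplier form) for DMU j0:
      max u.y0  s.t.  v.x0 = 1,  u.Y_j - v.X_j <= 0 for all j,  u, v >= 0.
    X: (n_dmus, n_inputs), Y: (n_dmus, n_outputs)."""
    n, m = X.shape
    s = Y.shape[1]
    c = np.concatenate([-Y[j0], np.zeros(m)])        # maximize u.y0
    A_ub = np.hstack([Y, -X])                        # u.Y_j - v.X_j <= 0
    A_eq = np.concatenate([np.zeros(s), X[j0]]).reshape(1, -1)  # v.x0 = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=[1.0], bounds=(0, None))
    return -res.fun                                  # efficiency in (0, 1]

# Black-box view: each system's inputs/outputs are summed over sub-units.
X = np.array([[2.0, 3.0], [4.0, 1.0], [3.0, 2.0]])   # 3 systems, 2 inputs
Y = np.array([[1.0], [1.0], [1.0]])                  # 1 output
print([round(ccr_efficiency(X, Y, j), 3) for j in range(3)])
```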
In massively parallel systems, performance gains are often significantly diminished by the inherent communication overhead caused by the message passing that results from the task allocation scheme. In this paper, we present techniques that reduce this communication overhead by both scheduling the communications and determining the routes that messages take within a tightly coupled processor network. Using the recently developed Collision Graph model, static scheduling algorithms are derived that work at compile time to determine the ordering and routing of the individual message transmissions. Since the a priori knowledge of network traffic required by static scheduling may not be available or accurate, this work also considers dynamic scheduling. A novel hybrid technique is presented that operates in a dynamic environment yet uses known information obtained by analyzing the communication patterns. Experiments show significant improvement over baseline techniques.
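One plausible reading of compile-time message scheduling with a collision graph, offered purely as an illustrative assumption (the paper's actual formulation may differ): messages whose routes share a physical link conflict, and greedy graph coloring assigns them non-overlapping time slots:

```python
# Sketch: messages are nodes; an edge joins two messages whose routes
# share a physical link (a potential collision).

def build_collision_graph(routes):
    """routes: {msg_id: set of links used}. Returns an adjacency dict."""
    adj = {m: set() for m in routes}
    msgs = list(routes)
    for i, a in enumerate(msgs):
        for b in msgs[i + 1:]:
            if routes[a] & routes[b]:      # shared link -> collision edge
                adj[a].add(b)
                adj[b].add(a)
    return adj

def greedy_slots(adj):
    """Assign the smallest slot not used by an already-scheduled neighbor."""
    slot = {}
    for m in sorted(adj, key=lambda m: -len(adj[m])):  # densest first
        taken = {slot[n] for n in adj[m] if n in slot}
        slot[m] = next(t for t in range(len(adj) + 1) if t not in taken)
    return slot

routes = {"m1": {("A", "B"), ("B", "C")},
          "m2": {("B", "C"), ("C", "D")},   # collides with m1 on B->C
          "m3": {("A", "D")}}
print(greedy_slots(build_collision_graph(routes)))
```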
We provide the exact expression for the reliability of a system under a Bayesian approach, using beta distributions as both the native and the induced priors at the system level, and allowing for sampling uncertainties, expressed in the form of misclassifications (noise), that can affect the final posterior distribution. Exact 100(1-α)% highest posterior density credible intervals for the system reliability are computed, and comparisons are made with results from approximate methods proposed in the literature.
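A grid-based numerical sketch of the idea, under the assumption of a beta prior, binomial sampling, and known misclassification rates (the exact closed-form expressions are the paper's contribution and are not reproduced here):

```python
import numpy as np
from scipy.stats import beta, binom

def hpd_posterior(x, n, a, b, fp=0.05, fn=0.05, cred=0.95, grid=10_000):
    """Posterior for reliability p under a Beta(a, b) prior when each
    recorded outcome may be misclassified: a 'success' is observed
    with probability q = p*(1-fn) + (1-p)*fp."""
    p = np.linspace(1e-6, 1 - 1e-6, grid)
    q = p * (1 - fn) + (1 - p) * fp          # observed success probability
    post = beta.pdf(p, a, b) * binom.pmf(x, n, q)
    post /= post.sum()
    # HPD: keep grid points from highest density down until cred mass.
    order = np.argsort(post)[::-1]
    keep = order[np.cumsum(post[order]) <= cred]
    return p[keep].min(), p[keep].max()

print(hpd_posterior(x=47, n=50, a=1, b=1))   # ~95% HPD interval for p
```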
The ability to accurately determine the temporal safe region in time-variant reliability analysis is essential for reliability-based design. When stochastic excitations are present and discrete-time approaches are invoked, the errors can be large if one uses only one past safe event (and one new failure event) at each time step. Conversely, when all previous safe events are accumulated and used, the calculations can be time-consuming and the accuracy is still not ensured. In this paper, a minimal, or so-called extreme, limit-state surface (LSS) is obtained to identify the system temporal safe region in an economical manner. To do this, the limit-state surface motion for each failure mode is recorded as a parametric polar plot that provides both the magnitude and the relative angle of the vectors from the origin to the most-likely failure points (MLFPs) in standard normal space. The angle differences provide correlation and the magnitude differences provide importance. At the component level, a few logical policies that compare correlation and magnitude ensure that the safe region is sufficiently recognized. At the system level, the temporal averages of the component-level correlations and magnitudes, along with series or parallel system designations, indicate which failure modes are needed to form the system extreme LSS. The impact of the work includes immediate recognition of the important failure modes and reduced computation for methods such as multi-normal integration. Case studies of both series-system and parallel-system reliability are presented using structural beams excited by stochastic loads and subject to degrading material properties and dimensions. The accuracy of the extreme LSS is clearly demonstrated, and the usefulness of the polar plots as a design tool becomes evident.
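The magnitude and angle recorded in the polar plots come from the most-likely failure points; below is a minimal sketch of locating an MLFP for one assumed, illustrative limit state at one time step, as the point on g(u) = 0 closest to the origin in standard normal space:

```python
import numpy as np
from scipy.optimize import minimize

def mlfp(g, u0):
    """Most-likely failure point: the point on the limit-state surface
    g(u) = 0 closest to the origin in standard normal space."""
    res = minimize(lambda u: u @ u, u0,
                   constraints={"type": "eq", "fun": g})
    u = res.x
    mag = np.linalg.norm(u)              # reliability index (vector magnitude)
    ang = np.arctan2(u[1], u[0])         # polar angle (2-D case)
    return mag, ang

# Assumed 2-D limit state for one failure mode at one time step:
# capacity (degrading over time) minus stochastic demand.
t = 5.0
g = lambda u: (10.0 - 0.2 * t + 1.5 * u[0]) - (4.0 + 2.0 * u[1])
print(mlfp(g, u0=np.array([0.1, 0.1])))  # (magnitude, angle) for polar plot
```

Repeating this at each time step for each failure mode traces out the parametric polar plot described above.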
A Bayesian procedure for estimating the reliability of a complex system of independent series and parallel subsystems is presented. This method accepts either binomial or Poisson test data (perhaps both or neither) as inputs, as well as prior information, at any and all levels in the system (such as the component, subsystem and system levels). Natural conjugate beta and gamma priors are considered. A numerical example concerning the unreliability of the low-pressure coolant injection system of a certain U.S. commercial nuclear-powered boiling water reactor is used to illustrate the procedure. A Mathematica® notebook is available for implementing the method.
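A minimal Monte Carlo sketch of the conjugate-beta portion of such a procedure, with an assumed series/parallel structure and made-up binomial test data (the paper's method also handles Poisson data and priors at multiple levels):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

def post(successes, n, a=1, b=1, size=N):
    """Draws from the Beta posterior for a binomially tested component."""
    return rng.beta(a + successes, b + n - successes, size)

# Assumed system: two parallel pumps in series with one valve.
pump1, pump2 = post(48, 50), post(45, 50)
valve = post(29, 30)

parallel = 1 - (1 - pump1) * (1 - pump2)   # parallel block reliability
system = parallel * valve                  # series combination

print("posterior mean:", system.mean())
print("90% credible interval:", np.quantile(system, [0.05, 0.95]))
```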
A few models in data envelopment analysis (DEA) have been proposed to evaluate the performance of production systems composed of independent parallel production units. In this paper, we develop a parallel DEA model for production systems composed of independent parallel subunits, in which each subunit uses its own inputs and a portion of the shared inputs to produce the final outputs. We also propose models to calculate the efficiency of the whole system along with the efficiencies of its subunits. In addition to the theoretical derivations, a case study of 25 areas of an Iranian bank is used to illustrate the approach.
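As a simplified sketch only: if the shared-input allocation proportions are fixed in advance (whereas the paper treats the allocation within the model), each subunit can be scored as a DMU under standard CCR, and one common aggregation takes an input-weighted average. All data and the aggregation rule below are assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def ccr(X, Y, j0):
    """Standard input-oriented CCR efficiency of DMU j0 (multiplier form)."""
    n, m = X.shape
    s = Y.shape[1]
    c = np.concatenate([-Y[j0], np.zeros(m)])
    res = linprog(c, A_ub=np.hstack([Y, -X]), b_ub=np.zeros(n),
                  A_eq=np.concatenate([np.zeros(s), X[j0]]).reshape(1, -1),
                  b_eq=[1.0], bounds=(0, None))
    return -res.fun

# Two systems, each with two subunits; one own input plus one shared input.
own = np.array([[3.0, 2.0], [4.0, 1.0]])       # own[i, k]: system i, subunit k
shared = np.array([2.0, 3.0])                  # total shared input per system
alpha = np.array([[0.6, 0.4], [0.5, 0.5]])     # fixed allocation proportions
out = np.array([[5.0, 3.0], [6.0, 2.0]])       # subunit outputs

# Treat every subunit (with its allocated share) as a DMU.
X = np.column_stack([own.ravel(), (alpha * shared[:, None]).ravel()])
Y = out.ravel().reshape(-1, 1)
eff = np.array([ccr(X, Y, j) for j in range(X.shape[0])]).reshape(2, 2)

# One common aggregation: input-weighted average of subunit efficiencies.
w = own + alpha * shared[:, None]
system_eff = (eff * w).sum(axis=1) / w.sum(axis=1)
print(eff, system_eff)
```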