
  Bestsellers

  • Article (No Access)

    LOAD BALANCING: A PROGRAMMER’S APPROACH OR THE IMPACT OF TASK-LENGTH PARAMETERS ON THE LOAD BALANCING PERFORMANCE OF PARALLEL PROGRAMS

    We consider the problem of dynamic load balancing in an n-processor parallel system. The scheduling process of a parallel program is modeled by randomly throwing weighted balls into n holes. For a given program A, the ball weights (task lengths) are chosen according to a probability distribution for which we know only some of the following parameters: the expectation μ, the variance σ², the maximum M and the minimum m. From these parameters, we derive an upper bound on the number of tasks that A must generate in order to achieve a load-balancing ratio for which the run-time is optimal up to a factor (1+ε)², for any 0 < ε ≤ 0.5, with very high probability. Using the derived relations, the programmer can control the load balancing of a program by tuning the global parameters of the generated tasks, regardless of the underlying scheduler used by the parallel machine. We also give experimental results from a marine-life simulation in support of our claims.
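
    As a quick illustration of the balls-and-holes model above (not of the paper's derivation), the following Python sketch throws tasks with lengths drawn from a hypothetical distribution onto processors chosen uniformly at random and reports the resulting load-balancing ratio (maximum load over average load); the task counts, the processor count and the uniform task-length distribution are assumptions made only for the example.

        import random

        def load_balance_ratio(num_tasks, n_procs, draw_task_length, seed=0):
            """Throw weighted 'balls' (tasks) into n 'holes' (processors) uniformly
            at random; return max load / average load (1.0 means perfect balance)."""
            rng = random.Random(seed)
            loads = [0.0] * n_procs
            for _ in range(num_tasks):
                loads[rng.randrange(n_procs)] += draw_task_length(rng)
            return max(loads) / (sum(loads) / n_procs)

        if __name__ == "__main__":
            # Hypothetical task-length distribution: uniform on [m, M] = [1, 10].
            draw = lambda rng: rng.uniform(1.0, 10.0)
            for num_tasks in (100, 1000, 10000, 100000):
                ratio = load_balance_ratio(num_tasks, n_procs=64, draw_task_length=draw)
                print("%7d tasks on 64 processors: ratio = %.3f" % (num_tasks, ratio))

    Increasing the number of tasks drives the ratio toward 1, which is the qualitative behaviour that the bound in the abstract quantifies.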

  • Article (No Access)

    LOAD BALANCING IN HOME-BASED SOFTWARE DSMS

    Load balancing is a critical issue for achieving good performance in parallel and distributed systems, yet it has been largely neglected in software DSM research over the past decade. Based on the observation that scientific applications can be classified into two categories, iterative and non-iterative, we propose a dynamic scheduling scheme for each case. For iterative scientific applications, we propose a dynamic task migration technique that integrates computation migration and data migration. For non-iterative scientific applications, we propose affinity-based self-scheduling (ABS), which takes both static and dynamic processor affinity into consideration when scheduling. The target experimental platform is a state-of-the-art home-based DSM system named JIAJIA. Performance evaluation results show that the new task migration scheme improves performance by 36% to 50% compared with a static task allocation scheme in a metacomputing environment, and outperforms the traditional (computation-only) task migration approach by about 12.5% for MAT and by 37.5% for SOR and EM3D, while also achieving higher resource utilization. In comparison with other loop scheduling schemes, ABS achieves the best performance in a metacomputing environment, owing to reduced synchronization overhead and a large reduction in the waiting time caused by load imbalance.
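
    The abstract does not spell out the ABS algorithm; the sketch below illustrates only the classical affinity-scheduling idea it builds on (our own simplified, sequential trace with assumed chunk sizes, not the scheme implemented in JIAJIA): each processor first consumes chunks of its own affinity partition and steals from the most loaded partition once its own is empty.

        from collections import deque

        def affinity_schedule(num_iters, num_procs, chunk_frac=0.25):
            """Sequential trace of affinity-based self-scheduling: iterations are
            pre-partitioned per processor; a processor repeatedly takes a fraction of
            its own queue and steals from the fullest queue when its own runs dry."""
            per = num_iters // num_procs
            queues = [deque(range(p * per, (p + 1) * per)) for p in range(num_procs)]
            trace, remaining, turn = [], num_procs * per, 0
            while remaining:
                p = turn % num_procs
                turn += 1
                src = p if queues[p] else max(range(num_procs), key=lambda q: len(queues[q]))
                take = max(1, int(len(queues[src]) * chunk_frac))
                chunk = [queues[src].popleft() for _ in range(take)]
                remaining -= take
                trace.append((p, chunk[0], take))
            return trace

        if __name__ == "__main__":
            for proc, first, size in affinity_schedule(num_iters=32, num_procs=4)[:8]:
                print("proc %d takes %d iterations starting at %d" % (proc, size, first))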

  • Article (No Access)

    Graph Orientation with Edge Modifications

    The goal of an outdegree-constrained edge-modification problem is to find a spanning subgraph or supergraph H of an input undirected graph G such that either: (Type I) the number of edges in H is minimized or maximized and H can be oriented to satisfy some specified constraints on the vertices’ resulting outdegrees; or: (Type II) among all subgraphs or supergraphs of G that can be constructed by deleting or inserting a fixed number of edges, H admits an orientation optimizing some objective involving the vertices’ outdegrees. This paper introduces eight new outdegree-constrained edge-modification problems related to load balancing called (Type I) MIN-DEL-MAX, MIN-INS-MIN, MAX-INS-MAX, and MAX-DEL-MIN and (Type II) p-DEL-MAX, p-INS-MIN, p-INS-MAX, and p-DEL-MIN. In each of the eight problems, the input is a graph and the goal is to delete or insert edges so that the resulting graph has an orientation in which the maximum outdegree (taken over all vertices) is small or the minimum outdegree is large. We first present a framework that provides algorithms for solving all eight problems in polynomial time on unweighted graphs. Next we investigate the inapproximability of the edge-weighted versions of the problems, and design polynomial-time algorithms for six of the problems on edge-weighted trees.
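
    As background for the kind of feasibility question underlying these problems (a standard flow construction, offered here as our own illustration rather than as part of the paper's framework), one can test in polynomial time whether an unweighted graph admits an orientation with maximum outdegree at most k; the sketch below uses the networkx library.

        import networkx as nx

        def has_orientation_with_max_outdegree(edges, k):
            """True iff the undirected graph given by `edges` can be oriented so that
            every vertex has outdegree <= k.  Flow construction: source -> one node
            per edge (capacity 1), edge node -> its two endpoints (capacity 1),
            vertex -> sink (capacity k); feasible iff the max flow saturates all edges."""
            g, source, sink = nx.DiGraph(), "s", "t"
            for i, (u, v) in enumerate(edges):
                e = ("edge", i)
                g.add_edge(source, e, capacity=1)
                g.add_edge(e, ("vertex", u), capacity=1)
                g.add_edge(e, ("vertex", v), capacity=1)
            for node in list(g.nodes):
                if isinstance(node, tuple) and node[0] == "vertex":
                    g.add_edge(node, sink, capacity=k)
            flow_value, _ = nx.maximum_flow(g, source, sink)
            return flow_value == len(edges)

        if __name__ == "__main__":
            triangle = [(0, 1), (1, 2), (2, 0)]
            print(has_orientation_with_max_outdegree(triangle, 1))             # True: orient as a cycle
            print(has_orientation_with_max_outdegree(triangle + [(0, 2)], 1))  # False: 4 edges, 3 vertices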

  • Article (No Access)

    A Scalable Diffusion Algorithm for Dynamic Mapping and Load Balancing on Networks of Arbitrary Topology

    The problems of mapping and load balancing applications on arbitrary networks are considered. A novel diffusion algorithm is presented to solve the mapping problem. It complements the well-known diffusion algorithms for load balancing, which have enjoyed success on massively parallel computers (MPPs). Mapping is more difficult on interconnection networks than on MPPs because of the variations which occur in network topology. Popular mapping algorithms for MPPs which depend on recursive topologies are not applicable to irregular networks. The most celebrated of these MPP algorithms use information from the Laplacian matrix of a graph of communicating processes. The diffusion algorithm presented in this paper is also derived from this Laplacian matrix. The diffusion algorithm works on arbitrary network topologies and is dramatically faster than the celebrated MPP algorithms. It is delay and fault tolerant. Time to convergence depends on initial conditions and is insensitive to problem scale. This excellent scalability, among other features, makes the diffusion algorithm a viable candidate for dynamically mapping and load balancing not only existing MPP systems but also large distributed systems like the Internet, small cluster computers, and networks of workstations.
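
    For readers unfamiliar with diffusion load balancing, the generic first-order iteration that such schemes build on looks roughly like the sketch below (a textbook iteration on an arbitrary adjacency structure, not the mapping algorithm of the paper; the step size alpha and the ring example are assumptions).

        def diffuse(loads, neighbors, alpha=0.25, steps=50):
            """Generic first-order diffusion: in every step, each node exchanges a
            fixed fraction alpha of its load difference with every neighbor.
            loads     : list of work amounts, one per node
            neighbors : adjacency list, neighbors[i] = iterable of node indices"""
            w = list(loads)
            for _ in range(steps):
                delta = [0.0] * len(w)
                for i, nbrs in enumerate(neighbors):
                    for j in nbrs:
                        delta[i] += alpha * (w[j] - w[i])
                w = [wi + di for wi, di in zip(w, delta)]
            return w

        if __name__ == "__main__":
            # 5-node ring with all the work initially on node 0.
            ring = [[(i - 1) % 5, (i + 1) % 5] for i in range(5)]
            print([round(x, 2) for x in diffuse([100, 0, 0, 0, 0], ring)])   # ~20 on every node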

  • Article (No Access)

    AN EXTENDED DIMENSION ORDER TOKEN DISTRIBUTION ALGORITHM ON k-Ary d-CUBES AND ITS COMPLEXITY

    Parallel programs that dynamically generate data generally need good load balancing algorithms. The token distribution problem is a generalization of the static load balancing problem. We solve this problem with an algorithm, called the dimension order token distribution algorithm, designed for k-ary d-cube topologies. The algorithm is formally stated and proved correct. We analyze its message-passing and computational complexities and show that it is optimal. We also discuss generalizations of this algorithm, showing how it can be adapted to handle tokens of different kinds or to let the sender of a token know the destination of the migrating token.
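
    The algorithm itself is not reproduced in this abstract; purely as intuition for a dimension-order approach (a sketch that ignores token integrality and message scheduling, with an assumed random initial distribution), balancing a k-ary d-cube one dimension at a time, ring by ring, leaves every node with the global average.

        import numpy as np

        def dimension_order_balance(tokens):
            """Balance a k-ary d-cube dimension by dimension: along each dimension,
            every ring of k nodes replaces its token counts by their ring average
            (reached in practice by passing tokens around the ring).  The sketch
            works on a d-dimensional numpy array and ignores integrality."""
            t = tokens.astype(float)
            for dim in range(t.ndim):
                ring_avg = t.mean(axis=dim, keepdims=True)
                t = np.broadcast_to(ring_avg, t.shape).copy()
            return t

        if __name__ == "__main__":
            rng = np.random.default_rng(1)
            tokens = rng.integers(0, 20, size=(4, 4, 4))        # a 4-ary 3-cube
            balanced = dimension_order_balance(tokens)
            print("before: min %d, max %d" % (tokens.min(), tokens.max()))
            print("after : min %.2f, max %.2f" % (balanced.min(), balanced.max()))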

  • Article (No Access)

    DETERMINISTIC BRANCH-AND-BOUND ON DISTRIBUTED MEMORY MACHINES

    The branch-and-bound problem involves determining the leaf of minimum cost in a cost-labelled, heap-ordered tree, subject to the constraint that only the root is known initially and that the children of a node are revealed only by visiting their parent. We present the first efficient deterministic algorithm to solve the branch-and-bound problem for a tree T of constant degree on a p-processor distributed-memory Optically Connected Parallel Computer (OCPC). Let c* be the cost of the minimum-cost leaf in T, and let n and h be the number of nodes and the height, respectively, of the subtree T*⊆T of nodes with cost at most c*. When accounting for both computation and communication costs, our algorithm runs in time O(n/p + h(max{p, log n log p})²) for general values of n, and can be made to run in time O((n/p + h log⁴ p) log log p) for n polynomial in p. For large ranges of the relevant parameters, our algorithm is provably optimal and improves asymptotically upon the well-known randomized strategy of Karp and Zhang.
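
    To make the problem statement concrete, here is a minimal sequential best-first baseline for the branch-and-bound problem as defined above (the expand callback and the toy tree are hypothetical; the paper's contribution, the parallel OCPC algorithm, is not reproduced here). Because the tree is heap-ordered, the first leaf popped from the priority queue is the minimum-cost leaf, and only nodes of T* are ever expanded.

        import heapq

        def best_leaf_cost(root_cost, expand):
            """Sequential best-first search of a heap-ordered tree.  expand(node)
            returns [] for a leaf, otherwise (cost, child) pairs whose costs are
            at least the parent's cost.  Returns the minimum leaf cost c*."""
            frontier, counter = [(root_cost, 0, "root")], 1
            while frontier:
                cost, _, node = heapq.heappop(frontier)
                children = expand(node)
                if not children:                 # heap order: first leaf popped is optimal
                    return cost
                for child_cost, child in children:
                    heapq.heappush(frontier, (child_cost, counter, child))
                    counter += 1
            raise ValueError("tree has no leaves")

        if __name__ == "__main__":
            # Toy binary tree of depth 3: a node is (cost, depth); each child adds 1 or 2.
            def expand(node):
                cost, depth = (0, 0) if node == "root" else node
                if depth == 3:
                    return []
                return [((cost + d, depth + 1), (cost + d, depth + 1)) for d in (1, 2)]
            print(best_leaf_cost(0, expand))     # prints 3: always take the cheaper child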

  • Article (No Access)

    MOLECULAR DYNAMICS SIMULATIONS OF GRANULAR MATERIALS ON THE INTEL iPSC/860

    In this paper we present an efficient algorithm for performing Molecular Dynamics simulations on a distributed-memory parallel computer, the Intel iPSC/860. The proposed model describes the flow properties of granular materials in two dimensions. The specific implementation on a 32-node iPSC/860, especially the message-passing and load balancing algorithms, is discussed in detail. Performance data are shown for different computers and for varying numbers of iPSC/860 nodes. As a physical example, we calculate some properties of the outflow behavior of a two-dimensional hopper, and we discuss possible extensions of our model to three dimensions. Our simulations show that Molecular Dynamics simulations can be implemented quite efficiently on a distributed-memory parallel computer, provided that one ensures load balancing and optimizes internode communication.

  • Article (No Access)

    Load Balancing Using Processor Groups

    This paper defines a group partitioning model with a view to improving load balancing in a distributed system. Using modelling and simulation, we analyze the impact of this new partitioning technique on load balancing strategies. We show that a strategy based on group preference yields efficient results.

  • Article (No Access)

    LOCALITY-PRESERVING LOAD-BALANCING MECHANISMS FOR SYNCHRONOUS SIMULATIONS ON SHARED-MEMORY MULTIPROCESSORS

    In the past decade, many synchronous algorithms have been proposed for parallel discrete-event simulations. However, the actual performance of these algorithms has been far from ideal, especially when event granularity is small. Barring the case of low parallelism in the given simulation models, one of the main reasons for low speedups is uneven load distribution among processors. To remedy this, both static and dynamic load balancing approaches have been proposed. Nevertheless, static schemes based on partitioning of LPs (logical processes) are often sensitive to the dynamic behavior of the specific simulation models and are therefore application-dependent; dynamic load balancing schemes, on the other hand, often suffer from loss of locality and hence cache misses, which can severely penalize fine-grained event processing. In this paper, we present several new locality-preserving load balancing mechanisms for synchronous simulations on shared-memory multiprocessors. We focus on the type of synchronous simulation in which the number of LPs to be processed within a cycle decreases monotonically. We show both theoretically and empirically that some of these mechanisms incur very low overhead. The mechanisms have been implemented using MIT's Cilk and tested with a number of simulation applications. The results confirm that one of the new mechanisms is indeed more efficient and scalable than common existing approaches.

  • Article (No Access)

    COMPILING DATA-PARALLEL PROGRAMS TO A DISTRIBUTED RUNTIME ENVIRONMENT WITH THREAD ISOMIGRATION

    The compilation of data-parallel languages is traditionally targeted to low-level runtime environments: abstract processors are mapped onto static system processes, which directly address the low-level communication library. Alternatively, we propose to map each HPF abstract processor onto a "lightweight process" (thread) which can be dynamically migrated between nodes together with the data it manages, under the supervision of some external scheduler. We discuss the pros and cons of such an approach and the facilities which must be provided by the multithreaded runtime. We describe a prototype HPF compiling system built along these lines, based on the Adaptor HPF compiler and using the PM2 multithreaded runtime environment.

  • Article (No Access)

    Optimal Diffusion Schemes and Load Balancing on Product Graphs

    We discuss nearest-neighbor load balancing schemes on processor networks that are represented by a cartesian product of graphs, and we present a new optimal diffusion scheme for general graphs. In the first part of the paper, we introduce the Alternating-Direction load balancing scheme, which reduces the number of load-balancing iterations by a factor of 2 on cartesian products of graphs. The resulting flow is analyzed theoretically and can be very high in certain cases. We therefore also present the Mixed-Direction scheme, which needs the same number of iterations but in most cases computes a much smaller flow. In the second part of the paper, we present a simple optimal diffusion scheme for general graphs that calculates a balancing flow which is minimal in the l2 norm. It is based on the spectrum of the graph representing the network and needs only m-1 iterations to balance the load, where m is the number of distinct eigenvalues. Known optimal diffusion schemes have the same performance; however, the optimal scheme presented in this paper can be implemented in a very simple manner. The number of iterations of optimal diffusion schemes is independent of the load scenario, which makes them practical for networks that represent graphs with known spectra. Finally, our experiments show that the new optimal scheme can successfully be combined with the Alternating-Direction and Mixed-Direction schemes for efficient load balancing on product graphs.
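
    The optimal scheme is easy to illustrate: if lambda_1, ..., lambda_{m-1} are the distinct non-zero eigenvalues of the network's Laplacian, then performing one diffusion sweep with step size 1/lambda_k for each k removes every unbalanced component, so the load is exactly balanced after m-1 iterations. The numpy sketch below is our own check of this well-known construction on a small mesh, not code from the paper.

        import numpy as np

        def optimal_diffusion(load, adjacency):
            """Balance `load` on the graph with the given 0/1 adjacency matrix using
            one diffusion sweep per distinct non-zero Laplacian eigenvalue:
                w <- w - (1/lambda_k) * L w
            After all m-1 sweeps, every node holds the average load."""
            A = np.asarray(adjacency, dtype=float)
            L = np.diag(A.sum(axis=1)) - A
            eigvals = np.linalg.eigvalsh(L)
            distinct = sorted({round(float(ev), 6) for ev in eigvals if ev > 1e-6})
            w = np.asarray(load, dtype=float)
            for lam in distinct:
                w = w - (L @ w) / lam
            return w

        if __name__ == "__main__":
            # 3x3 mesh (cartesian product of two 3-node paths), all load on one corner.
            path = np.zeros((3, 3))
            path[0, 1] = path[1, 0] = path[1, 2] = path[2, 1] = 1
            mesh = np.kron(path, np.eye(3)) + np.kron(np.eye(3), path)
            load = np.zeros(9)
            load[0] = 90.0
            print(np.round(optimal_diffusion(load, mesh), 6))   # 10.0 on every node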

  • Article (No Access)

    A Truthful Load Balancing Mechanism with Verification

    In this paper we investigate the problem of designing load balancing protocols for distributed systems involving self-interested participants. These participants have their own requirements and objectives and no a priori motivation for cooperation, and their selfish behavior may lead to poor performance and inefficiency. To address this problem, we design a load balancing mechanism with verification that provides incentives for participants to report their true parameters and follow the given algorithm. We prove that our load balancing mechanism is truthful (i.e., agents are better off reporting their true parameters) and satisfies the voluntary participation condition (i.e., truthful agents never incur a loss). We present a simulation study to show the performance of our load balancing mechanism.

  • Article (No Access)

    A 1D CELLULAR AUTOMATON THAT MOVES PARTICLES UNTIL REGULAR SPATIAL PLACEMENT

    We consider a finite cellular automaton with particles, where each site can host at most one particle. Starting from an arbitrary initial configuration, the goal is to move particles between neighboring sites until the distance to the nearest particle is globally minimized; such a configuration corresponds to a regular placement. This problem is a cellular-automaton equivalent of load balancing in parallel computing, where each task is a particle and each processor is a connected set of sites. We present a cellular automaton rule that solves this problem in the 1D case and is convergent, i.e., once the regular placement is achieved, the configuration no longer changes. The rule is inspired by Lloyd's algorithm, which computes a centroidal Voronoi tessellation. The dynamics of the rule are described at a higher level using self-explanatory space-time diagrams. These exhibit signals that act as quantities of movement carrying the energy of the system. Each signal bounces off or passes through particles, causing them to move, until it eventually reaches the border and vanishes. When all signals have vanished, the particles are regularly placed.
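
    The cellular-automaton rule itself is only sketched in this abstract; to see the Lloyd-style target it converges to, the toy iteration below (a plain 1D centroid update on a continuous segment, written as an assumed illustration rather than the automaton rule) moves each particle to the centre of its Voronoi cell until the placement is regular.

        def lloyd_1d(positions, length, iterations=200):
            """Toy 1D Lloyd iteration on the segment [0, length]: each particle moves
            to the centroid of its Voronoi cell, bounded by the midpoints with its
            neighbours and by the segment ends.  The fixed point is a regular placement."""
            xs = sorted(positions)
            n = len(xs)
            for _ in range(iterations):
                new = []
                for i, x in enumerate(xs):
                    left = 0.0 if i == 0 else (xs[i - 1] + x) / 2.0
                    right = float(length) if i == n - 1 else (x + xs[i + 1]) / 2.0
                    new.append((left + right) / 2.0)          # centroid of the cell [left, right]
                xs = new
            return xs

        if __name__ == "__main__":
            print([round(x, 2) for x in lloyd_1d([1, 2, 3, 20], length=20)])
            # converges to the evenly spread placement [2.5, 7.5, 12.5, 17.5]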

  • Article (No Access)

    OBSERVING THE IMPACT OF MULTIPLE METRICS AND RUNTIME ADAPTATIONS ON BSP PROCESS RESCHEDULING

    Process rescheduling is a useful mechanism for runtime load balancing, especially in dynamic and heterogeneous environments. In this context, we developed a model called MigBSP which controls process migration in BSP (Bulk Synchronous Parallel) applications. A BSP application is divided into one or more supersteps, each containing computation and communication phases followed by a barrier synchronization. Since the barrier waits for the slowest process, MigBSP's objective is to adjust process placement in order to reduce superstep times. Its novel ideas are twofold. The first is the combination of three metrics (Memory, Computation and Communication) to measure the Potential of Migration of each BSP process. The second is a set of efficient adaptations that control the rescheduling frequency. Together, these ideas make MigBSP a viable model for improving the performance of BSP applications, while imposing low overhead on application execution when no migrations take place. This paper presents MigBSP's algorithms, the parallel machine organization, experimental results and related work.
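
    The abstract does not give the exact combination of the three metrics; purely to illustrate the idea of fusing them into a single migration score (the weights, the sign convention and the function below are assumptions for illustration, not MigBSP's actual formula), one might write:

        def potential_of_migration(comp_time, comm_bytes_to_target, memory_bytes,
                                   target_bandwidth, w_comp=1.0, w_comm=1.0, w_mem=1.0):
            """Hypothetical Potential-of-Migration score for one BSP process and one
            candidate target node: computation and communication pressure argue for
            moving the process, while the cost of shipping its memory argues against.
            A higher score marks a stronger migration candidate; weights are assumed."""
            comm_gain = comm_bytes_to_target / target_bandwidth   # time saved by co-location
            mem_cost = memory_bytes / target_bandwidth            # time to move the process state
            return w_comp * comp_time + w_comm * comm_gain - w_mem * mem_cost

        if __name__ == "__main__":
            # A slow process that communicates heavily with the target but has little state.
            print(potential_of_migration(comp_time=4.2, comm_bytes_to_target=8e6,
                                         memory_bytes=2e6, target_bandwidth=1e8))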

  • Article (No Access)

    ASYMPTOTIC PEAK UTILISATION IN HETEROGENEOUS PARALLEL CPU/GPU PIPELINES: A DECENTRALISED QUEUE MONITORING STRATEGY

    Widespread heterogeneous parallelism is unavoidable given the emergence of General-Purpose computing on graphics processing units (GPGPU). The characteristics of a Graphics Processing Unit (GPU)—including significant memory transfer latency and complex performance characteristics—demand new approaches to ensuring that all available computational resources are efficiently utilised. This paper considers the simple case of a divisible workload based on widely-used numerical linear algebra routines and the challenges that prevent efficient use of all resources available to a naive SPMD application using the GPU as an accelerator. We suggest a possible queue monitoring strategy that facilitates resource usage with a view to balancing the CPU/GPU utilisation for applications that fit the pipeline parallel architectural pattern on heterogeneous multicore/multi-node CPU and GPU systems. We propose a stochastic allocation technique that may serve as a foundation for heuristic approaches to balancing CPU/GPU workloads.
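
    As a hedged sketch of what a queue-aware stochastic allocation could look like (our reading of the idea, with invented device names and throughput figures, not the strategy evaluated in the paper), each incoming work block is dispatched to a device with probability that grows with the device's observed throughput and shrinks with its current queue length.

        import random

        def choose_device(queues, throughput, rng=random):
            """Pick the device for the next work block with probability proportional
            to throughput / (1 + queue length), so fast, idle devices attract more work."""
            scores = {d: throughput[d] / (1.0 + len(q)) for d, q in queues.items()}
            r = rng.uniform(0.0, sum(scores.values()))
            for device, s in scores.items():
                r -= s
                if r <= 0.0:
                    return device
            return device                       # fall back to the last device on round-off

        if __name__ == "__main__":
            queues = {"cpu": [], "gpu": []}
            throughput = {"cpu": 1.0, "gpu": 6.0}    # assumed relative block-processing rates
            for block in range(20):
                queues[choose_device(queues, throughput)].append(block)
            print({d: len(q) for d, q in queues.items()})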

  • Article (No Access)

    BaPipe: Balanced Pipeline Parallelism for DNN Training

    The size of deep neural networks (DNNs) grows rapidly as the complexity of machine learning algorithms increases. Distributed deep learning based on model parallelism has been widely used to satisfy the computation and memory requirements of DNN training. In this paper, we propose a training framework for pipeline parallelism called BaPipe (Balanced Pipeline) that automatically explores pipeline-parallel scheduling methods and balanced partition strategies for DNN training on heterogeneous accelerator clusters. In BaPipe, each accelerator calculates the forward and backward propagation for its assigned partition of the network, implementing an intra-batch pipeline parallelism strategy. By considering the parameters of the DNN model as well as the computation, memory, and communication resources of each accelerator, BaPipe automatically selects the most suitable pipeline scheduling method from among several proposed scheduling modes. It also uses a novel strategy to automatically investigate load balancing in the context of inter-layer partitioning, intra-layer partitioning, and coarse-grained partitioning. We trained DNNs such as VGG-16, ResNet-50, and Google's Neural Machine Translation (GNMT) on GPU clusters, and simulated the training performance of FPGA clusters. Compared with state-of-the-art data parallelism (DP) and pipeline parallelism frameworks, BaPipe provides a 3.2× speedup and a 4× memory reduction on various homogeneous and heterogeneous platforms.
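
    Inter-layer partitioning in its simplest homogeneous form reduces to splitting a sequence of per-layer costs into contiguous stages so that the slowest stage is as fast as possible. The classic dynamic program below is a simplified stand-in for that core step (the layer costs are hypothetical, and BaPipe's partitioner additionally models memory, communication and heterogeneous accelerators):

        from functools import lru_cache

        def balanced_stages(layer_costs, num_stages):
            """Split layer_costs into num_stages contiguous groups minimizing the
            maximum group sum (the pipeline's bottleneck stage time).
            Returns (bottleneck_time, list_of_stages)."""
            n = len(layer_costs)
            prefix = [0.0]
            for c in layer_costs:
                prefix.append(prefix[-1] + c)

            @lru_cache(maxsize=None)
            def best(i, k):
                # best (bottleneck, first_cut) for layers i..n-1 split into k stages
                if k == 1:
                    return prefix[n] - prefix[i], n
                result = (float("inf"), n)
                for cut in range(i + 1, n - k + 2):
                    head = prefix[cut] - prefix[i]
                    tail, _ = best(cut, k - 1)
                    result = min(result, (max(head, tail), cut))
                return result

            bottleneck, _ = best(0, num_stages)
            stages, i, k = [], 0, num_stages
            while k >= 1:
                _, cut = best(i, k)
                stages.append(layer_costs[i:cut])
                i, k = cut, k - 1
            return bottleneck, stages

        if __name__ == "__main__":
            costs = [4, 2, 7, 3, 3, 5, 1, 6]                 # hypothetical per-layer times
            print(balanced_stages(costs, num_stages=3))      # (13.0, [[4, 2], [7, 3, 3], [5, 1, 6]])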

  • Article (Free Access)

    Fast Local Rules Based Switching and Routing Within Multidimensional Torus Interconnect

    The multidimensional torus networking topology has recently become widespread in high-performance computers, clusters, and grids, as well as in networks on chip. A torus offers an ideal communication structure, with short distances and a multitude of alternative shortest paths between pairs of nodes. We study simple and powerful local packet forwarding (switching) rules that provide packet delivery with quasi-optimal load balancing and do not use tables of addresses (routes). Implementing these rules as micro-program code within switching nodes considerably increases network performance, security, and QoS. We use infinite Petri nets and reenterable models in the form of colored Petri nets to prototype the multidimensional torus interconnect simulator and to study and compare the packet forwarding rules. An ad hoc simulator of the torus interconnect is then implemented in the C language to provide high performance and allow simulation over long time intervals. The simulation results confirm the advantages of the local packet forwarding rules.
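
    The forwarding rules themselves are not listed in this abstract; the sketch below only conveys the flavour of a table-free local rule on a d-dimensional torus (a plain shortest-wrap-around, least-loaded-link heuristic written for illustration, with an assumed link_load map, rather than the rules studied in the paper).

        def next_hop(node, dest, sizes, link_load):
            """Table-free local forwarding on a d-dimensional torus.
            node, dest : coordinate tuples; sizes : torus extent per dimension.
            link_load  : dict mapping (node, neighbour) -> current queue length.
            Among all dimensions whose coordinate still differs from the destination,
            step in the shorter wrap-around direction, preferring the least-loaded link."""
            candidates = []
            for dim, k in enumerate(sizes):
                diff = (dest[dim] - node[dim]) % k
                if diff == 0:
                    continue                                  # this coordinate already matches
                step = 1 if diff <= k - diff else -1
                nbr = list(node)
                nbr[dim] = (node[dim] + step) % k
                nbr = tuple(nbr)
                candidates.append((link_load.get((node, nbr), 0), nbr))
            if not candidates:
                return node                                   # already at the destination
            return min(candidates)[1]

        if __name__ == "__main__":
            # One hop on a 4x4x4 torus with empty queues: the wrap-around link is taken
            # in dimension 0 because the distance that way is 1 instead of 3.
            print(next_hop((0, 1, 1), (3, 1, 1), sizes=(4, 4, 4), link_load={}))   # (3, 1, 1)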

  • Article (No Access)

    ON LOWER BOUNDS FOR THE COMMUNICATION VOLUME IN DISTRIBUTED SYSTEMS

    In this paper we derive a lower bound on the total communication volume when mapping arbitrary task graphs onto a distributed processor system. For a K-processor system, this lower bound can be computed from only the K (possibly) largest eigenvalues of the adjacency matrix of the task graph and the eigenvalues of the adjacency matrix of the processor graph. We also derive the eigenvalues of the adjacency matrix of the processor graph for a hypercube multiprocessor and illustrate the concept with a simple example for the two-processor case.
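
    The hypercube part is easy to reproduce: the adjacency matrix of the d-dimensional hypercube has eigenvalues d - 2i, each with multiplicity C(d, i). The short numpy check below is our own illustration of that spectrum; it does not reproduce the paper's lower-bound derivation.

        import numpy as np
        from math import comb

        def hypercube_adjacency(d):
            """Adjacency matrix of the d-dimensional hypercube on 2**d nodes:
            two nodes are adjacent iff their binary labels differ in exactly one bit."""
            n = 1 << d
            A = np.zeros((n, n))
            for u in range(n):
                for bit in range(d):
                    A[u, u ^ (1 << bit)] = 1
            return A

        if __name__ == "__main__":
            d = 4
            spectrum = np.round(np.linalg.eigvalsh(hypercube_adjacency(d))).astype(int)
            computed = {v: int((spectrum == v).sum()) for v in sorted(set(spectrum.tolist()))}
            expected = {d - 2 * i: comb(d, i) for i in range(d + 1)}
            print("computed:", computed)     # {-4: 1, -2: 4, 0: 6, 2: 4, 4: 1}
            print("expected:", expected)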

  • Article (No Access)

    OPTIMAL PARAMETERS FOR LOAD BALANCING WITH THE DIFFUSION METHOD IN MESH NETWORKS

    The diffusion method is a simple distributed load balancing method for distributed-memory multiprocessors. It operates in a relaxation fashion on point-to-point networks. Its convergence to the balanced state depends on the value of a parameter, the diffusion parameter, and an optimal diffusion parameter leads to the fastest convergence of the method. Optimal parameters were previously known for the k-ary n-cube and the torus. In this paper, we derive optimal diffusion parameters for mesh networks.
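
    To see what "optimal" means here: the diffusion iteration w <- w - alpha*L w, with L the Laplacian of the network, converges at a rate governed by the spectral radius of I - alpha*L on the non-constant components, and the best parameter minimizes that radius. The paper derives closed-form parameters for meshes; the sketch below merely searches numerically on a small mesh (the mesh size and search range are assumptions) and compares the result against the standard formula 2/(lambda_2 + lambda_max).

        import numpy as np

        def convergence_factor(L, alpha):
            """Spectral radius of I - alpha*L on the subspace orthogonal to the
            constant vector (the balanced state); smaller means faster balancing."""
            nonzero = np.linalg.eigvalsh(L)[1:]        # drop the zero eigenvalue
            return max(abs(1.0 - alpha * nonzero))

        def path_laplacian(k):
            A = np.zeros((k, k))
            for i in range(k - 1):
                A[i, i + 1] = A[i + 1, i] = 1
            return np.diag(A.sum(axis=1)) - A

        if __name__ == "__main__":
            # Laplacian of a 4x5 mesh as the Kronecker sum of two path Laplacians.
            L = np.kron(path_laplacian(4), np.eye(5)) + np.kron(np.eye(4), path_laplacian(5))
            alphas = np.linspace(0.05, 0.6, 2000)
            best = min(alphas, key=lambda a: convergence_factor(L, a))
            lam = np.linalg.eigvalsh(L)
            print("best alpha found by search  : %.4f" % best)
            print("2 / (lambda_2 + lambda_max) : %.4f" % (2.0 / (lam[1] + lam[-1])))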

  • Article (No Access)

    PARALLEL INCREMENTAL SCHEDULING

    Parallel incremental scheduling is a new approach to load balancing. In parallel scheduling, all processors cooperate to balance the workload, and the load is balanced accurately by using global load information. In incremental scheduling, the system's scheduling activity alternates with the underlying computation. This paper provides an overview of parallel incremental scheduling and discusses its different categories.